The objective of this contest is to use advanced analytics, text mining and visualization methods combined with domain knowledge, ontologies and frameworks to support and accelerate the process of drug discovery.
Drug discovery in the pharmaceutical industry is confronted with the challenges of high costs, increasing lead times, stricter regulation and high failure rates. Industry experience has shown that the high failure rate of drug development can be largely attributed to improper target selection. A target in the drug discovery process refers to a molecular entity that forms the basis of the drug. Research on target discovery relies primarily on experimental work involving wet-lab syntheses, together with molecular modeling and simulation techniques. This route can be error-prone and time-consuming.
The Business Opportunity - Text Mining to support Drug Discovery
We are witnessing an unprecedented "omics" era with the explosion of biological data and information. For example, the most popular biomedical literature database, MEDLINE/PubMed, currently contains more than 18 million literature abstracts, and more than 60,000 new abstracts are added monthly. The number of databases warehousing chemical, genomic, proteomic and metabolic data is rapidly growing with their size estimated to double every two years. This wealth of biological data and information presents immense new opportunities for target discovery in support of the drug discovery pipeline.
Text Mining of Patents
Patent documents contain important research that is valuable to the industry, business, law, and policy-making communities. Take the patent documents from the United States Patent and Trademark Office (USPTO) as an example. The structured data include filing date, application date, assignees, USPC (US Patent Classification) codes, IPC codes, and others, while the unstructured segments include the title, abstract, claims, and description of the invention. The description of the invention can be further segmented into field of the invention, background, summary, and detailed description.
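The split between structured and unstructured segments described above can be sketched as a small record type. This is only an illustration: the class name PatentRecord and all of its field names are hypothetical and do not reflect the layout of the contest's txt files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PatentRecord:
    """Hypothetical container for one patent; field names are assumptions."""
    # Structured data
    patent_number: str
    filing_date: str = ""
    assignees: List[str] = field(default_factory=list)
    uspc_codes: List[str] = field(default_factory=list)
    ipc_codes: List[str] = field(default_factory=list)
    # Unstructured segments
    title: str = ""
    abstract: str = ""
    claims: str = ""
    # Description split by section, e.g. "field", "background", "summary", "detailed"
    description: Dict[str, str] = field(default_factory=dict)

    def full_text(self) -> str:
        """Join the non-empty unstructured segments for text mining."""
        parts = [self.title, self.abstract, self.claims, *self.description.values()]
        return "\n".join(p for p in parts if p)


# Placeholder text only; the real title and abstract are not reproduced here.
record = PatentRecord(patent_number="5763483", title="Example title",
                      abstract="Example abstract")
print(record.full_text())
```

Keeping the structured fields separate from the free text makes it possible to filter by classification code first and run text mining only on the remaining documents.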
Given a set of "Source" patents or documents, we can use text mining to identify patents that are "similar" and "relevant" for the purpose of discovery of drug variants. These relevant patents could further be clustered and visualized appropriately to reveal implicit, previously unknown, and potentially useful patterns.
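One common way to make "similar" concrete is TF-IDF weighting with cosine similarity. The sketch below is a minimal, standard-library illustration of that idea, not the method prescribed by the contest. The function names, the tokenizer, the smoothed IDF formula and the toy texts are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of length >= 3."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Smoothed IDF (an assumed choice) keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(source_docs, candidate_docs):
    """Rank candidates by their best cosine similarity to any source document."""
    vectors = tfidf_vectors(source_docs + candidate_docs)
    src = vectors[:len(source_docs)]
    cand = vectors[len(source_docs):]
    scores = [max(cosine(c, s) for s in src) for c in cand]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy texts standing in for patent documents.
sources = ["neuraminidase inhibitor compound for influenza treatment"]
candidates = [
    "a neuraminidase inhibitor for treating influenza virus",
    "a bicycle frame with improved suspension",
]
for index, score in rank_candidates(sources, candidates):
    print(index, round(score, 3))
```

Scoring each candidate by its maximum similarity to any source document is one design choice among several. Averaging over the sources would instead favor candidates that resemble the source set as a whole.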
The eventual goal is to obtain a focused and relevant subset of patents, relationships and patterns to accelerate discovery of variations or evolutions of the drugs represented by the "source" patents.
Background of the Contest - Discovery of Variants of H1N1 Drugs to Combat Mutants
The neuraminidase inhibitor (NI) class of drugs (Oseltamivir and Zanamivir) was introduced in 1999-2000. These drugs are better known by their brand names, Tamiflu and Relenza. Since then, there has been a gradual increase in global use of these drugs to treat seasonal influenza A and B infections. Many countries acquired large stockpiles of Oseltamivir during pre-pandemic planning; the emergence of NI-resistant strains with a capacity to spread is therefore of concern.
Oseltamivir-resistant human H1N1 virus emerged globally during the 2007-2008 Northern Hemisphere winter season, with no evidence of drug exposure. This necessitated the discovery of the experimental drug Peramivir (2009).
For the purpose of this contest, the "source" drugs are:
- Tamiflu (Oseltamivir Phosphate)
- Relenza (zanamivir)
The relevant patent numbers (the "source" patents) are:
- Tamiflu - 5763483, 5866601, 5952375
- Relenza - 5360817, 5648379, 6294572
- Peramivir - 8101745, 8080562, 8067426, 8062864, 8058069, 8026392, 7999001, 7981930, 7977344, 7935340, 7919454, 7906117, 7905852, 7893272, 7879028, 7858660, 7816366, 7682356, 7507546, 7208176, 6955888
Influenza A virus subtype H1N1 is also referred to as swine flu (A/Mexico/09). This section gives a brief overview of the structure of the virus as domain knowledge, which should be helpful for Part 1 of the contest. The influenza virus is roughly spherical and has an outer envelope membrane taken from the host cell in which the virus multiplies. On the surface of the membrane are "spikes", which are proteins that determine the subtype of the influenza strain. Influenza A viruses are subtyped according to two kinds of surface protein:
- H = hemagglutinin
- N = neuraminidase
Different influenza viruses encode different hemagglutinin and neuraminidase proteins, for example H1N1 or H5N1. There are 17 known types of hemagglutinin and 9 known types of neuraminidase, so in theory there can be 17 x 9 = 153 different combinations of these proteins. H1N1 is currently pandemic in both human and pig populations.
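The count of theoretical subtypes follows directly from pairing every H type with every N type. A short sketch, assuming the types are simply numbered H1 to H17 and N1 to N9:

```python
from itertools import product

H_TYPES = range(1, 18)  # H1 .. H17
N_TYPES = range(1, 10)  # N1 .. N9

# Every pairing of one hemagglutinin type with one neuraminidase type.
subtypes = [f"H{h}N{n}" for h, n in product(H_TYPES, N_TYPES)]

print(len(subtypes))  # 153
```

A list like this could also serve as a lookup vocabulary when scanning patent text for mentions of specific subtypes.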
The H and N numbers are important in the immune response against the virus; antibodies against these spikes may protect against infection. The NA protein is the target of the antiviral drugs Relenza and Tamiflu.
Beneath the lipid membrane is a viral protein called M1, or matrix protein, which gives strength and rigidity to the virus. Inside it is the genetic information of the virus, i.e. the viral RNAs. Each RNA segment is also associated with various proteins such as PB1, PB2, PA and NP.
The goal of this contest is to perform data exploration / text mining on the set of patents and patent applications provided (about 22,000 in number), using the 27 "source" patents as a "lens", to come up with insights, patterns and clusters that yield a subset of patents on which to focus drug discovery efforts. Since the "source" patents belong to drugs that treat H1N1, we hope to arrive at a short list of patents or patent applications, from the list of 22,000, that could provide clues for discovering or synthesizing drugs to combat mutants of H1N1.
Solvers who want to crack this contest need to focus on the following, beyond regular text mining:
- incorporating domain knowledge into the ontology and methodology. An H1N1-specific ontology and taxonomy will be very useful.
- modeling and advanced visualization of the outputs
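As a rough illustration of how domain knowledge could enter the pipeline, the sketch below tags text with concepts from a tiny hand-written ontology. The concept names, their grouping and the matching rule are all hypothetical. A real H1N1 ontology would be far richer and would come out of Part 1 of the contest.

```python
import re
from collections import Counter

# Hypothetical mini-ontology: concept -> surface terms that signal it.
ONTOLOGY = {
    "neuraminidase_inhibitor": [
        "oseltamivir", "tamiflu", "zanamivir", "relenza", "peramivir",
        "neuraminidase inhibitor",
    ],
    "surface_protein": ["hemagglutinin", "neuraminidase"],
    "virus": ["h1n1", "influenza", "swine flu"],
}


def concept_counts(text, ontology=ONTOLOGY):
    """Count whole-word, case-insensitive matches of each concept's terms."""
    lowered = text.lower()
    counts = Counter()
    for concept, terms in ontology.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    return counts


sample = "Oseltamivir is a neuraminidase inhibitor active against influenza H1N1."
print(concept_counts(sample))
```

Note that the phrase "neuraminidase inhibitor" counts towards both the drug concept and the surface protein concept. Whether such overlaps are desirable is a modeling decision for the ontology designer.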
- July 19, 2012 - Start of the Contest Part 1
- August 23, 2012 - Deadline for Submission of Ontology deliverables
- August 24 to August 29, 2012 - Crowdsourced and Expert Evaluation for Part 1. NO SUBMISSIONS ACCEPTED for contest during this week.
- August 30, 2012 - Winner for Part 1 contest announced and Ontology released to the community for Contest Part 2
- Aug. 31 to Sept. 21, 2012 - Contest Part 2 Begins - Data Exploration / Text Mining of Patent Data
- Sept. 21, 2012 - Deadline for Submission Contest Part 2. FULL CONTEST CLOSING.
- Sept. 22 to Oct. 5, 2012 - Crowdsourced and Expert Evaluation for contest Part 2
- Oct. 5, 2012 - Conditional Winners Announcement
For Evaluation (see Criterion for more details)
- Insights substantiated by credible domain knowledge, ontologies and frameworks
- A findings summary
For Replication after award of conditional prize money:
- Code, models, and everything else needed to run this in our labs and replicate the results
These data sets have been obtained by crawling data available in the public domain. The data sets contain:
- 27 text files - these are the "source" patents pertaining to the drugs Tamiflu (Oseltamivir Phosphate), Relenza (zanamivir) and Peramivir. These are important, because these are the reference patents. You need to discover patents that are similar to these patents.
- About 22,000 patents, in txt format, grouped into about 30 zip files. These need to be downloaded to your local environment. These patents and patent applications were identified by looking at all relevant categories of patents, as organized by the USPTO, and then downloading all the patents in those categories.
- The URL above also provides HTML versions of the patents; these are for the purpose of viewing only. Please use only the txt files for performing text mining.
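Once the zip files are downloaded, the txt files can be read without unpacking everything to disk. The following is a minimal sketch using Python's standard zipfile module. The function name, the directory name patent_zips and the UTF-8 decoding are assumptions, since the archive layout and file encoding are not specified here.

```python
import zipfile
from pathlib import Path


def iter_patent_texts(zip_dir):
    """Yield (member name, text) for every .txt file in every .zip in zip_dir."""
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                # Skip HTML versions and anything else that is not a txt file.
                if not name.lower().endswith(".txt"):
                    continue
                raw = archive.read(name)
                # Encoding is assumed; undecodable bytes are replaced.
                yield name, raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "patent_zips" is a hypothetical local download directory.
    for member, text in iter_patent_texts("patent_zips"):
        print(member, len(text))
```

Because the function is a generator, the roughly 22,000 documents are streamed one at a time rather than held in memory together.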
Prize Pool - $4000
- 1st Prize $1000
- 2nd Prize $600
- 3rd Prize $400
- 1st Prize $1000
- 2nd Prize $600
- 3rd Prize $400
Benefits to Solvers
- The learning from reviewing everyone's approach and methodology
- The opportunity to participate in future "invitation only" private contests. (Future contests will likely be larger and more complex, involve bigger data, and give preference to solvers with a track record.)