SAR Trek

In a previous blog post we highlighted the day-to-day informatics problems facing IT/IS staff and researchers in biopharma companies as they struggle to discover and develop better drugs faster and more cheaply. Key among these was the challenge of dealing with the data deluge – more complex data at greater volumes, in multiple formats and often stored in disparate internal and external data silos. As Jerry Karabelas, former head of R&D at Novartis, quipped in an updated and repurposed phrase from Coleridge, “Data, data everywhere, and not a drug I think.” That was at the turn of the 21st century, and things have only got worse since then, with the term “big data” now getting over four million hits in Google.

Typical therapeutic research projects continually generate and amass data – chemical structures; sequences; formulations; primary, secondary, and high-content assay results; observed and predicted physicochemical properties; DMPK study results; instrument data; sample genealogy and provenance; progress reports, etc. – and researchers are then charged with the responsibility of making sense of all the data, to advance and explore hypotheses, deduce insights, and decide which compounds and entities to pursue, which formulations or growth conditions to optimize, and which to drop or shelve. 

A usual first step will be to collect together all the relevant data and get it into a form that is amenable to further searching and refinement: but this poses a potentially challenging set of questions – what data exists, where is it, what format is it in, and how much of it is there? Answering these questions may then be complicated if the data resides in different, possibly disconnected, potentially legacy systems: e.g. chemical structures in an aging corporate registry, sequences in a newer system, assay results in another database, DMPK values buried inside an electronic lab notebook, and instrument data in an unconnected LIMS or LES. 

So the researcher is faced with knowing:

(a) Which systems exist, where they are located, and what they contain, 

(b) How to search each of them to find the required data, 

(c) How to extract the desired information from each source in the correct usable format, and 

(d) How to meld or mash-up these various disparate data sets to generate a project corpus for further analysis and refinement. 

Even with those answers in hand, they are still likely to be frustrated along the way by things like different query input paradigms (e.g. pre-designed and inflexible search forms or the need to write SQL queries), slow search response times, and either too many or too few results to generate a tractable data set. If they opt to start with an overlarge hit list, they can try to whittle the list down by tightening up their search parameters, or by locating and subtracting items with undesirable properties, but in most cases they will be faced with a slew of somewhat different hit files which then need to be sensibly merged through a sequence of cumbersome list logic operations (e.g. intersect the pyrrolidine substructure search compounds with the bioassay IC50 < 0.5 nanomolar hits, and then see if any of those match the required physicochemical property and DMPK profiles in the third and fourth files). This trial-and-error approach is inefficient, unpredictable, potentially unreliable, and time-consuming.
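To make the list logic concrete, here is a minimal sketch in Python of what the researcher is effectively doing by hand. Everything in it is an assumption for illustration: the compound IDs are made up, the four hit lists stand in for exported hit files, and the helper names `intersect_hit_lists` and `subtract_hit_list` are hypothetical rather than part of any product or library.

```python
# Hypothetical hit lists: compound IDs returned by four separate searches.
# In practice each would be read from a hit file exported by a different system.
substructure_hits = {"CPD-001", "CPD-002", "CPD-003", "CPD-005"}  # pyrrolidine substructure
potency_hits = {"CPD-002", "CPD-003", "CPD-004", "CPD-005"}       # IC50 < 0.5 nM
physchem_hits = {"CPD-001", "CPD-003", "CPD-005"}                 # property profile matches
dmpk_hits = {"CPD-003", "CPD-005", "CPD-006"}                     # DMPK profile matches


def intersect_hit_lists(*hit_lists):
    """Return the IDs present in every hit list, sorted for stable display."""
    if not hit_lists:
        return []
    common = set(hit_lists[0])
    for hits in hit_lists[1:]:
        common &= set(hits)  # keep only IDs also found in this list
    return sorted(common)


def subtract_hit_list(hits, unwanted):
    """Return the IDs in `hits` that do not appear in `unwanted`, sorted."""
    return sorted(set(hits) - set(unwanted))


# Intersect all four lists to find compounds that satisfy every criterion.
shortlist = intersect_hit_lists(
    substructure_hits, potency_hits, physchem_hits, dmpk_hits
)
print(shortlist)  # ['CPD-003', 'CPD-005']

# Subtract a (hypothetical) list of compounds with an undesirable property.
flagged = {"CPD-005"}
print(subtract_hit_list(shortlist, flagged))  # ['CPD-003']
```

The operations themselves are trivial; the pain described above comes from having to export, reformat, and reconcile identifiers across several systems before logic like this can be applied at all.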

Fortunately, modern systems such as PerkinElmer Signals™ Lead Discovery are now available to overcome these challenges and to equip scientists with efficient tools to rapidly locate and assemble accurate, comprehensive and workable data sets for detailed and scientifically intelligent refinement and analysis. Prerequisites include a future-proof, flexible, and extensible underlying informatics infrastructure and platform that can intelligently and flexibly handle and stage all types of R&D data, now and in the future, data or text, structured or unstructured, internal or external. Establishing an informatics platform like this and making all the relevant data instantly accessible to researchers removes the data wrangling challenges (a – d) discussed above and delivers immediate productivity benefits and better outcomes, as researchers are free to focus on science rather than software.

Rather than struggling to remember where data is located, and how to search it, scientists can now be intelligently guided. Signals Lead Discovery lists and clearly presents the relevant available data (including internal and external sources) and offers simple and consistent yet flexible searching paradigms to query the underlying content. Modern indexing techniques (including blazing fast, patent-pending, no-SQL chemical searching, and a full range of biological sequence searching tools) ensure rapid response times to searches with immediate feedback to see whether a query is delivering the required number of hits. Intuitive views of the data in tables and forms with advanced display capabilities built on Spotfire also give immediate visual feedback about the evolving content of a hit set as it is refined, and data drill down is always available to get a more granular view of the underlying data. 

Once the researcher has adequately shaped and refined a data set to contain all the required relevant data, it is then immediately available for further detailed analysis and visualization, using Signals Lead Discovery’s powerful set of built-in workflows and tools, or via RESTful APIs with external and third-party tools. This downstream analysis and visualization will be the subject of future blog posts in this series. This video shows how guided search and analytics can power your SAR analysis quickly and effectively.

Light at the end of the lead discovery tunnel?

Drug discovery is hard (nine out of ten drug candidates fail), time-consuming (typically 10–15 years), and expensive (Tufts’ 2016 estimate: $2.87Bn). But things are getting better, right? In 2017, although the EMA approved only 35 new active substances, FDA drug approvals hit a 21-year high, with 46 new molecular entities approved, the highest number since 1996. This was a mix of 29 small molecules and, demonstrating their increasing therapeutic importance, 17 biologics (nine antibodies, five peptides, two enzymes, and an antibody-drug conjugate). But of the 46 approvals, the FDA counted only 33% as new classes of compound, so the others must be from older classes of compound, which probably entered the R&D pipeline 15–20 years ago.

Is this bumper crop of 2017 new approvals some reflection of major advances in drug discovery techniques and technology that primed the R&D pipeline at the turn of the century? Or is it just an artifact of the FDA approval process and timeline? Hard to say either way, but in the long game of drug development, scientists and researchers will be keen to jump on any improvements that can be made now. 

What contributes to the threefold challenge that makes drug discovery and development hard, time-consuming and expensive? Surely the plethora of “latest things” – personalized and translational medicine, biomarkers, the cloud, AI, NLP, CRISPR, data lakes, etc. – will lead to better drugs sooner and more cheaply? At the highest level, probably; but down in the trenches researchers and their IT and data scientist colleagues are engaged in an ever-increasing daily struggle to develop and run more complex assays, to capture and manage larger volumes of variable and disparate data, and to handle a mix of small molecule and biologic entities; then to make sense of this data deluge and draw conclusions and insights; and often to do all this with inflexible and hard-to-maintain home-grown or legacy systems that can no longer keep pace.

Let’s look at some of these challenges in more detail.

The Sneakernet

Informatics systems built on traditional RDBMS require expensive DB operators just to keep them functioning, and much time and budget has to be devoted to fixing issues and keeping up with software and system upgrades. This leaves little or no time to make enhancements or to adjust the system to incorporate a new assay or to manage and index a novel data type. As a result, IT staff are slow to make even the simplest requested change, which may spur researchers to go rogue and revert to using spreadsheets and sneakernet to capture and share data.

The Data Scientist’s inbox

Organizing and indexing the variety and volume of data and datatypes generated in modern drug discovery research is an ongoing challenge. Scientists want timely and complete access to the data, with reasonable response times to searches, and easy-to-use display forms and tables. 

Older legacy informatics systems did a reasonable job of capturing, indexing, linking and presenting basic chemistry, physical properties and bioassay structured data, but at the cost of devising, setting up, and maintaining an unwieldy array of underlying files and forms. Extending a bioassay to capture additional data, reading in a completely new instrument data file, or linking two previously disconnected data elements all require modifications to the underlying data schema and forms, and add to the growing backlog of unaddressed enhancement tasks in the data scientist’s inbox.

In addition to managing well-structured data, scientists increasingly want combined access to non-structured data such as text contained in comments or written reports, and legacy systems have very limited capabilities to incorporate and index such material in a usable way, so that potentially valuable information is ignored when making decisions or drawing insights.

Lack of tools for meaningful exploration

Faced with the research data deluge, scientists want to get to just the right data in the right format, and with the right tools on hand for visualization and analysis. But the challenge is to know what data exists, where, and in what format. Legacy systems often provide data catalogs to help find what is available, and offer simple, brute-force search tools, but often response times are not adequate, and hit lists contain far too few or too many results to be useful. Iterative searches may help to focus a hit set on a lead series or assay type of interest, but often the searcher is left trying to make sense of a series of slightly different hit lists by using cumbersome list logic operations to arrive at the correct intersection list that has all the specified substructure/dose response/physical property range parameters.

Once a tractable hit set is available, the researcher is then challenged to locate and use the appropriate tools to explore structure activity relationships (SARs), develop and test hypotheses, and identify promising candidates for more detailed evaluation. Such tools are often hard to find, and each may come with its own idiosyncratic user interface, with a steep and challenging learning curve. Time is also spent designing and tweaking display forms to present the data in the best way, and every change slows down decision making. Knowing which tools and forms to use, in what order, and on which sets of data can be frustrating, and lead to incomplete or misleading analyses or conclusions. 

In the area of SAR and bioSAR, underlying chemical structural and biosequence intelligence are key requirements for meaningful exploration and analysis. These are often available only in separate and distinct applications with different user interfaces, when ideally they should be accessible through a unified chemistry/biosequence search and display application, supported by a full range of substructure and sequence analysis and display tools.

R&D Management

Lab, section, and therapeutic area managers are all challenged to help discover, develop, and deliver better drugs faster and more cheaply. They want their R&D teams to be working at peak efficiency, with the best tools available to meet current and future demands. This first requires the foundation of a future-proof, flexible, and extensible platform. Next, any system built on the platform must be able to intelligently and flexibly handle all types of R&D data, now and in the future, structured or unstructured. Research scientists can then exploit this well-managed data with tools that guide them through effective and timely search and retrieval; analysis workflows; and advanced SAR visual analytics. This will lead to better science and faster insights to action. 
