In a previous blog post we highlighted the day-to-day informatics problems facing IT/IS staff and researchers in biopharma companies as they struggle to discover and develop better drugs faster and more cheaply. Key among these was the challenge of dealing with the data deluge – more complex data at greater volumes, in multiple formats, and often stored in disparate internal and external data silos. As Jerry Karabelas, former head of R&D at Novartis, quipped in an updated and repurposed phrase from Coleridge, “Data, data everywhere, and not a drug I think.” That was at the turn of the 21st century, and things have only got worse since then, with the term “big data” now getting over four million hits in Google.
Typical therapeutic research projects continually generate and amass data – chemical structures; sequences; formulations; primary, secondary, and high-content assay results; observed and predicted physicochemical properties; DMPK study results; instrument data; sample genealogy and provenance; progress reports, etc. – and researchers are then charged with the responsibility of making sense of all the data, to advance and explore hypotheses, deduce insights, and decide which compounds and entities to pursue, which formulations or growth conditions to optimize, and which to drop or shelve.
A usual first step is to collect all the relevant data and get it into a form that is amenable to further searching and refinement, but this poses a potentially challenging set of questions: what data exists, where is it, what format is it in, and how much of it is there? Answering these questions is harder still if the data resides in different, possibly disconnected, potentially legacy systems: e.g. chemical structures in an aging corporate registry, sequences in a newer system, assay results in another database, DMPK values buried inside an electronic lab notebook, and instrument data in an unconnected LIMS or LES.
So the researcher is faced with knowing:
(a) Which systems exist, where they are located, and what they contain,
(b) How to search each of them to find the required data,
(c) How to extract the desired information from each source in the correct usable format, and
(d) How to meld or mash-up these various disparate data sets to generate a project corpus for further analysis and refinement.
Even with all of that knowledge, they are still likely to be frustrated along the way by things like different query input paradigms (e.g. pre-designed and inflexible search forms, or the need to write SQL queries), slow search response times, and either too many or too few results to generate a tractable data set. If they opt to start with an overlarge hit list, they can try to whittle it down by tightening up their search parameters, or by locating and subtracting items with undesirable properties. In most cases, though, they will be faced with a slew of somewhat different hit files which then need to be sensibly merged through a sequence of cumbersome list logic operations (e.g. intersect the pyrrolidine substructure search compounds with the bioassay IC50 < 0.5 nanomolar hits, and then see if any of those match the required physicochemical property and DMPK profiles in the third and fourth files). This trial-and-error approach is inefficient, unpredictable, potentially unreliable, and time-consuming.
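To make that list logic concrete, here is a minimal Python sketch of the manual merge described above. It is only an illustration: the compound IDs, IC50 values, and helper names (`potent`, `merge_hit_lists`) are invented for the example, and it assumes each hit file has already been exported and reduced to a set of compound identifiers, which is often the hardest part in practice.

```python
# Hypothetical hit lists, one per source system (all IDs and values are made up).
# Hits from a pyrrolidine substructure search in the chemical registry:
substructure_hits = {"CPD-001", "CPD-002", "CPD-003", "CPD-005"}

# Bioassay results: compound ID -> IC50 in nanomolar.
ic50_nm = {"CPD-001": 0.2, "CPD-002": 3.5, "CPD-003": 0.4, "CPD-004": 0.1}

# Compounds matching the required physicochemical and DMPK profiles
# (the "third and fourth files").
physchem_ok = {"CPD-001", "CPD-003", "CPD-004"}
dmpk_ok = {"CPD-001", "CPD-004", "CPD-005"}


def potent(ic50_by_id, threshold_nm=0.5):
    """Return the IDs whose IC50 is strictly below the threshold (in nM)."""
    return {cid for cid, value in ic50_by_id.items() if value < threshold_nm}


def merge_hit_lists(*hit_lists):
    """Intersect any number of hit lists; an empty call yields an empty set."""
    if not hit_lists:
        return set()
    return set.intersection(*(set(hits) for hits in hit_lists))


potent_hits = potent(ic50_nm)
candidates = merge_hit_lists(substructure_hits, potent_hits, physchem_ok, dmpk_ok)
print(sorted(candidates))  # ['CPD-001']
```

Even in this toy case, four separate lists collapse to a single compound, and every change to a search parameter means re-exporting a file and repeating the merge by hand.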
Fortunately, modern systems such as PerkinElmer Signals™ Lead Discovery are now available to overcome these challenges and to equip scientists with efficient tools to rapidly locate and assemble accurate, comprehensive, and workable data sets for detailed and scientifically intelligent refinement and analysis. The prerequisite is a future-proof, flexible, and extensible underlying informatics platform that can handle and stage all types of R&D data, now and in the future: data or text, structured or unstructured, internal or external. Establishing a platform like this and making all the relevant data instantly accessible to researchers removes the data wrangling challenges (a) to (d) discussed above, and delivers immediate gains in productivity and outcomes as researchers are free to focus on science rather than software.
Rather than struggling to remember where data is located and how to search it, scientists can now be intelligently guided. Signals Lead Discovery lists and clearly presents the relevant available data (including internal and external sources) and offers simple and consistent yet flexible searching paradigms to query the underlying content. Modern indexing techniques (including blazing fast, patent-pending, no-SQL chemical searching, and a full range of biological sequence searching tools) ensure rapid response times, with immediate feedback on whether a query is delivering the required number of hits. Intuitive views of the data in tables and forms, with advanced display capabilities built on Spotfire, also give immediate visual feedback about the evolving content of a hit set as it is refined, and drill-down is always available to get a more granular view of the underlying data.
Once the researcher has adequately shaped and refined a data set to contain all the required relevant data, it is then immediately available for further detailed analysis and visualization, using Signals Lead Discovery’s powerful set of built-in workflows and tools, or via RESTful APIs with external and third-party tools. This downstream analysis and visualization will be the subject of future blog posts in this series. This video shows how guided search and analytics can power your SAR analysis quickly and effectively.