With the amount of data currently being generated, we are in a unique position to find diagnoses and treatments for a multitude of diseases. However, the progress of lab technologies in generating data is now beset by a different kind of challenge: making sense of the copious amounts of immensely heterogeneous data.
In the 2010 paper “The $1,000 genome, the $100,000 analysis?”, the author rightly points out that regardless of how cheap human genome sequencing gets, ‘clinical grade’ interpretation and analysis must be developed to make coherent clinical sense of the data. However, as mentioned in one of our previous blogs (Beyond Genomics: Translational Medicine Goes Data Mining), genomic data cannot work in isolation within a biological context, and integrating knowledge from different biological silos is the next big challenge. The clinical utility of all this data will be determined by our ability to mine it appropriately by addressing some very critical pain points in the data life cycle, briefly discussed below:
1) Collaborative Data Storage & Security
As demand for computational resources grows, cloud computing is becoming increasingly important in the development and execution of large-scale biological data analyses. Its scalability on demand is an attractive option, especially for multisite collaborative research projects. Healthcare data is subject to regulations that other industry sectors may not have to contend with, such as requirements that data be stored in on-premise private data centers. However, cloud security now matches or surpasses the security measures at most private data centers, and cloud solutions are well positioned to act as data-security aggregators. Furthermore, on-premise solutions cannot provide the same level of scalability as cloud computing without significantly increasing infrastructure costs. Combined with the multisite collaborative research projects common in healthcare, this makes cloud solutions an attractive, scalable option. This article can help you assess whether an on-prem or cloud solution is better suited for your needs.
2) Facilitating rapid transfer and data processing: Support for Distributed Research
Tools that allow for processing and storage of extremely large data sets in a distributed computing environment are a foundation for Big Data processing tasks. These include the Hadoop Distributed File System (HDFS) and Spark. HDFS facilitates rapid data transfer rates among nodes and, through replication, drastically lowers the risk of data loss from system failure, whereas Spark can process data from a variety of data repositories including HDFS, NoSQL databases, and relational data stores (e.g. Apache Hive). These technologies help organizations move away from traditional data warehouses towards a data lake, where data can be stored in its original structure.
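The split-process-aggregate model that Spark generalizes across a cluster can be sketched in plain Python. The toy example below partitions variant records into chunks, counts them per chunk in parallel workers, and merges the partial results; the record layout and field names are illustrative assumptions, and real deployments would distribute the chunks across HDFS blocks and cluster nodes rather than local threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_variants(chunk):
    """Map step: count variant IDs within one partition of records."""
    return Counter(record["variant"] for record in chunk)

def merge_counts(partials):
    """Reduce step: aggregate per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Toy records standing in for data spread across HDFS blocks.
    records = [{"variant": v} for v in ["rs123", "rs456", "rs123", "rs789", "rs123"]]
    # Partition the records, as a distributed file system would.
    chunks = [records[i::3] for i in range(3)]
    # Local worker threads stand in for cluster executors here.
    with ThreadPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(count_variants, chunks))
    print(merge_counts(partials))
```

The same map/reduce shape is what a Spark job expresses declaratively; the framework then handles partitioning, scheduling, and fault tolerance across the cluster.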
3) Access to public or legacy databases: Data Type Flexibility
A large amount of data currently sits in public databases, and integrating it with your own data can be of utmost importance. Any Big Data platform for life sciences data needs to deploy technologies that allow seamless access to data stored in resources such as the Gene Expression Omnibus (GEO), tranSMART, and OHDSI, to name a few.
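As one small illustration of programmatic access, GEO publishes series matrix files on the NCBI FTP site in directories grouped by blocks of one thousand accessions. The helper below is a sketch that assumes that public layout; it only constructs the download URL, which could then be fetched with `urllib.request` or similar.

```python
def geo_series_matrix_url(accession):
    """Build the NCBI FTP URL for a GEO series matrix file.

    GEO groups series directories by thousands, e.g. GSE12345 lives
    under GSE12nnn. This assumes the standard public FTP layout.
    """
    num = accession[3:]                 # digits after the "GSE" prefix
    stub = "GSE" + num[:-3] + "nnn"     # replace the last 3 digits with "nnn"
    return ("https://ftp.ncbi.nlm.nih.gov/geo/series/"
            f"{stub}/{accession}/matrix/{accession}_series_matrix.txt.gz")
```

For example, `geo_series_matrix_url("GSE12345")` points into the `GSE12nnn` directory; dedicated clients wrap this kind of convention behind a friendlier API.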
4) Searching and analyzing data in real time: Accessible Data
The ability to search and query data quickly is a critical step in the data lifecycle. Tools such as Elasticsearch vastly improve the ability to query and mine your data by searching an index instead of scanning the text directly. This allows for a seamless flow of information from the data lake to the user.
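The idea of searching an index rather than the raw text can be illustrated with a minimal inverted index in plain Python. This is a toy sketch of the principle, not Elasticsearch's actual implementation: each token maps to the set of documents containing it, so a query becomes a few set intersections instead of a full scan.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())  # intersect per token
    return results

# Example: two toy documents with made-up contents.
idx = build_index({1: "BRCA1 variant pathogenic", 2: "TP53 variant benign"})
```

Here `search(idx, "variant pathogenic")` touches only two small sets rather than re-reading every document, which is why index-based engines stay fast as the corpus grows.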
5) Enriching or curating your data: Information Intelligence
A comprehensive environment should further integrate tools that enrich data by adding context for deeper and more meaningful integration of data from different sources. Tools such as Attivio achieve precisely this by semantically enriching the data across structured and unstructured silos and thereby making the eventual analysis more powerful.
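A stripped-down version of this kind of enrichment maps free-text synonyms onto canonical concepts from a controlled vocabulary, so records from different silos can be joined on a shared term. The mini-ontology and field names below are entirely hypothetical; production systems use full ontologies (e.g. SNOMED CT or MeSH) and far more sophisticated matching.

```python
import re

# Hypothetical mini-ontology mapping synonyms to canonical concepts.
ONTOLOGY = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "high blood pressure": "hypertension",
}

def enrich(record, text_field="notes"):
    """Annotate a record with canonical concepts found in its free text."""
    text = record.get(text_field, "").lower()
    concepts = sorted({
        concept
        for term, concept in ONTOLOGY.items()
        # Word-boundary match so "mi" does not fire inside "admitted".
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    })
    return {**record, "concepts": concepts}
```

After enrichment, a clinical note mentioning "high blood pressure" and a structured record coded as "hypertension" share the same concept tag, which is what makes cross-silo analysis more powerful.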
6) User-friendly Advanced Data Exploration Applications
Providing the end user with a state-of-the-art, user-friendly workflow is a two-tiered challenge. First, the ability to reuse analytics workflows for reproducible analysis of biomedical data is becoming increasingly important. Second, visually aided data exploration is an important component of combining scientific data and disseminating complex knowledge. The ability to interact with data, slicing and dicing it in different ways within a reproducible analytics workflow, can help end users identify unexpected patterns and further refine their hypotheses. Visual data analytics platforms such as TIBCO Spotfire® allow self-service access to all relevant data, letting end users take an exploratory approach to their data and make informed decisions based not just on interactive dashboards but on best-in-class statistical analysis.
The large-scale nature of biological data means that we need an agile, integrated environment that implements the right tools to tackle data storage, management, integration, and eventual analysis. All of the components of the data life cycle need to work in sync and in an optimal manner to enable end users to make real-time decisions in a scalable and informed way. In subsequent blogs, we intend to tackle each of these pain points in detail and show how a turnkey solution can be created for Translational Medicine applications.
Want to learn how you can configure your data analytics workflow to address the critical needs of your Precision Medicine research? Join David John for a dedicated webinar on April 24th.