A key to successful personalized medicine is knowledge of a person’s genetic profile. Next-generation sequencing (NGS), which has seen a dramatic drop in cost, plays a vital role here, thanks to its ability to sequence an entire genome (e.g., the human genome). DNA sequencers in the life sciences can generate a terabyte (one trillion bytes) of data per minute, according to Wu Feng of the Dept. of Computer Science in the College of Engineering at Virginia Tech (Blacksburg, Va.).

Feng says that the size of DNA sequence databases will increase 10-fold every 18 months. Moore’s Law, however, implies that processor performance only doubles roughly every 24 months. That means the rate of data generation is far outstripping a processor’s ability to compute on it.
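To see how quickly those two growth rates diverge, here is a small illustration (my own arithmetic, not from the article) that compounds each rate over the same span of time:

```python
# Compare data growth (10x every 18 months, per Feng) with
# Moore's-law compute growth (2x every 24 months) over several years.

def growth_factor(per_period: float, period_months: float, months: float) -> float:
    """Compound growth after `months`, given a fixed factor per period."""
    return per_period ** (months / period_months)

years = 6
months = years * 12
data = growth_factor(10, 18, months)     # sequence-database growth
compute = growth_factor(2, 24, months)   # processor-throughput growth
print(f"After {years} years: data x{data:,.0f}, compute x{compute:.0f}, "
      f"gap x{data / compute:,.0f}")
# After 6 years, data has grown 10,000-fold while compute has grown
# only 8-fold -- a gap of more than three orders of magnitude.
```

The exponents make the mismatch stark: the gap itself grows exponentially, which is why software-level approaches like cloud-based data management become necessary.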

Tackling that challenge, Feng led a research team that, after two years, has created a new generation of efficient data-management and analysis software for large-scale, data-intensive scientific applications “in the cloud.” Cloud computing refers to a large number of globally connected computers that can run a program simultaneously at large scale. The team developed two software-based research artifacts, called SeqInCloud and CloudFlow.

SeqInCloud, combined with Microsoft’s Azure cloud-computing platform and infrastructure, provides a portable cloud solution for next-generation sequence analysis. It optimizes data management (e.g., data partitioning and data transfer) to improve performance and make better use of cloud resources.

CloudFlow is the team’s scaffolding for managing workflows such as SeqInCloud. A researcher can install the software to construct pipelines that use client and cloud resources simultaneously to run a pipeline, with data transfers automated along the way, says Feng.
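The core idea of such a workflow manager can be sketched as chaining stages that each run either on the client or in the cloud, with data movement handled automatically whenever execution crosses that boundary. The sketch below is purely illustrative; none of these names or interfaces come from CloudFlow’s actual API:

```python
# Hypothetical sketch of the pipeline idea behind a tool like CloudFlow:
# stages run on "client" or "cloud" resources, and the runner automates
# the data transfer whenever the execution location changes.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    location: str                    # "client" or "cloud" (assumed labels)
    run: Callable[[bytes], bytes]    # the stage's processing step

def run_pipeline(stages: List[Stage], data: bytes) -> bytes:
    """Run stages in order; 'transfer' data whenever the location changes."""
    where = "client"
    for stage in stages:
        if stage.location != where:
            # A real system would upload/download here; this is a placeholder.
            print(f"transferring {len(data)} bytes to {stage.location}")
            where = stage.location
        data = stage.run(data)
    return data

# Example: filter reads locally, then hand off to a cloud-side step.
pipeline = [
    Stage("quality-filter", "client", lambda d: d.replace(b"n", b"")),
    Stage("align", "cloud", lambda d: d.upper()),
]
print(run_pipeline(pipeline, b"acgtnacgt"))  # prints b'ACGTACGT'
```

The point of the sketch is the automated handoff: the researcher declares where each stage should run, and the framework, rather than the user, moves the data.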

Feng announced the research development in October at the O’Reilly Strata Conference + Hadoop World, held in New York, N.Y.