A Cancer Data Pipeline in the Clouds

By Rob Mitchum // February 17, 2014

Over the decade since the completion of the Human Genome Project, next-generation sequencing has spurred the field of genomics to a faster and faster pace. Laboratories studying the genetics of disease can gather detailed data from more patients at a cheaper price than ever before, bringing scientists closer to new treatments and realizing the vision of personalized medicine. But even as the speed of sequencing shifts into a higher gear, other research tasks lag behind, producing unnecessary drag that prevents the science from truly taking flight.

Globus Genomics is a new cloud-based platform designed to eliminate these hindrances, combining data management tools, elastic computation and a graphical workflow environment. In one of the first Globus Genomics pilot projects, the platform helped the laboratory of Kenan Onel, associate professor of pediatrics and director of the Familial Cancer Clinic at University of Chicago Medicine, tackle the growing data challenges involved in studying the genetics of cancer.


The six researchers in the Onel lab search for genetic variants associated with predisposition to cancers, inflammatory bowel disease and other conditions. In a typical study, the laboratory gathers genetic information from multiple members of a family to look for variants that appear in those with the target disease but not in their healthy relatives. In one 2011 study, published in Nature Medicine, the laboratory discovered two genetic variants that predict a higher risk for therapy-induced secondary cancers in Hodgkin’s lymphoma patients.

Recently, the Onel lab moved from genome-wide association studies (GWAS) to more detailed whole-exome sequencing. In GWAS, researchers look for hundreds of thousands of pre-chosen gene variants in a subject’s DNA, sampling only a small fraction of the genome. But whole-exome sequencing reads every piece of the genome that encodes proteins, allowing researchers to more easily discover unknown and rare variants.

“The great thing about exomes is you get sequence data; it’s all there,” said Mark Sasaki, a postdoctoral researcher in the Onel lab. “Because it’s exome data, most of the time you find a variant in an exome analysis that is usually deleterious in a protein, so you have a functional readout. I think exomes right now are the most bang for the buck.”

Although the exome makes up only about 1.5 percent of the human genome, the sequence data from just one person can be as large as 10 gigabytes. With sequences from dozens or hundreds of people, the data quickly accumulates and presents serious challenges – in transfer, storage and analysis – for a small research team such as Onel’s. To streamline this process, the laboratory worked with Globus Genomics to create a customized exome sequencing and analysis pipeline.


Whether the sequencing is done at campus core facilities or off-site at an external vendor, moving hundreds of gigabytes over an FTP connection or mailing hard drives can be slow and error-prone. Globus Genomics worked with the Onel lab to set up Globus endpoints on the lab’s computers and at the sequencing centers, allowing for the rapid transfer of dozens of exome sequences to the laboratory.

“Globus works really well for us, particularly the sheer ease of use and the speed of the transfer,” Sasaki said. “Instead of sending hard drives, it’s easier for us to just get data transferred electronically.”

A Globus Genomics pipeline can also be established to transfer data from laboratory storage to cloud-based analytics software and back. For the pilot project, Onel researchers sent 45 exomes to a cloud-based instance of Galaxy, the open, web-based genomics analysis platform. Though each exome requires around 20 hours of computational time to analyze, scaling up via the elasticity of Amazon Web Services allowed the pipeline to run analyses in parallel, delivering results from dozens of sequences back to the laboratory in less than a week – at a tenth of the cost offered by external vendors.
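The arithmetic behind that parallel speedup can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only, not the Globus Genomics scheduler: it assumes each exome runs on its own cloud worker for the full 20 hours, ignores transfer and startup overhead, and uses hypothetical worker counts that do not come from the pilot project.

```python
import math


def serial_days(n_exomes: int, hours_per_exome: float) -> float:
    """Days needed to analyze every exome one after another on one machine."""
    return n_exomes * hours_per_exome / 24


def parallel_hours(n_exomes: int, hours_per_exome: float, n_workers: int) -> float:
    """Wall-clock hours when exomes are processed in batches of n_workers.

    Assumption: one exome per worker at a time and no scheduling overhead.
    """
    batches = math.ceil(n_exomes / n_workers)
    return batches * hours_per_exome


if __name__ == "__main__":
    # Figures from the article: 45 exomes, about 20 hours each.
    print(serial_days(45, 20))         # one machine, back to back
    print(parallel_hours(45, 20, 45))  # hypothetical: one worker per exome
    print(parallel_hours(45, 20, 10))  # hypothetical: ten workers
```

Under these assumptions, a single machine would need 37.5 days, while even ten workers would finish in 100 hours, a little over four days, which fits the article’s figure of less than a week.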


Sequencing centers also offer analysis services, but often use proprietary software and methods that are not completely revealed to outside researcher-customers. These closed systems can be restrictive – or expensive – if a researcher wants to create a customized analysis process or add in their own custom modules.

Working with Dinanath Sulakhe from Globus Genomics, Sasaki tinkered with different combinations of analysis tools to better target the genetic variants the lab was seeking. Raw data could be repeatedly sent via Globus to the cloud-based Galaxy instance to test different analysis workflows, a process that would have taken weeks and incurred multiple charges if done through a sequencing center.

“The most important thing here was that the user had an option to decide the pipeline,” said Sulakhe, engagement manager and solution architect with Globus Genomics. “They were actively involved in the workflow design process: they picked the tools that they wanted to use, they decided on the parameters that they wanted to run with these workflows. That adds great value, because they were actively involved in understanding the science and helping them come up with some new tools.”


With future plans to sequence hundreds of exomes a year, these steps will help facilitate new research in the Onel lab and accelerate the translation of new discoveries from bench to clinic. Previously, working with such a large influx of new data would have required hiring additional personnel skilled in advanced computational techniques. But Globus Genomics offers a simpler and less expensive option for laboratories unable to build a new cyberinfrastructure from scratch.

“It’s very collaborative and user-friendly,” Sasaki said. “It’s great for a lab that’s on a budget and has minimal people. It’s been good for us.”