A Computerized Hunt for Cancer Classifiers

By Rob Mitchum // August 20, 2013

Finding a better way to fight cancer doesn’t always mean discovering a new drug or surgical technique. Sometimes just defining the disease in greater detail can make a big difference. A more specific diagnosis may allow a physician to better tailor a patient’s treatment, using available therapies proven to work better on a specific subtype of disease or avoiding unnecessary complications for less aggressive cases.

“Finding better ways to stratify kids when they present and decide who needs more therapy and who needs less therapy is one of the ways in which we’ve gotten much better at treating pediatric cancer,” said Samuel Volchenboum, Computation Institute Fellow, Assistant Professor of Pediatrics at Comer Children’s Hospital and Director of the UChicago Center for Research Informatics. “For example, kids can be put in one of several different groups for leukemia, and each group has its own treatment course.”

Classically, patients have been sorted into risk or treatment groups based on demographic factors such as age or gender, and relatively simple results from laboratory tests or biopsies. Because cancer is a genetic disease, physicians hope that genetic factors will point the way to even more precise classifications. Yet despite this promise, many of the “genetic signatures” found to correlate with different subtypes of cancer are too complex – involving dozens or hundreds of genes – for clinical use and difficult to validate across patient populations.

For the clinic, the ideal gene classifier would be a small number of genes whose expression levels predict a relevant phenotype, such as disease severity or response to treatment. For example, if elevated levels of gene A and gene B reliably predict a high-risk form of the disease, physicians can quickly test those genes in each new patient to help determine the best treatment.

In a new paper published in Cancer Research, Volchenboum and his colleagues looked for smaller predictive gene combinations within the genetic signatures found for a common type of pediatric cancer, rhabdomyosarcoma. The principle behind the search was simple: pick any two (or three, or four) genes at random from a published 50-gene signature, and see if that combination predicts a particular clinical phenotype – in this case, whether the cancer was of the high-risk or low-risk subtype. To automate the task, Volchenboum wrote a simple program in Python that tested all the possible combinations.
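The exhaustive search described above can be sketched in a few lines of Python. This is only an illustration, not the program from the paper: the article does not say how each combination was scored, so the nearest-centroid rule, the function names, and the toy expression values below are all assumptions made for the example.

```python
from itertools import combinations


def centroid(rows):
    """Mean expression vector of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_accuracy(samples, labels, genes):
    """Fraction of samples whose nearest class centroid, using only
    `genes`, matches their label. (Assumed scoring rule, not the paper's.)"""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append([sample[g] for g in genes])
    centroids = {label: centroid(rows) for label, rows in by_class.items()}

    correct = 0
    for sample, label in zip(samples, labels):
        point = [sample[g] for g in genes]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == label
    return correct / len(samples)


def search_combinations(samples, labels, gene_names, sizes=(2, 3), min_accuracy=0.98):
    """Test every combination of the given sizes and keep the accurate ones."""
    hits = []
    for k in sizes:
        for genes in combinations(gene_names, k):
            accuracy = nearest_centroid_accuracy(samples, labels, genes)
            if accuracy >= min_accuracy:
                hits.append((genes, accuracy))
    return hits


# Toy data (invented): gene A tracks risk, genes C and D are noise.
samples = [
    {"A": 9.0, "C": 5.0, "D": 3.0},
    {"A": 8.7, "C": 2.0, "D": 6.0},
    {"A": 9.3, "C": 7.0, "D": 4.0},
    {"A": 2.1, "C": 5.5, "D": 4.0},
    {"A": 1.8, "C": 2.5, "D": 3.0},
    {"A": 2.5, "C": 6.5, "D": 6.0},
]
labels = ["high", "high", "high", "low", "low", "low"]

hits = search_combinations(samples, labels, ["A", "C", "D"], sizes=(2,))
for genes, accuracy in hits:
    print(genes, accuracy)
```

Note that this sketch scores each combination on the same samples used to build it. Any hits would still need to be checked against independent data sets before being trusted.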

That initial test was a success, discovering 28 combinations of two or three genes that could classify samples from a clinical data set into the two groups with at least 98% accuracy. But Volchenboum wondered if there were even more classifier combinations out there to be found, using genes not included in this particular signature. So he ran a bigger test, using the 1,000 genes most likely to be differentially expressed in high-risk and low-risk samples.

Because this pool of candidates yields about one million possible two-gene pairs, the larger test was far more computationally demanding. On Volchenboum’s personal computer, the search took roughly 6 hours to run. He then decided to look for two-gene pairs across the entire microarray of nearly 55,000 genes, a whopping 1.5 billion pairs that exceeded the ability of an everyday office computer.
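A quick back-of-the-envelope check shows how fast the search space grows. The sketch below uses 55,000 as a stand-in for "nearly 55,000" genes, and the runtime estimate assumes the cost per pair stays constant, which is a simplification. The "one million" figure matches the full 1,000 × 1,000 grid; counting each unordered pair once gives about half that, and the article does not say which count the search used.

```python
from math import comb

# Combinations of 2, 3, or 4 genes from the published 50-gene signature.
signature_counts = {k: comb(50, k) for k in (2, 3, 4)}
print(signature_counts)  # {2: 1225, 3: 19600, 4: 230300}

# Pairs from the 1,000-gene pool.
unordered_1000 = comb(1000, 2)  # each pair counted once
grid_1000 = 1000 * 1000         # full gene-by-gene grid
print(unordered_1000, grid_1000)  # 499500 1000000

# Pairs from the whole microarray (55,000 is a rounded, assumed figure).
full_array_pairs = comb(55_000, 2)
print(full_array_pairs)  # 1512472500

# Naive linear extrapolation from the article's figures:
# about one million pairs in about 6 hours on a personal computer.
hours_per_pair = 6 / 1_000_000
laptop_days = full_array_pairs * hours_per_pair / 24
print(round(laptop_days))
```

Under these assumptions the full search would take on the order of a year on the same machine.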

“You could run it for months on your laptop, and it will still not be finished,” Volchenboum said.

At this point, Volchenboum reached out to CI Fellow Mike Wilde, who helped him adapt the original algorithm to run under Swift, a programming language that makes parallel programming much easier for scientists and data analysts. Instead of testing one pair at a time on his laptop, the program could run simultaneously on dozens or hundreds of cores of a high-performance parallel computing cluster, completing the search much more quickly. When all 1.5 billion pairs were run on UChicago’s Midway research cluster, it took the same amount of time as the smaller search on Volchenboum’s laptop: 6 hours.
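The reason the search parallelizes so well is that every pair can be scored independently of every other pair. The sketch below shows that idea in Python rather than in Swift, the tool the researchers actually used. It splits the pairs into independent blocks, hands the blocks to a pool of workers, and merges the hits. The scoring rule ("high-risk if gene i is expressed above gene j", or the reverse), the function names, and the toy data are assumptions for illustration. A thread pool stands in for a cluster here; a real CPU-bound run would use separate processes or cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def pair_score(expr, labels, i, j):
    """Accuracy of the rule 'high-risk if gene i > gene j', or of its
    reverse, whichever is better. (Hypothetical rule, not the paper's.)"""
    correct = sum((row[i] > row[j]) == is_high for row, is_high in zip(expr, labels))
    return max(correct, len(labels) - correct) / len(labels)


def score_block(expr, labels, first_genes, n_genes, min_score):
    """Score every pair (i, j) with i in this block and j > i."""
    hits = []
    for i in first_genes:
        for j in range(i + 1, n_genes):
            score = pair_score(expr, labels, i, j)
            if score >= min_score:
                hits.append((i, j, score))
    return hits


def parallel_pair_search(expr, labels, n_workers=4, min_score=0.98):
    """Split the pair space into independent blocks and merge the hits."""
    n_genes = len(expr[0])
    # Interleave first-gene indices so blocks hold similar numbers of pairs.
    blocks = [range(w, n_genes, n_workers) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(score_block, expr, labels, block, n_genes, min_score)
            for block in blocks
        ]
        hits = [hit for future in futures for hit in future.result()]
    return sorted(hits)


# Toy data (invented): rows are samples, columns are genes 0..3.
expr = [
    [9.0, 2.0, 5.0, 5.5],
    [8.5, 1.5, 6.0, 4.0],
    [9.2, 2.2, 3.0, 3.5],
    [2.0, 8.8, 5.2, 5.0],
    [1.7, 9.1, 4.0, 6.1],
    [2.4, 8.6, 6.5, 3.9],
]
labels = [True, True, True, False, False, False]  # True = high-risk

parallel_hits = parallel_pair_search(expr, labels, n_workers=2)
serial_hits = score_block(expr, labels, range(len(expr[0])), len(expr[0]), 0.98)
print(parallel_hits == sorted(serial_hits))
```

Because no block depends on another, adding workers shortens the run without changing the result, which is what lets the same search spread across dozens or hundreds of cores.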

The searches turned up thousands of combinations that could potentially differentiate high-risk rhabdomyosarcoma cases from low-risk cases. Many of those combinations were successfully validated in different rhabdomyosarcoma data sets to prove that they weren’t only relevant to the patients used in the original search.

These new classifiers could be useful in the clinic both by speeding up the process of diagnosing a patient’s cancer and by offering a helpful clue for patients that are difficult to classify, Volchenboum said.

“Where I think it can make a difference is in the edge cases,” Volchenboum said. “You might be headed toward a particular treatment based on traditional factors, but one of these genetic tests might make you rethink the therapy. Bringing these tests into the clinic is important, because I think it offers another way to get kids that are sick started right away on treatment.”

Beyond the clinic, the test could be used for identifying promising new avenues for studying the biology of different cancers, perhaps suggesting new targets for drug therapies. Volchenboum and Wilde are also generalizing the procedures they used so that researchers studying any type of cancer – or any disease – could run their own searches for simple genetic classifiers, perhaps using cloud-based, on-demand computing. Indeed, in the paper, the researchers tested the method on a lung cancer dataset, finding two-gene pairs that could differentiate between early death and long-term survival cases of the disease.

“What I’d like to do is make the general testing case possible for any cancer,” Volchenboum said. “You could upload data from any tumor, set the number of tests you want to run on a high-performance computer, and just press go. It would let you quickly identify sets of genes that may have importance, clinically and biologically, that you may never have thought of otherwise.”