Cloud-computing revolution applies to evolution

NSF grant will help Rice University scientists simplify tools to trace genes across species 

A $1.1 million National Science Foundation grant to two Rice University computer science groups will allow them to build cloud-computing tools to help analyze evolutionary patterns.

With the three-year grant, Christopher Jermaine and Luay Nakhleh, both associate professors of computer science, will develop parallel-processing tools that track the evolution of genes and genomes across species.

Luay Nakhleh, left, and Christopher Jermaine. Photo by Jeff Fitlow

The Rice team expects its new open-source algorithms will bring sophisticated computing techniques to researchers who have limited access to supercomputing resources but can easily rent “cloud-computing” time from the likes of Amazon or Microsoft.

Even those who have access to mainframes may find it easier to go to the cloud. The programs will be able to run parallel analyses on thousands of computers, which may not only deliver results faster but also make it possible to trace genes at scales that were not practical before.

“We’re doing basic analysis of evolutionary questions,” Nakhleh said. “Evolutionary biologists sample taxa from across the tree of life. They want to know, for example, how a big group of plants may have evolved.”

The NSF-funded project will expand upon Bayesian inference techniques that allow biologists to build upon prior knowledge. (Bayesian inference is a statistical method that updates the estimated probability of a hypothesis as data are taken into account.) “They allow biologists to incorporate any prior knowledge they might have into the analysis itself,” Nakhleh said.
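As a rough illustration of how prior knowledge enters a Bayesian analysis, the sketch below applies Bayes' rule to two candidate evolutionary trees. It is not the Rice team's software; the hypothesis names, prior weights and likelihood values are invented for the example.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule: P(h | data) is proportional to P(h) * P(data | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data), the normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical prior belief in two candidate trees (assumed numbers).
prior = {"tree_A": 0.7, "tree_B": 0.3}

# Hypothetical probability of the observed sequences under each tree.
likelihood = {"tree_A": 0.02, "tree_B": 0.08}

post = posterior(prior, likelihood)
for tree, probability in post.items():
    print(f"{tree}: {probability:.3f}")
```

In this toy case the data favor tree_B strongly enough to outweigh the prior preference for tree_A. Real phylogenetic analyses search over an enormous number of possible trees, which is where the computing cost comes from.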

“Analyzing data sets with 10 or 20 gene sequences can easily take hundreds of hours,” he said. “But the tree of life has millions of sequences and is built from millions of species. There’s no way traditional Bayesian techniques are even going to get close to handling that.”

“A problem involving, say, 50 organisms would require tens of thousands of hours of compute time, which is doable,” Jermaine said. “But if you want to move into thousands of organisms, you have to multiply that by 100. Suddenly it’s not so doable.”
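The arithmetic behind that comparison can be sketched in a few lines. The figures here are assumptions chosen for illustration: 20,000 hours stands in for "tens of thousands," 1,000 machines stands in for a rented cluster, and the work is assumed to divide perfectly across machines, which real workloads rarely achieve.

```python
HOURS_PER_YEAR = 24 * 365


def wall_clock_hours(cpu_hours, machines):
    """Elapsed time if the work splits evenly across machines (idealized)."""
    return cpu_hours / machines


small_job = 20_000           # assumed compute hours for about 50 organisms
large_job = small_job * 100  # the 100-fold increase described above

one_machine_years = large_job / HOURS_PER_YEAR
cluster_days = wall_clock_hours(large_job, 1_000) / 24

print(f"One machine: about {one_machine_years:.0f} years")
print(f"1,000 machines: about {cluster_days:.0f} days")
```

Under these assumptions, a job that would occupy a single machine for more than two centuries finishes in roughly three months when spread across 1,000 machines.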

Jermaine believes computer farms that allow thousands of machines to work cooperatively on a problem hold great promise for bioinformatics in general. He recently received another NSF grant to develop tools for machine learning in the cloud and sees phylogenetics, the study of evolutionary relationships, as a prime candidate for parallelization.

“We’re not talking about taking a one-day calculation and taking it down to minutes,” he said. “We’re talking about potentially taking a years- or decadeslong computation and making it feasible by changing the underlying algorithm and making it amenable to distributed computing.”

The researchers plan a turnkey approach to their software that they hope will appeal to biologists. “My impression is they want a very low bar to entry,” Jermaine said. “If they have to write a lot of code or have to figure out how to use all these servers, they’re just not going to do it. Hopefully our solution will be as easy for biologists as pressing a return key.”

About Mike Williams

Mike Williams is a senior media relations specialist in Rice University's Office of Public Affairs.