The sequence master

Computer scientist David Haussler leads a pioneering UCSC group in the booming field of bioinformatics

Sequences of letters from human DNA illuminate computer scientist David Haussler, leader of a UCSC group applying powerful computational techniques to dig through mountains of biological data.

By Robert Irion

In the 1840s, prospectors scoured California's hills to stake their claims in a frantic search for gold. Today, another Gold Rush is sweeping the country, but the stakes are far higher. The quarry of this high-tech pursuit, more precious than any mineral, is the genetic gold buried within our cells: DNA, the blueprint for life.

Researchers stand on the threshold of decoding the complex instructions in the strands of our DNA. Spelled out bit by bit, these instructions tell a body how to develop from conception to death and how to function from day to day. They also may contain errors, triggering disease. Charting this genetic landscape, our "genome," is the goal of the Human Genome Project, which will wind up next decade.

A map of this landscape could have profound benefits for human health. Genetic screens may provide early warnings for a host of inherited diseases. More tantalizing still, researchers hope to design new drugs by studying disease-causing genes and the cellular gears they drive. Biologists also plan to compare our DNA to the genomes of other organisms, which should unearth the evolutionary roots of life.

To realize those visions, researchers will need smart tools to cope with the staggering size of biological databases. The human genome alone holds 3 billion units of raw genetic data--mountains of information that conceal life's key genes and their functions. Some of the nation's top computer wizards have allied with biologists to create automated ways of mining those nuggets. Their hot new science is called bioinformatics.

"This field represents the convergence of two great technologies of the last half of this century: the computer revolution and biotechnology," says David Haussler, professor of computer science at UCSC. "Everyone from pharmaceutical companies to molecular biologists is screaming for bioinformatics. It truly has the potential to revolutionize medicine and the life sciences."

A tall and eloquent man with a fondness for suspenders, Haussler leads a young group at the forefront of bioinformatics. Their work spans four UCSC departments: computer science, computer engineering, chemistry and biochemistry, and biology. Both graduate students and undergraduates play important roles.

All point to Haussler as the reason for the team's success. "David is a first-rate computer scientist, but he also listens carefully to biologists to learn what we know and what we need," says Harry Noller, director of UCSC's Center for the Molecular Biology of RNA. "His group has developed some terrific computational approaches in a very short time."

Geneticist Sean Eddy of Washington University, another leader in the field, says Haussler's signal achievement is bringing "rigorous mathematical formalism" to bear on biological data analysis. "The ideas he's contributing are very powerful and are grounded in serious computer science and statistics. He's a brilliant scientist, and he takes the biology very seriously."

To see why the statistical power of bioinformatics is a boon for biologists, one need only ponder the volume of data they face. The human genome's billions of units act as "letters"--four different chemical building blocks that interlock along the DNA molecule. Those letters spell out about 100,000 recipes, or genes, that each cell follows to make its rich broth of proteins. Those proteins, in turn, perform the elemental work of life.

To scan 3 billion letters at a pace of 10 per second would take nearly a decade. Indeed, just finding and defining a single gene used to take years in the lab. Now, computers can help identify new genes in a matter of days.
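The "nearly a decade" figure is easy to check. The sketch below uses the article's two numbers (3 billion letters, 10 letters per second) and assumes nonstop reading through average-length years of 365.25 days; the function name is just an illustrative choice.

```python
# Back-of-the-envelope check of the reading-time estimate.
GENOME_LETTERS = 3_000_000_000            # figure quoted in the article
LETTERS_PER_SECOND = 10                   # reading pace quoted in the article
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: nonstop reading, average-length year


def years_to_scan(letters: int, rate: float) -> float:
    """Years needed to read `letters` symbols at `rate` symbols per second."""
    return letters / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(round(years_to_scan(GENOME_LETTERS, LETTERS_PER_SECOND), 1))  # prints 9.5
```

Under those assumptions the answer comes to about nine and a half years, consistent with the estimate above.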

However, only about 3 percent of the human genome contains genes. Most sequences of DNA letters, biologists believe, encode nothing at all or serve some unknown purpose. And within most genes, the instructions may start and stop dozens of times, interrupted by more apparently useless DNA.

"Genes don't raise red flags and say, 'Here I am!'" Haussler says. "You need intelligent programs to locate where they begin and end."

Creating those programs is where Haussler's group excels. His team's most potent method has an intriguing name: hidden Markov models, or HMMs, named for a turn-of-the-century Russian mathematician. Since the mid-1960s, researchers have used HMMs to reveal patterns within human speech. In essence, HMMs provide statistical models of different ways to pronounce a word. "If someone comes along with a new accent, you'll recognize the word because the model gives you a picture of the variability," says Kimmen Sjolander, a graduate student under Haussler.

DNA, like speech, also obeys rules of "grammar." Only certain patterns of letters lead to viable genes; special letters tell the cell when to start and stop the protein assembly line. In 1992, Haussler's group proposed that one could apply HMMs to biological data. The group now uses HMMs to create libraries of the "words," or genes, that a genome likes to say--and to expose subtly different words that may represent new genes.
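To make the idea concrete, here is a toy sketch of how a hidden Markov model can label stretches of DNA. It is not the model used by Haussler's group or by any real gene finder. The two hidden states ("noncoding" and "coding"), every probability, and the premise that coding regions are richer in G and C are invented for illustration. The decoding step is the standard Viterbi algorithm, which finds the single most probable sequence of hidden states for a given string of letters.

```python
import math

STATES = ("noncoding", "coding")

# All numbers below are invented for illustration; nothing here was
# trained on real DNA.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq (uppercase A, C, G, T)."""
    if not seq:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for letter in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that gives the best score for s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][letter])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    dna = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
    for letter, label in zip(dna, viterbi(dna)):
        print(letter, label)
```

With these made-up numbers, the G- and C-rich stretch in the middle of the example string is labeled "coding" and the A- and T-rich flanks "noncoding." Real gene finders use far richer models with many more states, and their parameters are estimated from known genes.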

One notable program that relies on HMMs is "Genie," a gene finder devised with researchers at Lawrence Berkeley National Laboratory. UCSC graduate student David Kulp, Genie's main brain, says it almost always pinpoints the most probable genes in a novel sequence of DNA more accurately than any other program.

Genie may help grant the wishes of biologists such as UCSC's Manuel Ares and John Tamkun, who study yeast and fruit flies, respectively. Yeast and flies may seem like lowly lab critters. However, they serve as valuable models for many human processes, because all living things--from primitive bacteria to people--share much of the same basic genetic machinery. Ares and Tamkun hope to employ bioinformatics to compare and contrast the well-studied genes and proteins in their organisms with unknown genes arising from the Human Genome Project.

One also must know the shape of the protein that a gene encodes to grasp the gene's role in the cell. Haussler's group has made inroads into this "protein folding problem," an urgent issue in pharmaceutical research (see sidebar). Haussler collaborates with UCSC biochemists Anthony Fink and Lydia Gregoret, who probe the details of how proteins fold. In particular, bioinformatics may help them unravel the mystery of certain proteins that clump into lesions, triggering Alzheimer's and other devastating diseases.

These colleagues point out, and Haussler concurs, that bioinformatics is no panacea. Lab research by biologists and chemists will always be essential to test the predictions of computers and to see how genes and drugs work in living things. But all agree that computational tools will grow more critical as the torrent of biological data swells.

"I can't think of anything more rewarding than exploring the human genome and trying to decipher its messages," Haussler says. "It's exciting to be one of the pioneers in this area, setting the stage for the work of generations to come."

Computer engineer Richard Hughey, flanked by undergraduate David Dahle (left) and postdoctoral researcher Jeff Hirschberg, holds a prototype chip for his "Kestrel" programmable array. Kestrel will boost the speed of searching through biological databases by a factor of 100 or more. The team used computer-aided design to cram efficient processors into chips just a few millimeters square (see enlargement, above, left).

This display shows how the "Genie" computer program forages through biological sequences for genes. At top, the purple bar shows the location of a gene, correctly predicted by Genie (blue) but missed by competing programs. At bottom, Genie sifts for key strings within a sequence of DNA letters.