The sequence master
Computer scientist David Haussler leads a pioneering UCSC group in the
booming field of bioinformatics
Sequences of letters from human DNA illuminate computer scientist David
Haussler, leader of a UCSC group applying powerful computational techniques
to dig through mountains of biological data.
By Robert Irion
In the 1840s, prospectors scoured California's hills to stake their
claims in a frantic search for gold. Today, another Gold Rush is sweeping
the country, but the stakes are far higher. The quarry of this high-tech
pursuit, more precious than any mineral, is the genetic gold buried within
our cells: DNA, the blueprint for life.
Researchers stand on the threshold of decoding the complex instructions
in the strands of our DNA. Spelled out bit by bit, these instructions tell
a body how to develop from conception to death and how to function from
day to day. They also may contain errors, triggering disease. Charting this
genetic landscape, our "genome," is the goal of the Human Genome
Project, which is expected to wind up in the next decade.
A map of this landscape could have profound benefits for human health. Genetic
screens may provide early warnings for a host of inherited diseases. More
tantalizing still, researchers hope to design new drugs by studying disease-causing
genes and the cellular gears they drive. Biologists also plan to compare
our DNA to the genomes of other organisms, which should unearth the evolutionary
roots of life.
To realize those visions, researchers will need smart tools to cope with
the staggering size of biological databases. The human genome alone holds
3 billion units of raw genetic data--mountains of information that conceal
life's key genes and their functions. Some of the nation's top computer
wizards have allied with biologists to create automated ways of mining those
nuggets. Their hot new science is called bioinformatics.
"This field represents the convergence of two great technologies of
the last half of this century: the computer revolution and biotechnology,"
says David Haussler, professor of computer science at UCSC. "Everyone
from pharmaceutical companies to molecular biologists is screaming for
bioinformatics. It truly has the potential to revolutionize medicine and
the life sciences."
A tall and eloquent man with a fondness for suspenders, Haussler leads a
young group at the forefront of bioinformatics. Their work spans four UCSC
departments: computer science, computer engineering, chemistry and biochemistry,
and biology. Both graduate students and undergraduates play important roles.
All point to Haussler as the reason for the team's success. "David
is a first-rate computer scientist, but he also listens carefully to biologists
to learn what we know and what we need," says Harry Noller, director
of UCSC's Center for the Molecular Biology of RNA. "His group has developed
some terrific computational approaches in a very short time."
Geneticist Sean Eddy of Washington University, another leader in the field,
says Haussler's signal achievement is bringing "rigorous mathematical
formalism" to bear on biological data analysis. "The ideas he's
contributing are very powerful and are grounded in serious computer science
and statistics. He's a brilliant scientist, and he takes the biology very
seriously."
To see why the statistical power of bioinformatics is a boon for biologists,
one need only ponder the volume of data they face. The human genome's billions
of units act as "letters"--four different chemical building blocks
that interlock along the DNA molecule. Those letters spell out about 100,000
recipes, or genes, that each cell follows to make its rich broth of proteins.
Those proteins, in turn, perform the elemental work of life.
To scan 3 billion letters at a pace of 10 per second would take nearly a
decade. Indeed, just finding and defining a single gene used to take years
in the lab. Now, computers can help identify new genes in a matter of days.
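The "nearly a decade" figure is easy to check by hand. The short sketch below redoes the arithmetic, using the 3 billion letters and the 10-letters-per-second pace quoted above; the assumption of nonstop reading, with no breaks, is ours.

```python
# Back-of-the-envelope check of the reading-time figure quoted in the article.
GENOME_LETTERS = 3_000_000_000  # "3 billion units" (from the article)
LETTERS_PER_SECOND = 10         # reading pace (from the article)

# Assumption: reading continues around the clock, every day of the year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_scan(letters: int, rate: float) -> float:
    """Return the years of nonstop reading needed at `rate` letters per second."""
    return letters / rate / SECONDS_PER_YEAR


print(f"{years_to_scan(GENOME_LETTERS, LETTERS_PER_SECOND):.1f} years")  # prints "9.5 years"
```

At that pace the scan takes about nine and a half years, which matches the article's estimate.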
However, only about 3 percent of the human genome contains genes. Most sequences
of DNA letters, biologists believe, encode nothing at all or serve some
unknown purpose. And within most genes, the instructions may start and stop
dozens of times, interrupted by more apparently useless DNA.
"Genes don't raise red flags and say, 'Here I am!'" Haussler says.
"You need intelligent programs to locate where they begin and end."
Creating those programs is where Haussler's group excels. His team's most
potent method has an intriguing name: hidden Markov models, or HMMs, named
for a turn-of-the-century Russian mathematician. Since the mid-1960s, researchers
have used HMMs to reveal patterns within human speech. In essence, HMMs
provide statistical models of different ways to pronounce a word. "If
someone comes along with a new accent, you'll recognize the word because
the model gives you a picture of the variability," says
Kimmen Sjolander, a graduate student under Haussler.
DNA, like speech, also obeys rules of "grammar." Only certain
patterns of letters lead to viable genes; special letters tell the cell
when to start and stop the protein assembly line. In 1992, Haussler's group
proposed that one could apply HMMs to biological data. The group now uses
HMMs to create libraries of the "words," or genes, that a genome
likes to say--and to expose subtly different words that may represent new
genes.
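To make the idea concrete, here is a toy sketch of how an HMM can label stretches of DNA. It is not Genie and not the UCSC group's model: the two states, every probability, and the function name `viterbi` are invented for illustration, and real gene finders track far more structure (start and stop signals, interruptions within genes). The sketch assumes, purely for the example, that gene-like regions are richer in the letters G and C, and it uses the standard Viterbi algorithm to find the most probable sequence of hidden states.

```python
import math

# Hypothetical two-state model; every number here is made up for illustration.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {  # probability of moving from one hidden state to the next
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {  # probability of seeing each DNA letter while in a given state
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(sequence: str) -> list[str]:
    """Return the most probable hidden-state label for each letter."""
    if not sequence:
        return []
    # best[s] is the log-probability of the best path so far that ends in s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for letter in sequence[1:]:
        step, new_best = {}, {}
        for s in STATES:
            # Pick the predecessor state that makes arriving in s most likely.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            step[s] = prev
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][letter])
            )
        backpointers.append(step)
        best = new_best
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path


labels = viterbi("ATATATAT" + "GCGCGCGCGCGC" + "ATATATAT")
print("".join("c" if label == "coding" else "." for label in labels))
```

With these invented numbers, the model marks the GC-rich middle stretch as "coding" and the flanking letters as "noncoding." The model never sees the labels directly; it infers them from the pattern of letters, much as a speech recognizer infers a word from its sounds.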
One notable program that relies on HMMs is "Genie," a gene finder
devised with researchers at Lawrence Berkeley National Laboratory. UCSC
graduate student David Kulp, Genie's main brain, says it almost always
pinpoints the most probable genes in a novel DNA sequence more accurately
than any other program.
Genie may help grant the wishes of biologists such as UCSC's Manuel Ares
and John Tamkun, who study yeast and fruit flies, respectively. Yeast and
flies may seem like lowly lab critters. However, they serve as valuable
models for many human processes, because all living things--from primitive
bacteria to people--share much of the same basic genetic machinery. Ares
and Tamkun hope to employ bioinformatics to compare and contrast the
well-studied genes and proteins in their organisms with unknown genes
arising from the Human Genome Project.
One also must know the shape of the protein that a gene encodes to grasp
the gene's role in the cell. Haussler's group has made inroads into this
"protein folding problem," an urgent issue in pharmaceutical research
(see sidebar). Haussler collaborates with UCSC biochemists Anthony
Fink and Lydia Gregoret, who probe the details of how proteins fold. In
particular, bioinformatics may help them unravel the mystery of certain
proteins that clump into lesions, triggering Alzheimer's and other devastating
diseases.
These colleagues point out, and Haussler concurs, that bioinformatics is
no panacea. Lab research by biologists and chemists will always be essential
to test the predictions of computers and to see how genes and drugs work
in living things. But all agree that computational tools will grow more
critical as the torrent of biological data swells.
"I can't think of anything more rewarding than exploring the human
genome and trying to decipher its messages," Haussler says. "It's
exciting to be one of the pioneers in this area, setting the stage for the
work of generations to come."
Computer engineer Richard Hughey, flanked by undergraduate David Dahle
(left) and postdoctoral researcher Jeff Hirschberg, holds a prototype chip
for his "Kestrel" programmable array. Kestrel will boost the speed
of searching through biological databases by a factor of 100 or more. The
team used computer-aided design to cram efficient processors into chips
just a few millimeters square (see enlargement, above, left).
This display shows how the "Genie" computer program forages
through biological sequences for genes. At top, the purple bar shows the
location of a gene, correctly predicted by Genie (blue) but missed by competing
programs. At bottom, Genie sifts for key strings within a sequence of DNA
letters.
Go to related story: "Take the protein challenge"