Abstract:
Basic biological information is stored in strings of nucleic acids (DNA, RNA) or amino acids
(proteins). Teasing out the meaning of these strings is a central problem of modern biology. Matching and
aligning strings brings out their shared characteristics. Although string matching is well-understood in the
edit-distance model, biological strings with transpositions and inversions violate this model's assumptions.
We propose a family of heuristics called walking trees to align biologically reasonable strings.
Both edit-distance and walking tree methods can locate specifi c genes within a large string when the
genes' sequences are given. When we attempt to match whole strings, the walking tree matches most
genes, while the edit-distance method fails. We also give examples in which the walking tree matches
substrings even if they have been moved or inverted. The edit-distance method was not designed to
handle these problems. We include an example in which the walking tree "discovered" a gene.
Calculating scores for whole genome matches gives a method for approximating evolutionary distance.
We show two evolutionary trees for the picornaviruses which were computed by the walking tree heuristic.
Both of these trees show great similarity to previously constructed trees. The point of this demonstration
is that WHOLE genomes can be matched and distances calculated. The first tree was created on a Sequent
parallel computer and demonstrates that the walking tree heuristic can be effeciently parallelized. The
second tree was created using a network of work stations and demonstrates that there is suffient parallelism
in the phylogenetic tree calculation that the sequential walking tree can be used effectively on a network.