Algorithms for string matching with applications in molecular biology Public Deposited

http://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/ng451k71c

Descriptions

Attribute NameValues
Creator
Abstract or Summary
  • As the volume of genetic sequence data increases due to improved sequencing techniques and increased interest, the computational tools available to analyze the data are becoming inadequate. This thesis seeks to improve a few of the computational methods available to access and analyze data in the genetic sequence databases. The first two results are parallel algorithms based on previously known sequential algorithms. The third result is a new approach, based on assumptions that we believe make sense in the biological context of the problem, to approximating an NP complete problem. The final result is a fundamentally new approach to approximate string matching using the divide and conquer paradigm instead of the dynamic programming approach that has been used almost exclusively in the past. Dynamic programming algorithms to measure the distance between sequences have been known since at least 1972. Recently there has been interest in developing parallel algorithms to measure the distance between two sequences. We have developed an optimal parallel algorithm to find the edit distance, a metric frequently used to measure distance, between two sequences. It is often interesting to find the substrings of length k that appear most frequently in a given string. We give a simple sequential algorithm to solve this problem and an efficient parallel version of the algorithm. The parallel algorithm uses an efficient novel parallel bucket sort. When sequencing a large segment of DNA, the original DNA sequence is reconstructed using the results of sequencing fragments, that may or may not contain errors, of many copies of the original DNA. New algorithms are given to solve the problem of reconstructing the original DNA sequence with and without errors introduced into the fragments. A program based on this algorithm is used to reconstruct the human beta globin region (HUMHBB) when given a set of 300 to 500 mers drawn randomly from the HUMHBB region. Approximate string matching is used in a biological context to model the steps of evolution. While such evolution may proceed base by base using the change, insert, or delete operators, there is also evidence that whole genes may be moved or inverted. We introduce a new problem, the string to string rearrangement problem, that allows movement and inversion of substrings. We give a divide and conquer algorithm for finding a rearrangement of one string within another.
Resource Type
Date Available
Date Issued
Degree Level
Degree Name
Degree Field
Degree Grantor
Commencement Year
Advisor
Committee Member
Academic Affiliation
Non-Academic Affiliation
Subject
Rights Statement
Language
Replaces
Additional Information
  • description.provenance : Made available in DSpace on 2009-05-07T16:23:46Z (GMT). No. of bitstreams: 1 Jim_Holloway.pdf: 1020518 bytes, checksum: 8895cee6147fab5bbb504d5c356b491c (MD5)
  • description.provenance : Submitted by Philip Vue (vuep@onid.orst.edu) on 2009-05-07T16:02:02Z No. of bitstreams: 1 Jim_Holloway.pdf: 1020518 bytes, checksum: 8895cee6147fab5bbb504d5c356b491c (MD5)
  • description.provenance : Approved for entry into archive by Linda Kathman(linda.kathman@oregonstate.edu) on 2009-05-07T16:20:49Z (GMT) No. of bitstreams: 1 Jim_Holloway.pdf: 1020518 bytes, checksum: 8895cee6147fab5bbb504d5c356b491c (MD5)
  • description.provenance : Approved for entry into archive by Linda Kathman(linda.kathman@oregonstate.edu) on 2009-05-07T16:23:46Z (GMT) No. of bitstreams: 1 Jim_Holloway.pdf: 1020518 bytes, checksum: 8895cee6147fab5bbb504d5c356b491c (MD5)

Relationships

In Administrative Set:
Last modified: 08/01/2017

Downloadable Content

Download PDF
Citations:

EndNote | Zotero | Mendeley

Items