Protein secondary structure prediction using conditional random fields and profiles

http://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/np193d45q

Descriptions

Abstract or Summary
  • Protein secondary structure prediction plays a pivotal role in predicting how proteins fold in three dimensions. Its task is to assign each residue to one of three secondary structure classes: helix, strand, or random coil. This is an instance of sequential supervised learning, a problem in machine learning. This thesis describes a new model, TreeCRFpsi, for addressing this problem. TreeCRFpsi combines recent advances in machine learning with new sequence representations developed in molecular biology. The machine learning method, TreeCRF, constructs a conditional random field (CRF) by fitting a set of regression trees via an algorithm known as gradient tree boosting. The new sequence representation is the PSI-BLAST profile introduced by D. Jones, which is based on matching sequences of known protein structure against a much larger set of sequences drawn from the NCBI non-redundant protein sequence database. A new cross-validation methodology was developed and applied to choose the best parameter values for the model. The chosen parameters were a tree size of 10 leaves, a sliding window size of 15 residues, and 3 rounds of PSI-BLAST searching. The mean three-state prediction accuracy reached 77.6% on both our new SD482 dataset and the popular CB513 dataset. This result is the best among all published results. TreeCRFpsi improved especially on helix and strand predictions, by 1 to 2.3 percentage points over the previous best methods. SOV99 scores were 74.6% and 73.9% for SD482 and CB513, respectively. In addition, we observed no apparent overfitting in our model. Besides achieving higher accuracy, TreeCRFpsi is the first secondary structure prediction method based on a well-defined probabilistic model, which makes it easier to use the output predictions as inputs to subsequent analysis steps.
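The abstract mentions two pieces that are easy to illustrate: a sliding window of 15 residues over a per-residue profile, and three-state prediction accuracy. The following is a minimal sketch of both, not the thesis implementation. The function names, the zero-padding at the sequence ends, and the one-letter labels H, E, C for helix, strand, and coil are assumptions made for illustration only.

```python
from typing import List, Sequence

WINDOW_SIZE = 15  # sliding window size reported in the abstract


def window_features(profile: Sequence[Sequence[float]],
                    window: int = WINDOW_SIZE) -> List[List[float]]:
    """Return one flattened feature vector per residue.

    `profile` is an L x K matrix: one row of K profile scores per residue.
    Window positions that fall outside the sequence are zero-padded; this
    padding scheme is an assumption of the sketch, not taken from the thesis.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a residue")
    half = window // 2
    n_cols = len(profile[0]) if profile else 0
    pad = [0.0] * n_cols
    features = []
    for i in range(len(profile)):
        row: List[float] = []
        # Concatenate the profile rows of the residues around position i.
        for j in range(i - half, i + half + 1):
            row.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(row)
    return features


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label is predicted correctly.

    Labels are assumed to be H (helix), E (strand), and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)
```

With a 20-column profile and the default window, each residue gets a vector of 15 × 20 = 300 values. In a setup like the one the abstract describes, such vectors would be the inputs to the regression trees, and `q3_accuracy` would be averaged over the proteins in a test set.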
File Extent
  • 889024 bytes
Additional Information
  • description.provenance : Made available in DSpace on 2006-06-01T19:35:24Z (GMT). No. of bitstreams: 1 Shen_MS_Thesis_2006.pdf: 889024 bytes, checksum: 592948652eefdc3ca0af0652a3902318 (MD5)
  • description.provenance : Submitted by Rongkun Shen (shenr) on 2006-05-23T18:58:26Z No. of bitstreams: 1 Shen_MS_Thesis_2006.pdf: 889024 bytes, checksum: 592948652eefdc3ca0af0652a3902318 (MD5)
  • description.provenance : Approved for entry into archive by Julie Kurtz (julie.kurtz@oregonstate.edu) on 2006-05-30T18:16:03Z (GMT) No. of bitstreams: 1 Shen_MS_Thesis_2006.pdf: 889024 bytes, checksum: 592948652eefdc3ca0af0652a3902318 (MD5)

Last modified: 08/18/2017
