Learning multiple non-redundant codebooks with word clustering for document classification

http://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/794081570

Abstract or Summary
  • Document classification has been widely studied in machine learning and data mining. Most popular document classification algorithms are based on the bag-of-words representation. Because this representation is high-dimensional, significant research has gone into reducing its dimensionality. One such approach is to learn a codebook by clustering the words. Most current word-clustering algorithms build a single codebook to encode the original dataset for classification. However, a single codebook captures only part of the information present in the data. This thesis presents two new methods, and variations of them, that construct multiple non-redundant codebooks through multiple sequential rounds of word clustering in order to improve the final classification accuracy. Results on benchmark data sets demonstrate that the proposed algorithms significantly outperform both the single-codebook approach and multiple codebooks learned in a bagging-style approach.
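As a rough illustration of the single-codebook idea the abstract starts from, the sketch below clusters words and re-encodes bag-of-words vectors as cluster counts. It is not the thesis's method: the abstract does not name the clustering algorithm, so a plain k-means over per-class word profiles is swapped in here as an assumption, and the function names (`word_profiles`, `kmeans`, `encode`) and the toy data are hypothetical.

```python
def word_profiles(docs, labels, n_classes):
    """For each word, return its normalized distribution of counts over classes."""
    n_words = len(docs[0])
    profiles = [[0.0] * n_classes for _ in range(n_words)]
    for counts, y in zip(docs, labels):
        for w, c in enumerate(counts):
            profiles[w][y] += c
    for p in profiles:
        total = sum(p)
        if total > 0:
            for k in range(n_classes):
                p[k] /= total
    return profiles


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means; centers start at the first k distinct points."""
    centers = []
    for p in points:
        if list(p) not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            best, best_dist = 0, float("inf")
            for j, c in enumerate(centers):
                dist = sum((a - b) ** 2 for a, b in zip(p, c))
                if dist < best_dist:
                    best, best_dist = j, dist
            assign[i] = best
        # Update step: move each non-empty center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def encode(docs, assign, k):
    """Re-encode each document as summed word counts per word cluster."""
    encoded = []
    for counts in docs:
        row = [0] * k
        for w, c in enumerate(counts):
            row[assign[w]] += c
        encoded.append(row)
    return encoded


# Toy data (made up): 4 documents over a 4-word vocabulary, 2 classes.
docs = [[2, 1, 0, 0], [1, 3, 0, 0], [0, 0, 2, 2], [0, 0, 1, 3]]
labels = [0, 0, 1, 1]

codebook = kmeans(word_profiles(docs, labels, n_classes=2), k=2)
encoded = encode(docs, codebook, k=2)
print(codebook)  # word index -> cluster index
print(encoded)   # documents in the reduced, 2-dimensional representation
```

Here the codebook is just the word-to-cluster assignment, and a classifier would be trained on `encoded` instead of the original counts. The sequential, non-redundant construction of several such codebooks is the thesis's contribution and is not reproduced in this sketch.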
Additional Information
  • description.provenance : Made available in DSpace on 2009-07-09T17:58:20Z (GMT). No. of bitstreams: 1 Thesis.pdf: 888897 bytes, checksum: e494c427fbafe521f4d973a4889e48eb (MD5)
  • description.provenance : Submitted by Akshat Surve (survea@onid.orst.edu) on 2009-06-25T02:30:45Z No. of bitstreams: 1 Thesis.pdf: 888897 bytes, checksum: e494c427fbafe521f4d973a4889e48eb (MD5)
  • description.provenance : Approved for entry into archive by Julie Kurtz (julie.kurtz@oregonstate.edu) on 2009-06-30T16:13:51Z (GMT) No. of bitstreams: 1 Thesis.pdf: 888897 bytes, checksum: e494c427fbafe521f4d973a4889e48eb (MD5)
  • description.provenance : Approved for entry into archive by Laura Wilson (laura.wilson@oregonstate.edu) on 2009-07-09T17:58:20Z (GMT) No. of bitstreams: 1 Thesis.pdf: 888897 bytes, checksum: e494c427fbafe521f4d973a4889e48eb (MD5)

Last modified: 08/18/2017