Active learning with committees: an approach to efficient learning in text categorization using linear threshold algorithms

http://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/7w62fd57s

Descriptions

Abstract or Summary
  • We developed and investigated machine learning methods that require minimal preprocessing of the input data, use few training examples, run fast, and still obtain high levels of accuracy. Most approaches to designing machine learning programs are based on the supervised learning paradigm – training examples are chosen randomly and given to the learner. We explore the "active learning" paradigm – the learner automatically selects the more informative training examples. Our domain of interest is text categorization, but most of the methods developed are quite general. The purpose of text categorization is to assign each document in a collection to appropriate categories. Most existing text categorization methods require large amounts of time to prepare the documents for learning and large numbers of examples for training. Humans must assign correct categories to documents before they can be used for training; this costs time and money. Our goal is to develop machine learning methods that, when compared to other methods currently available, are more efficient in time and space, use fewer training documents, and are as accurate. We developed the Active Learning with Committees (ALC) framework – inspired by the Query by Committee approach of Freund, Seung, et al. A "committee" is a group of learners that jointly participate in learning and in predicting the classes of new examples. We perform minimal preprocessing of the documents and thus the domain is noisy, high dimensional, and has large numbers of irrelevant attributes. We use linear threshold learning algorithms to obtain computational efficiency with respect to these large numbers of attributes, with specific algorithms being chosen because they also generalize well when large numbers of attributes are irrelevant. We developed and analyzed several ALC systems. 
Our results show that it is possible to design active learning systems that scale up to large numbers of features and obtain accuracies comparable to the supervised learning methods while using an order of magnitude fewer examples and an order of magnitude less time. The ALC methods developed have run times on the order of seconds, typically use only 5-7% of the training documents, and are as accurate as their supervised counterparts.
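As a rough illustration of the committee idea described in the abstract, the following Python sketch trains a small committee of linear threshold learners and asks for a label only when the members disagree on a document. It is not the thesis's ALC system: the perceptron update rule, the random initial weights, the disagreement test, and every name used here (`active_learn`, `oracle`, `vocab`, the toy documents) are assumptions made for illustration only.

```python
import random

def predict(weights, x):
    """Linear threshold unit: +1 if the weighted sum is >= 0, else -1.
    `x` is a sparse bag-of-words dict mapping feature -> value."""
    score = sum(weights.get(f, 0.0) * v for f, v in x.items())
    return 1 if score >= 0 else -1

def perceptron_update(weights, x, label, rate=1.0):
    """Mistake-driven perceptron update (assumed here; the thesis may use
    other linear threshold algorithms)."""
    if predict(weights, x) != label:
        for f, v in x.items():
            weights[f] = weights.get(f, 0.0) + rate * label * v

def disagrees(committee, x):
    """True when committee members do not all predict the same class."""
    return len({predict(w, x) for w in committee}) > 1

def committee_vote(committee, x):
    """Majority vote of the committee (ties go to +1)."""
    return 1 if sum(predict(w, x) for w in committee) >= 0 else -1

def active_learn(stream, oracle, features, committee_size=5, seed=0):
    """Scan unlabeled documents; request a label from `oracle` only when
    the committee disagrees, then update every member on that example."""
    rng = random.Random(seed)
    committee = [{f: rng.uniform(-1.0, 1.0) for f in features}
                 for _ in range(committee_size)]
    queries = 0
    for x in stream:
        if disagrees(committee, x):
            label = oracle(x)          # the costly human labeling step
            queries += 1
            for w in committee:
                perceptron_update(w, x, label)
    return committee, queries

# Toy usage with hypothetical data: a document is labeled +1 if it
# mentions "ball", otherwise -1.
vocab = ["bias", "ball", "vote", "team", "law"]
docs = [
    {"bias": 1.0, "ball": 1.0, "team": 1.0},
    {"bias": 1.0, "vote": 1.0, "law": 1.0},
    {"bias": 1.0, "ball": 1.0},
    {"bias": 1.0, "law": 1.0},
] * 10
oracle = lambda x: 1 if "ball" in x else -1

committee, queries = active_learn(docs, oracle, vocab)
print(f"labels requested: {queries} of {len(docs)} documents")
```

The point of the sketch is the selection step: documents on which the committee already agrees are skipped without asking for a label, which is how an approach of this kind can reduce the number of labeled training documents.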
Additional Information
  • description.provenance : Approved for entry into archive by Linda Kathman (linda.kathman@oregonstate.edu) on 2009-02-10T15:54:26Z (GMT) No. of bitstreams: 1 Ray_Lierer.pdf: 983104 bytes, checksum: 73cdc0c827d29dfb8c83cd6851c674b6 (MD5)
  • description.provenance : Submitted by Philip Vue (vuep@onid.orst.edu) on 2009-02-09T20:58:23Z No. of bitstreams: 1 Ray_Lierer.pdf: 983104 bytes, checksum: 73cdc0c827d29dfb8c83cd6851c674b6 (MD5)
  • description.provenance : Approved for entry into archive by Linda Kathman (linda.kathman@oregonstate.edu) on 2009-02-10T15:50:48Z (GMT) No. of bitstreams: 1 Ray_Lierer.pdf: 983104 bytes, checksum: 73cdc0c827d29dfb8c83cd6851c674b6 (MD5)
  • description.provenance : Made available in DSpace on 2009-02-10T15:54:27Z (GMT). No. of bitstreams: 1 Ray_Lierer.pdf: 983104 bytes, checksum: 73cdc0c827d29dfb8c83cd6851c674b6 (MD5)

Last modified: 10/20/2017
