Abstract:
We developed and investigated machine learning methods that require
minimal preprocessing of the input data, use few training examples, run fast, and
still obtain high levels of accuracy.
Most approaches to designing machine learning programs are based on the
supervised learning paradigm – training examples are chosen randomly and given
to the learner. We explore the "active learning" paradigm – the learner
automatically selects the more informative training examples. Our domain of
interest is text categorization, but most of the methods developed are quite general.
The purpose of text categorization is to assign each document in a collection
to appropriate categories. Most existing text categorization methods require large
amounts of time to prepare the documents for learning and large numbers of
examples for training. Humans must assign correct categories to documents before
they can be used for training; this costs time and money. Our goal is to develop
machine learning methods that, compared with currently available methods, are
more efficient in time and space, use fewer training documents, and are just
as accurate.
We developed the Active Learning with Committees (ALC) framework, which is
inspired by the Query by Committee approach of Freund, Seung, et al. A
"committee" is a group of learners that jointly participate in learning and in
predicting the classes of new examples. Because we perform minimal preprocessing
of the documents, the domain is noisy and high-dimensional, with large numbers
of irrelevant attributes. We use linear threshold learning algorithms to obtain
computational efficiency with respect to these large numbers of attributes, with
specific algorithms being chosen because they also generalize well when large
numbers of attributes are irrelevant.
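
As a rough illustration of the committee idea, the sketch below shows one way a
committee of linear threshold learners could select its own training examples.
It is not the ALC implementation: it assumes a Winnow-style multiplicative
update as the linear threshold learner, documents represented as lists of
active feature indices, and a disagreement rule under which a label is
requested only when committee members disagree. All function names, the
parameters (alpha, threshold, n_members), and the oracle are hypothetical.

```python
import random


def predict(weights, threshold, doc):
    """Linear threshold unit: 1 if the summed weights of active features exceed threshold."""
    return 1 if sum(weights[i] for i in doc) > threshold else 0


def winnow_update(weights, threshold, doc, label, alpha=2.0):
    """Mistake-driven multiplicative update (Winnow-style assumption).

    Only weights of features active in the document are touched, so the cost
    of an update depends on document length, not on the total feature count.
    """
    if predict(weights, threshold, doc) == label:
        return  # no mistake, no change
    factor = alpha if label == 1 else 1.0 / alpha
    for i in doc:
        weights[i] *= factor


def committee_vote(committee, threshold, doc):
    """Joint prediction: 1 only if a strict majority of members predict 1."""
    votes = sum(predict(w, threshold, doc) for w in committee)
    return 1 if 2 * votes > len(committee) else 0


def active_learn(pool, oracle, n_features, n_members=2, threshold=1.5, seed=0):
    """Scan the pool once, asking the oracle for a label only on disagreement."""
    rng = random.Random(seed)
    # Members differ only in their random initial weights (an assumption).
    committee = [[rng.uniform(0.0, 1.0) for _ in range(n_features)]
                 for _ in range(n_members)]
    n_queries = 0
    for doc in pool:
        predictions = {predict(w, threshold, doc) for w in committee}
        if len(predictions) > 1:  # members disagree: treat example as informative
            label = oracle(doc)   # the costly human labeling step
            n_queries += 1
            for w in committee:
                winnow_update(w, threshold, doc, label)
    return committee, threshold, n_queries


if __name__ == "__main__":
    rng = random.Random(1)
    n_features = 50
    # Toy pool: each "document" has three distinct active features.
    pool = [rng.sample(range(n_features), 3) for _ in range(200)]
    # Hypothetical oracle: a document is positive if feature 0 is active.
    committee, threshold, n_queries = active_learn(
        pool, lambda doc: 1 if 0 in doc else 0, n_features)
    print(f"labels requested: {n_queries} of {len(pool)}")
```

In this sketch the number of labels requested can never exceed the pool size,
and it is smaller whenever the members agree on some documents. How much
smaller depends on the data and on how the committee members are made to differ.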
We developed and analyzed several ALC systems. Our results show that it is
possible to design active learning systems that scale up to large numbers of features
and obtain accuracies comparable to those of supervised learning methods while using
an order of magnitude fewer examples and an order of magnitude less time. The
ALC methods developed have run times on the order of seconds, typically use only
5–7% of the training documents, and are as accurate as their supervised
counterparts.