Abstract:
This thesis presents a case study of applying machine learning tools to build a predictive
model of annual infestations of grasshoppers in Eastern Oregon. The purpose of the
study was two-fold. First, we wanted to develop a predictive model. Second, we wanted
to explore the capabilities of existing machine learning tools and identify areas where
further research was needed.
The study succeeded in constructing a model with modest ability to predict future
grasshopper infestations and provide advice to farmers about whether to treat their
croplands with pesticides. Our analysis of the learned model shows that it should be able
to provide useful advice if the ratio of treatment cost to crop damage cost is 1 to 1.67 or
more. However, there is some evidence that the model is not able to make good
predictions in years with extremely high levels of grasshopper infestation.
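The cost ratio quoted above can be read through a simple expected-cost rule. The sketch below is illustrative only and is not the decision procedure used in the thesis. It assumes that treatment fully prevents crop damage and that the model outputs a probability of infestation. The function names are hypothetical.

```python
def treatment_threshold(treatment_cost: float, damage_cost: float) -> float:
    """Infestation probability above which treating has the lower expected cost.

    Assumption (not from the thesis): treatment fully prevents the damage,
    so treating costs `treatment_cost` and not treating costs
    `p * damage_cost` in expectation.
    """
    if treatment_cost <= 0 or damage_cost <= 0:
        raise ValueError("costs must be positive")
    return treatment_cost / damage_cost


def should_treat(p_infestation: float, treatment_cost: float, damage_cost: float) -> bool:
    """Treat when the expected damage exceeds the cost of treatment."""
    return p_infestation * damage_cost > treatment_cost


# With a treatment-to-damage cost ratio of 1 to 1.67, the break-even
# probability is 1 / 1.67.
print(round(treatment_threshold(1.0, 1.67), 3))  # 0.599
```

Under these assumptions, a larger damage cost relative to treatment cost lowers the break-even probability, so treatment is recommended at lower predicted probabilities of infestation.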
To arrive at this successful model, three critical steps had to be taken. First, we had to
properly formulate the prediction task both in terms of exactly what we were trying to
predict (i.e., the probability of infestation) and the spatial area over which we could make
predictions (i.e., areas within a 6.77 km radius of a weather station). Second, we had to
define and extract a set of features that incorporated knowledge of the grasshopper life
cycle. Third, we had to employ evaluation metrics that were able to measure small
improvements in the quality of predictions.
The study identified important directions for future research. In the area of grasshopper
ecology, there is a need for improved data-gathering tools, including a much denser and
more widespread network of weather stations. These stations should also measure
subsoil temperatures. Recording of the dates of hatching of grasshopper nymphs would
also be very valuable. In machine learning, methods are needed for automating the
definition and extraction of features guided by qualitative knowledge of the domain, such
as our qualitative knowledge of the grasshopper life cycle.
Description:
Abstract
Table of Contents
Chapter 1 Introduction
1.1 The Grasshopper Infestation Prediction Problem
1.2 The Data Mining Process
Chapter 2 Data
2.1 Sources of Data
2.1.1 DBMS
2.1.2 Text file
2.1.3 Binary file
2.2 Imperfections in the data
2.2.1 Missing values
2.2.2 Partially Missing Values
2.2.3 No Value
2.2.4 Noise and Uncertainty
2.2.5 Missing attributes
2.2.6 Dynamic data
Chapter 3 Feature Definition and Extraction
3.1 Issues in Feature Extraction
3.1.1 Reducing Dimensionality
3.1.2 Defining Useful Features
3.1.3 Incorporating Background Knowledge
3.2 Methods for Feature Extraction
3.2.1 Semi-automated Feature Extraction
3.2.2 Manual Feature Extraction
3.2.3 Time and Spatial Feature Extraction
3.2.4 Types of feature extraction
3.2.5 Tuning up the feature
Chapter 4 Learning Algorithms
4.1 The Learning Problem
4.1.1 Training Examples, Classes, etc.
4.2 The Fundamental Tradeoff
4.2.1 Tradeoff between Number of Examples, Size of Hypothesis Space, Accuracy of Result
4.2.2 Overfitting
4.3 Decision Tree Algorithms
4.3.1 Growing
4.3.2 Pruning
4.4 Regression Tree Algorithms
Chapter 5 Grasshopper Infestation Prediction
5.1 Linear Regression and State-wide Models
5.1.1 Modeling and Methods
5.1.2 Results
5.2 Decision Tree and Grid Site Model
5.2.1 Models and Methods
5.2.2 Results
5.3 Regression Tree and Region Site Model
5.3.1 Models and Methods
5.3.2 Results
5.4 Decision Tree and Weather Station Site Model
5.4.1 Models and Methods
5.4.2 Results
5.5 Probabilistic Decision Trees and the Weather Station Site Model
5.5.1 Models and Methods
5.5.2 Results
Chapter 6 Conclusions and Future Improvements
Appendix A: Pictures of Grasshopper Infestation in Eastern Oregon Problem
Appendix B: Summary of Study, Processing and Modeling in the Grasshopper Infestation Project
B.1 Data Sources
B.2 Imperfect Data Treatments
B.3 Data Manipulations and Productions
B.4 Feature Extractions
B.6 Prediction and Evaluation
Appendix C: Feature Set in Weather Station Site Model
C.1 List and Definitions
C.2 Information Gain Performance
References