Graduate Thesis Or Dissertation
 

Construction of Anomaly Scores and Probabilities using Random Sampling : A New Score, Efficient Computation, and Threshold Selection.

Download PDF
https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/37720j60b

Descriptions

Abstract
  • Anomaly detection is the task of identifying observations (points) that differ from the majority of other points, which requires some measure of difference, or distance. Many anomaly detection methods rely on “implicit distance” measures: rather than directly calculating an explicitly defined distance, these approaches quantify a point’s “abnormality” by examining how difficult it is to isolate the point. Here I investigate using explicit distance metrics to quantify the degree to which a point is abnormal. Distance-based methods are computationally expensive and present the additional challenge of selecting a distance metric. Aggarwal et al. [2001] demonstrated theoretically and empirically that a fractional distance metric provides much better separation between points in high-dimensional space. Following Aggarwal et al. [2001], I propose a method, Random Sampling Outlier Score (RSOS), which uses fractional pairwise distances and takes advantage of sub-sampling to reduce the computational complexity of identifying anomalies. I demonstrate that the RSOS method provides comparable or superior performance to other anomaly detection approaches across a variety of data settings. I also investigate the choice of score threshold for calling points anomalies, and find that a model-based clustering approach does a reasonable job of separating anomalies from non-anomalies using the scores I develop. I demonstrate that the proposed method’s computational efficiency can be improved via a random projection data preprocessing step. I show that random forest and model-based clustering can be combined to cluster random forest purity scores into two sets of features: important (selected) features have the highest purity scores, while non-important features have smaller purity scores. In the reduced space, the proposed method runs much faster than in the original feature space.
Finally, I show that a Bayesian mixture model can be used to convert each data point’s anomaly score into a probability that the point is an anomaly.
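
The abstract combines two ideas, fractional pairwise distances and random sub-sampling. The sketch below illustrates how they can fit together. It is not the RSOS definition from the thesis, which this record does not give. The scoring rule (mean distance to the nearest member of each random sub-sample), the function names, and the defaults `p=0.5`, `sample_size=20`, and `n_rounds=10` are all assumptions chosen for illustration.

```python
import random


def fractional_distance(x, y, p=0.5):
    """Minkowski-style distance with a fractional exponent 0 < p < 1.

    For p < 1 the triangle inequality no longer holds, so this is a
    dissimilarity measure rather than a true metric.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def subsample_scores(points, sample_size=20, n_rounds=10, p=0.5, seed=0):
    """Assumed scoring rule (not taken from the thesis): average, over
    several random sub-samples, of the distance from each point to its
    nearest sub-sample member. Larger scores suggest more isolated points.
    """
    n = len(points)
    if n < 3 or sample_size < 2:
        raise ValueError("need at least 3 points and sample_size >= 2")
    rng = random.Random(seed)
    k = min(sample_size, n)
    scores = [0.0] * n
    for _ in range(n_rounds):
        sample_idx = rng.sample(range(n), k)
        for i, x in enumerate(points):
            # Skip the point itself so its score is never trivially zero.
            dists = [fractional_distance(x, points[j], p)
                     for j in sample_idx if j != i]
            scores[i] += min(dists)
    return [s / n_rounds for s in scores]


# Toy data: 200 Gaussian points in 5 dimensions plus one planted distant point.
rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
data.append([25.0] * 5)

scores = subsample_scores(data)
top = max(range(len(data)), key=lambda i: scores[i])
print("highest-scoring index:", top)
```

Each round costs on the order of n × sample_size distance evaluations, compared with n² for all pairwise distances, which is the kind of saving the abstract attributes to sub-sampling.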
