Original title:
Interpreting and Clustering Outliers with Sapling Random Forests
Authors:
Kopp, Martin ; Pevný, T. ; Holeňa, Martin
Document type:
Papers
Conference/Event:
ITAT 2014. European Conference on Information Technologies - Applications and Theory /14./, Demänovská dolina (SK), 2014-09-25 / 2014-09-29
Year:
2014
Language:
eng
Abstract:
The main objective of outlier detection is to find samples that deviate considerably from the majority. Such outliers, often referred to as anomalies, are increasingly important because they help to uncover interesting events within data. Consequently, a considerable number of statistical and data mining techniques for identifying anomalies have been proposed in the last few years, but only a few works even mention why a given sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of an arbitrary anomaly detector. The explanation is given either as a subset of features in which the sample deviates most, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even further, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once, saving time and human effort. The feasibility of our approach is demonstrated on several synthetic datasets and one real-world dataset.
Keywords:
anomaly detection; anomaly interpretation; clustering; decision trees; feature selection; random forest
Project no.:
GA13-17187S (CEP), GPP103/12/P514 (CEP)
Funding provider:
GA ČR, GA ČR
Host item entry:
ITAT 2014. Information Technologies - Applications and Theory. Part II, ISBN 978-80-87136-19-5
Institution: Institute of Computer Science AS ČR
Document availability information:
Fulltext is available in the digital repository of the Academy of Sciences.
Original record:
http://hdl.handle.net/11104/0236773