Národní úložiště šedé literatury (National Repository of Grey Literature): 10 records found.
Some Robust Estimation Tools for Multivariate Models
Kalina, Jan
Standard procedures of multivariate statistics and data mining for the analysis of multivariate data are known to be vulnerable to the presence of outlying and/or highly influential observations. This paper proposes and investigates specific approaches for two situations. First, we consider clustering of categorical data. While attention has been paid to the sensitivity of standard statistical and data mining methods for categorical data only recently, we aim at modifying standard distance measures between clusters of such data. This allows us to propose a hierarchical agglomerative cluster analysis for two-way contingency tables with a large number of categories, based on a regularized measure of distance between two contingency tables. This proposal improves robustness to measurement errors in categorical data. As a second problem, we investigate the nonlinear version of least weighted squares regression for data with a continuous response. Our aim is to propose an efficient algorithm for the least weighted squares estimator, formulated in a general way that applies to both linear and nonlinear regression. Our numerical study reveals the computational aspects of the algorithm and brings arguments in favor of its credibility.
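To illustrate the idea behind least weighted squares (LWS), the following is a minimal sketch for a straight-line fit. LWS minimizes a weighted sum of the ordered squared residuals, with nonincreasing weights assigned by rank, so the observations that fit worst count least. The iterative reweighting loop, the function names, and the particular weight function `linear_trimmed_weight` are assumptions made for this illustration; this is a simple heuristic and not the algorithm proposed in the paper.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def linear_trimmed_weight(rank, n):
    """Hypothetical weight function: decreases linearly with the rank of the
    squared residual and is zero for roughly the worst quarter of the data."""
    h = (3 * n) // 4 + 1
    return max(0.0, 1.0 - rank / h)


def lws_line(xs, ys, weight_of_rank=linear_trimmed_weight, n_iter=20):
    """Heuristic LWS fit: alternate between fitting and re-ranking residuals."""
    n = len(xs)
    ws = [1.0] * n  # start from ordinary least squares
    for _ in range(n_iter):
        a, b = weighted_line_fit(xs, ys, ws)
        sq = [(y - a - b * x) ** 2 for x, y in zip(xs, ys)]
        # rank 0 = smallest squared residual, rank n-1 = largest
        order = sorted(range(n), key=lambda i: sq[i])
        new_ws = [0.0] * n
        for rank, i in enumerate(order):
            new_ws[i] = weight_of_rank(rank, n)
        if new_ws == ws:  # weights stopped changing
            break
        ws = new_ws
    return a, b  # the most recently computed fit


# Points on the line y = 1 + 2x, with one gross outlier at x = 9.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 100.0
a, b = lws_line(xs, ys)
```

In this toy data set the outlier keeps the largest squared residual and therefore receives zero weight, so the fit is determined by the points that lie on the line. For a nonlinear model, the weighted fitting step would be replaced by a weighted nonlinear least squares solver.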
On the Consistency of an Estimator for Hierarchical Archimedean Copulas
Górecki, J. ; Hofert, M. ; Holeňa, Martin
The paper addresses an estimation procedure for hierarchical Archimedean copulas, which has been proposed in the literature. It is shown here that this estimation is not consistent in general. Furthermore, a correction is proposed, which leads to a consistent estimator.
Online System for Fire Danger Rating in Colorado
Vejmelka, Martin ; Kochanski, A. ; Mandel, J.
A method for the data assimilation of fuel moisture surface observations has been developed for the purpose of incorporation in wildfire forecasting and fire danger rating. In this work, we describe the method itself and also an online computer system that implements the method and combines it with the Real-Time Mesoscale Analysis to track local weather conditions and estimate the fuel moisture content in the state of Colorado. We discuss the construction of the system and future development.
Case Study in Approaches to the Classification of Audiovisual Recordings of Lectures and Conferences
Pulc, P. ; Holeňa, Martin
Several methods for the classification of semistructured documents already exist, and so do classifiers for the individual modalities of multimedia content. However, every classifier can behave differently on different data modalities and can be differently appropriate for classification of the considered multimedia content as a whole. Because of that, relying on a single classifier or on a static weighting of the classifications of individual modalities is not adequate. The present paper describes a case study in searching for suitable classification methods, and in investigating appropriate methods for the aggregation of their results to determine the final class of a lecture or conference recording.
Explaining Anomalies with Sapling Random Forests
Pevný, T. ; Kopp, Martin
The main objective of anomaly detection algorithms is finding samples that deviate from the majority. Although a vast number of algorithms designed for this already exist, almost none of them explain why a particular sample was labelled as an anomaly. To address this issue, we propose an algorithm called Explainer, which returns an explanation of how the sample differs in disjunctive normal form (DNF), which is easy for humans to understand. Since Explainer treats anomaly detection algorithms as black boxes, it can be applied in many domains to simplify the investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real-world datasets.
Interpreting and Clustering Outliers with Sapling Random Forests
Kopp, Martin ; Pevný, T. ; Holeňa, Martin
The main objective of outlier detection is finding samples that deviate considerably from the majority. Such outliers, often referred to as anomalies, are becoming more and more important, because they help to uncover interesting events within data. Consequently, a considerable number of statistical and data mining techniques to identify anomalies have been proposed in the last few years, but only a few works have even mentioned why a given sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of an arbitrary anomaly detector. The explanation is given as a subset of features in which the sample deviates most, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even more, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once, saving time and human effort. The feasibility of our approach is demonstrated on several synthetic datasets and one real-world dataset.
Robustness of High-Dimensional Data Mining
Kalina, Jan ; Duintjer Tebbens, Jurjen ; Schlenker, Anna
Standard data mining procedures are sensitive to the presence of outlying measurements in the data. This work aims to propose robust versions of some existing data mining procedures, i.e. methods resistant to outliers. In the area of classification analysis, we propose a new robust method based on a regularized version of the minimum weighted covariance determinant estimator. The method is suitable for data with the number of variables exceeding the number of observations. The method is based on implicit weights assigned to individual observations. Our approach is a unique attempt to combine regularization and high robustness, which makes it possible to downweight outlying high-dimensional observations. The classification performance of the new methods and some ideas concerning classification analysis of high-dimensional data are illustrated on real raw data as well as on data contaminated by severe outliers.
Important Markov-Chain Properties of (1,lambda)-ES Linear Optimization Models
Chotard, A. ; Holeňa, Martin
Several recent publications investigated Markov-chain modelling of linear optimization by a (1,lambda)-ES, considering both unconstrained and linearly constrained optimization, and both constant and varying step size. All of them assume normality of the involved random steps. This is a very strong and specific assumption. The objective of our contribution is to show that in the constant step size case, valuable properties of the Markov chain can be obtained even for steps with substantially more general distributions. Several results that have been previously proved using the normality assumption are proved here in a more general way without that assumption. Finally, the decomposition of a multidimensional distribution into its marginals and the copula combining them is applied to the new distributional assumptions, particular attention being paid to distributions with Archimedean copulas.
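For readers unfamiliar with the setting, the following is a minimal sketch of a (1,lambda)-ES with constant step size on an unconstrained linear function, here assumed to be f(x) = x[0] and maximized. The function names, the choice of objective, and the two example step distributions are assumptions for illustration only; the sketch shows how the step distribution can be swapped for a non-normal one and does not reproduce the Markov-chain analysis of the paper.

```python
import random


def one_comma_lambda_es(x0, n_iter, lam, sigma, sample_step, rng):
    """Maximize the linear function f(x) = x[0] with a (1,lambda)-ES.

    sample_step(rng, dim) returns one random step vector; sigma is the
    constant step size. Returns the list of parents, starting with x0.
    """
    x = list(x0)
    path = [x]
    for _ in range(n_iter):
        offspring = []
        for _ in range(lam):
            step = sample_step(rng, len(x))
            offspring.append([xi + sigma * si for xi, si in zip(x, step)])
        # Comma selection: the parent is discarded and the best of the
        # lambda offspring becomes the next parent.
        x = max(offspring, key=lambda y: y[0])
        path.append(x)
    return path


def gaussian_step(rng, dim):
    """Standard normal steps, the assumption made in earlier analyses."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def laplace_step(rng, dim):
    """A non-normal alternative: the difference of two independent
    exponential variables follows a Laplace distribution."""
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(dim)]


path = one_comma_lambda_es([0.0, 0.0], 30, 6, 0.5, laplace_step, random.Random(1))
```

Because only the current parent and fresh random steps determine the next parent, the sequence of parents forms a Markov chain, which is the object studied in the abstract above.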
Towards Low-Dimensional Gaussian Process Metamodels for CMA-ES
Bajer, Lukáš ; Holeňa, Martin
Gaussian processes and kriging models have attracted the attention of researchers from different areas of black-box optimization, especially since Jones’ introduction of the Efficient Global Optimization (EGO) algorithm. However, current implementations of the EGO and its real-world applications are rather few. We conjecture that the EGO is not suitable for higher-dimensional optimization and investigate whether hybridization of a low-dimensional local optimization with the current state-of-the-art continuous black-box optimizer CMA-ES (Covariance Matrix Adaptation Evolution Strategy) could help. In this paper, only a first proposal of such a GP/CMA-ES connection is described and some preliminary tests are presented.
Robust Regularized Cluster Analysis for High-Dimensional Data
Kalina, Jan ; Vlčková, Katarína
This paper presents new approaches to hierarchical agglomerative cluster analysis for high-dimensional data. First, we propose a regularized version of hierarchical cluster analysis for categorical data with a large number of categories. It uses regularized versions of various test statistics of homogeneity in contingency tables as the measure of distance between two clusters. Further, we aim at cluster analysis of continuous data with a large number of variables. Various regularization techniques tailor-made for high-dimensional data have been proposed, but they have turned out to be highly sensitive to the presence of outlying measurements in the data. As a robust solution, we recommend combining two newly proposed methods, namely a regularized version of robust principal component analysis and a regularized Mahalanobis distance, which is based on an asymptotically optimal regularization of the covariance matrix. We bring arguments in favor of the newly proposed methods.
