National Repository of Grey Literature 3 records found  Search took 0.00 seconds. 
Hodnocení Výsledků Fuzzy Shlukování
Říhová, Elena ; Pecáková, Iva (advisor) ; Řezanková, Hana (referee) ; Žambochová, Marta (referee)
Cluster analysis is a multivariate statistical classification method, implying different methods and procedures. Clustering methods can be divided into hard and fuzzy; the latter one provides a more precise picture of the information by clustering objects than hard clustering. But in practice, the optimal number of clusters is not known a priori, and therefore it is necessary to determine the optimal number of clusters. To solve this problem, the validity indices help us. However, there are many different validity indices to choose from. One of the goals of this work is to create a structured overview of existing validity indices and techniques for evaluating fuzzy clustering results in order to find the optimal number of clusters. The main aim was to propose a new index for evaluating the fuzzy clustering results, especially in cases with a large number of clusters (defined as more than five). The newly designed coefficient is based on the degrees of membership and on the distance (Euclidean distance) between the objects, i.e. based on principles from both fuzzy and hard clustering. The suitability of selected validity indices was applied on real and generated data sets with known optimal number of clusters a priory. These data sets have different sizes, different numbers of variables, and different numbers of clusters. The aim of the current work is regarded as fulfilled. A key contribution of this work was a new coefficient (E), which is appropriate for evaluating situations with both large and small numbers of clusters. Because the new validity index is based on the principles of both fuzzy clustering and hard clustering, it is able to correctly determine the optimal number of clusters on both small and large data sets. A second contribution of this research was a structured overview of existing validity indices and techniques for evaluating the fuzzy clustering results.
Míry podobnosti pro nominální data v hierarchickém shlukování
Šulc, Zdeněk ; Řezanková, Hana (advisor) ; Šimůnek, Milan (referee) ; Žambochová, Marta (referee)
This dissertation thesis deals with similarity measures for nominal data in hierarchical clustering, which can cope with variables with more than two categories, and which aspire to replace the simple matching approach standardly used in this area. These similarity measures take into account additional characteristics of a dataset, such as frequency distribution of categories or number of categories of a given variable. The thesis recognizes three main aims. The first one is an examination and clustering performance evaluation of selected similarity measures for nominal data in hierarchical clustering of objects and variables. To achieve this goal, four experiments dealing both with the object and variable clustering were performed. They examine the clustering quality of the examined similarity measures for nominal data in comparison with the commonly used similarity measures using a binary transformation, and moreover, with several alternative methods for nominal data clustering. The comparison and evaluation are performed on real and generated datasets. Outputs of these experiments lead to knowledge, which similarity measures can generally be used, which ones perform well in a particular situation, and which ones are not recommended to use for an object or variable clustering. The second aim is to propose a theory-based similarity measure, evaluate its properties, and compare it with the other examined similarity measures. Based on this aim, two novel similarity measures, Variable Entropy and Variable Mutability are proposed; especially, the former one performs very well in datasets with a lower number of variables. The third aim of this thesis is to provide a convenient software implementation based on the examined similarity measures for nominal data, which covers the whole clustering process from a computation of a proximity matrix to evaluation of resulting clusters. This goal was also achieved by creating the nomclust package for the software R, which covers this issue, and which is freely available.
Cluster analysis of large data sets: new procedures based on the method k-means
Žambochová, Marta ; Řezanková, Hana (advisor) ; Húsek, Dušan (referee) ; Antoch, Jaromír (referee)
Abstract Cluster analysis has become one of the main tools used in extracting knowledge from data, which is known as data mining. In this area of data analysis, data of large dimensions are often processed, both in the number of objects and in the number of variables, which characterize the objects. Many methods for data clustering have been developed. One of the most widely used is a k-means method, which is suitable for clustering data sets containing large number of objects. It is based on finding the best clustering in relation to the initial distribution of objects into clusters and subsequent step-by-step redistribution of objects belonging to the clusters by the optimization function. The aim of this Ph.D. thesis was a comparison of selected variants of existing k-means methods, detailed characterization of their positive and negative characte- ristics, new alternatives of this method and experimental comparisons with existing approaches. These objectives were met. I focused on modifications of the k-means method for clustering of large number of objects in my work, specifically on the algorithms BIRCH k-means, filtering, k-means++ and two-phases. I watched the time complexity of algorithms, the effect of initialization distribution and outliers, the validity of the resulting clusters. Two real data files and some generated data sets were used. The common and different features of method, which are under investigation, are summarized at the end of the work. The main aim and benefit of the work is to devise my modifications, solving the bottlenecks of the basic procedure and of the existing variants, their programming and verification. Some modifications brought accelerate the processing. The application of the main ideas of algorithm k-means++ brought to other variants of k-means method better results of clustering. The most significant of the proposed changes is a modification of the filtering algorithm, which brings an entirely new feature of the algorithm, which is the detection of outliers. The accompanying CD is enclosed. It includes the source code of programs written in MATLAB development environment. Programs were created specifically for the purpose of this work and are intended for experimental use. The CD also contains the data files used for various experiments.