National Repository of Grey Literature 11 records found  1 - 10next  jump to record: Search took 0.00 seconds. 
Methods for class prediction with high-dimensional gene expression data
Šilhavá, Jana ; Matula, Petr (referee) ; Železný, Filip (referee) ; Smrž, Pavel (advisor)
Dizertační práce se zabývá predikcí vysokodimenzionálních dat genových expresí. Množství dostupných genomických dat významně vzrostlo v průběhu posledního desetiletí. Kombinování dat genových expresí s dalšími daty nachází uplatnění v mnoha oblastech. Například v klinickém řízení rakoviny (clinical cancer management) může přispět k přesnějšímu určení prognózy nemocí. Hlavní část této dizertační práce je zaměřena na kombinování dat genových expresí a klinických dat. Používáme logistické regresní modely vytvořené prostřednictvím různých regularizačních technik. Generalizované lineární modely umožňují kombinování modelů s různou strukturou dat. V dizertační práci je ukázáno, že kombinování modelu dat genových expresí a klinických dat může vést ke zpřesnění výsledku predikce oproti vytvoření modelu pouze z dat genových expresí nebo klinických dat. Navrhované postupy přitom nejsou výpočetně náročné.  Testování je provedeno nejprve se simulovanými datovými sadami v různých nastaveních a následně s~reálnými srovnávacími daty. Také se zde zabýváme určením přídavné hodnoty microarray dat. Dizertační práce obsahuje porovnání příznaků vybraných pomocí klasifikátoru genových expresí na pěti různých sadách dat týkajících se rakoviny prsu. Navrhujeme také postup výběru příznaků, který kombinuje data genových expresí a znalosti z genových ontologií.
High-performance exploration and querying of selected multi-dimensional spaces in life sciences
Kratochvíl, Miroslav ; Bednárek, David (advisor) ; Glaab, Enrico (referee) ; Svozil, Daniel (referee)
This thesis studies, implements and experiments with specific application-oriented approaches for exploring and querying multi-dimensional datasets. The first part of the thesis scrutinizes indexing of the complex space of chemical compounds, and details a design of high-performance retrieval system for small molecules. The resulting system is then utilized within a wider context of federated search in heterogeneous data and metadata related to the chemical datasets. In the second part, the thesis focuses on fast visualization and exploration of many-dimensional data that originate from single- cell cytometry. Self-organizing maps are used to derive fast methods for analysis of the datasets, and used as a base for a novel data visualization algorithm. Finally, a similar approach is utilized for highly interactive exploration of multimedia datasets. The main contributions of the thesis comprise the advancement in optimization and methods for querying the chemical data implemented in the Sachem database cartridge, the federated, SPARQL-based interface to Sachem that provides the heterogeneous search support, dimensionality reduction algorithm EmbedSOM, design and implementation of the specific EmbedSOM-backed analysis tool for flow and mass cytometry, and design and implementation of the multimedia...
Interactive clustering approaches in single-cell cytometry
Urban, Nicole Aemilia ; Šmelko, Adam (advisor) ; Stuchlý, Jan (referee)
Flow cytometry allows inexpensive monitoring of large and diverse cell populations using fluorescent markers, providing immense applications in studying biological properties of blood and tissues as well as diagnostics in the clinical setting. Recent methodological advances highlight automatic clustering as a tool of choice for data analysis, and many clustering algo- rithms were developed for various use cases. However, the applicability of such algorithms in biology and medicine remains challenging unless the tools expose user-friendly, interactive interfaces that are accessible to domain ex- perts. The goal of the thesis is to review the available methods that allow such interaction and supervision of the clustering process by the user, specif- ically focusing on interfaces desirable in clinical setting that does not require the user to interact with programming environments. As the main practi- cal result, the thesis should design a new tool that builds upon previously developed methodology (iDendro, gMHCA), allowing the application of the researched methodology on real datasets. By using proper data visualization techniques, the end user should be able to interact with the dataset in a way that is both intuitive and useful for producing biologically relevant results.
High-performance exploration and querying of selected multi-dimensional spaces in life sciences
Kratochvíl, Miroslav ; Bednárek, David (advisor) ; Glaab, Enrico (referee) ; Svozil, Daniel (referee)
This thesis studies, implements and experiments with specific application-oriented approaches for exploring and querying multi-dimensional datasets. The first part of the thesis scrutinizes indexing of the complex space of chemical compounds, and details a design of high-performance retrieval system for small molecules. The resulting system is then utilized within a wider context of federated search in heterogeneous data and metadata related to the chemical datasets. In the second part, the thesis focuses on fast visualization and exploration of many-dimensional data that originate from single- cell cytometry. Self-organizing maps are used to derive fast methods for analysis of the datasets, and used as a base for a novel data visualization algorithm. Finally, a similar approach is utilized for highly interactive exploration of multimedia datasets. The main contributions of the thesis comprise the advancement in optimization and methods for querying the chemical data implemented in the Sachem database cartridge, the federated, SPARQL-based interface to Sachem that provides the heterogeneous search support, dimensionality reduction algorithm EmbedSOM, design and implementation of the specific EmbedSOM-backed analysis tool for flow and mass cytometry, and design and implementation of the multimedia...
Regression for High-Dimensional Data: From Regularization to Deep Learning
Kalina, Jan ; Vidnerová, Petra
Regression modeling is well known as a fundamental task in current econometrics. However, classical estimation tools for the linear regression model are not applicable to highdimensional data. Although there is not an agreement about a formal definition of high dimensional data, usually these are understood either as data with the number of variables p exceeding (possibly largely) the number of observations n, or as data with a large p in the order of (at least) thousands. In both situations, which appear in various field including econometrics, the analysis of the data is difficult due to the so-called curse of dimensionality (cf. Kalina (2013) for discussion). Compared to linear regression, nonlinear regression modeling with an unknown shape of the relationship of the response on the regressors requires even more intricate methods.
GPU-accelerated Mahalanobis-average hierarchical clustering
Šmelko, Adam ; Kratochvíl, Miroslav (advisor) ; Hric, Jan (referee)
Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed, that uses cluster linkage based on Mahalanobis distance to produce results better suited for the domain. Applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This thesis describes a specialized, GPU-accelerated version of the Mahalanobis-average linked hierarchical clustering, which improves the algorithm performance by several orders of magnitude, thus allowing it to scale to much larger datasets. The thesis provides an overview of current hierarchical clustering algorithms, and details the construction of the variant used on GPU. The result is benchmarked on publicly available high-dimensional data from mass cytometry.
Robust Regularized Discriminant Analysis Based on Implicit Weighting
Kalina, Jan ; Hlinka, Jaroslav
In bioinformatics, regularized linear discriminant analysis is commonly used as a tool for supervised classification problems tailormade for high-dimensional data with the number of variables exceeding the number of observations. However, its various available versions are too vulnerable to the presence of outlying measurements in the data. In this paper, we exploit principles of robust statistics to propose new versions of regularized linear discriminant analysis suitable for highdimensional data contaminated by (more or less) severe outliers. The work exploits a regularized version of the minimum weighted covariance determinant estimator, which is one of highly robust estimators of multivariate location and scatter. The performance of the novel classification methods is illustrated on real data sets with a detailed analysis of data from brain activity research.
Fulltext: content.csg - Download fulltextPDF
Plný tet: v1241-16 - Download fulltextPDF
Methods for class prediction with high-dimensional gene expression data
Šilhavá, Jana ; Matula, Petr (referee) ; Železný, Filip (referee) ; Smrž, Pavel (advisor)
Dizertační práce se zabývá predikcí vysokodimenzionálních dat genových expresí. Množství dostupných genomických dat významně vzrostlo v průběhu posledního desetiletí. Kombinování dat genových expresí s dalšími daty nachází uplatnění v mnoha oblastech. Například v klinickém řízení rakoviny (clinical cancer management) může přispět k přesnějšímu určení prognózy nemocí. Hlavní část této dizertační práce je zaměřena na kombinování dat genových expresí a klinických dat. Používáme logistické regresní modely vytvořené prostřednictvím různých regularizačních technik. Generalizované lineární modely umožňují kombinování modelů s různou strukturou dat. V dizertační práci je ukázáno, že kombinování modelu dat genových expresí a klinických dat může vést ke zpřesnění výsledku predikce oproti vytvoření modelu pouze z dat genových expresí nebo klinických dat. Navrhované postupy přitom nejsou výpočetně náročné.  Testování je provedeno nejprve se simulovanými datovými sadami v různých nastaveních a následně s~reálnými srovnávacími daty. Také se zde zabýváme určením přídavné hodnoty microarray dat. Dizertační práce obsahuje porovnání příznaků vybraných pomocí klasifikátoru genových expresí na pěti různých sadách dat týkajících se rakoviny prsu. Navrhujeme také postup výběru příznaků, který kombinuje data genových expresí a znalosti z genových ontologií.
Some Robust Estimation Tools for Multivariate Models
Kalina, Jan
Standard procedures of multivariate statistics and data mining for the analysis of multivariate data are known to be vulnerable to the presence of outlying and/or highly influential observations. This paper has the aim to propose and investigate specific approaches for two situations. First, we consider clustering of categorical data. While attention has been paid to sensitivity of standard statistical and data mining methods for categorical data only recently, we aim at modifying standard distance measures between clusters of such data. This allows us to propose a hierarchical agglomerative cluster analysis for two-way contingency tables with a large number of categories, based on a regularized measure of distance between two contingency tables. Such proposal improves the robustness to the presence of measurement errors for categorical data. As a second problem, we investigate the nonlinear version of the least weighted squares regression for data with a continuous response. Our aim is to propose an efficient algorithm for the least weighted squares estimator, which is formulated in a general way applicable to both linear and nonlinear regression. Our numerical study reveals the computational aspects of the algorithm and brings arguments in favor of its credibility.
Robustness of High-Dimensional Data Mining
Kalina, Jan ; Duintjer Tebbens, Jurjen ; Schlenker, Anna
Standard data mining procedures are sensitive to the presence of outlying measurements in the data. This work has the aim to propose robust versions of some existing data mining procedures, i.e. methods resistant to outliers. In the area of classification analysis, we propose a new robust method based on a regularized version of the minimum weighted covariance determinant estimator. The method is suitable for data with the number of variables exceeding the number of observations. The method is based on implicit weights assigned to individual observations. Our approach is a unique attempt to combine regularization and high robustness, allowing to downweight outlying high-dimensional observations. Classification performance of new methods and some ideas concerning classification analysis of high-dimensional data are illustrated on real raw data as well as on data contaminated by severe outliers.

National Repository of Grey Literature : 11 records found   1 - 10next  jump to record:
Interested in being notified about new results for this query?
Subscribe to the RSS feed.