Národní úložiště šedé literatury (National Repository of Grey Literature)
Machine Learning-based Prediction of Mutational Effects on Protein Immunogenicity
Lacko, Dávid ; Martínek, Tomáš (reviewer) ; Musil, Miloš (thesis supervisor)
The immune system is vital to human survival because it protects the body against pathogens. This ability stems from molecular mechanisms that recognize non-human proteins and molecules. While this system is critical for survival, it hampers the use of non-human proteins as biotherapeutics, many of which have already demonstrated significant potential in healthcare. To exploit this potential, the immune system must not attack and inactivate the proteins. Therefore, it is often necessary to engineer these proteins to reduce their immunogenicity and avoid early detection by the immune system. To this end, scientists introduce mutations into a protein of interest to lower the immune response. Large-scale experimental validation of such mutations is typically infeasible because of the enormous size of the combinatorial space to explore. Machine learning tools can accelerate this process and significantly reduce the total development cost by scoring the mutations in silico first and experimentally validating only a short-listed subset of viable designs. However, machine-learning-based tools for predicting such mutational effects remain largely unexplored. To address this challenge, we present a novel dataset focused on the effect of mutations on epitopes, the protein regions that trigger an immune response. The newly collected dataset contains epitopes, their single- and double-point mutations, and the effect of these mutations on immunogenicity as labels. By leveraging this dataset and recent advances in large language models for protein engineering, we train a set of machine-learning-based models that classify mutations by their effect on immunogenicity, showing a significant improvement in performance over the baselines. Additionally, we investigate and present a way to separate the dataset into train-test splits that minimize data leakage between the splits. This leads to a more robust evaluation of the real-world performance of models trained on this data.
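The abstract does not say how the leakage-minimizing splits were built. One common approach, sketched here purely as an assumption, is to keep every mutant of the same parent epitope in the same split, so that near-identical sequences never appear on both sides. The function name `grouped_split`, the record fields, and the toy peptides and labels below are all hypothetical and are not taken from the thesis dataset.

```python
import random
from collections import defaultdict


def grouped_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that each group lands entirely in train or in test."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record)

    # Shuffle the group keys (not the records) so whole groups move together.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Toy, invented records: parent epitope, mutant sequence, made-up label.
records = [
    {"epitope": "AAAAK", "mutant": "AAGAK", "label": 0},
    {"epitope": "AAAAK", "mutant": "AATAK", "label": 1},
    {"epitope": "CCDDE", "mutant": "CCDDA", "label": 1},
    {"epitope": "CCDDE", "mutant": "ACDDE", "label": 0},
    {"epitope": "GGHHI", "mutant": "GGHHL", "label": 0},
    {"epitope": "KKLLM", "mutant": "KRLLM", "label": 1},
]

train, test = grouped_split(records, lambda r: r["epitope"], test_fraction=0.25)
```

Grouping by exact parent sequence is the simplest choice; a stricter variant would first cluster epitopes by sequence similarity and use the cluster identifier as the group key.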
Platform for Biological Sequence Analysis Using Machine Learning
Lacko, Dávid ; Burgetová, Ivana (reviewer) ; Martínek, Tomáš (thesis supervisor)
Protein characterisation is an active area of machine learning: experimental annotation is usually costly and time-consuming, and many datasets suitable for training predictors are currently being published. One recent method, called innov'SAR, combines the Fourier transform with partial least squares regression and has been used in several protein engineering applications. However, the code for the method is not freely available, and the method itself has not been statistically verified. The goal of this thesis is to address these limitations by implementing and extending the method in Python as an easy-to-use platform that allows training and testing of models. The extensions include parallelization, a Spearman scoring function, and aligned sequence input. Statistical significance testing is also performed to verify the dependencies found between the input sequences and the properties of the proteins. The method proved to be statistically significant, with strong dependencies found between inputs and outputs. Two newly collected haloalkane dehalogenase datasets were used to train models, which reached cross-validation scores of Q2 = 0.54 and Q2 = 0.77, almost a twofold improvement over the baseline models. The resulting models allow filtering of large sequence databases and scanning for potential improvements in protein properties.
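The feature-extraction idea described above, encoding a sequence numerically and taking its Fourier spectrum, can be illustrated with a minimal sketch. It is not the thesis implementation: the choice of the Kyte-Doolittle hydropathy scale as the numerical encoding is an assumption made here for illustration, the function names are hypothetical, and the regression step that would consume these features is omitted.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an example encoding;
# the actual method may use different amino acid indices.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue to a number, turning the sequence into a signal."""
    return [HYDROPATHY[residue] for residue in sequence.upper()]


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectrum_features(sequence):
    """Feature vector for one protein sequence: its encoded spectrum."""
    return dft_magnitudes(encode(sequence))


features = spectrum_features("ACDE")
```

Sequences of equal length (for example, aligned variants of one protein) yield feature vectors of equal length, which is what a downstream regression model would need. For real data, a fast Fourier transform from a numerical library would replace the naive loop.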
