National Repository of Grey Literature 
Difference in amino acid distribution in sequences of structured and unstructured proteins
Sotáková, Patrícia ; Vondrášek, Jiří (advisor) ; Sanchez Rocha, Alma Carolina (referee)
Disordered proteins are a topic of growing interest. Building on ongoing research into the relationship between sequence and structure, this work investigates features of an amino acid sequence that could serve as fingerprints of structured or disordered proteins. These fingerprints could deepen our understanding of disordered regions and of protein folding. Furthermore, this knowledge could help in designing new deep-learning predictors of protein disorder or tools for protein domain recognition. Statistical analysis was performed on sequences obtained from the Protein Data Bank (PDB) and the DisProt database, including a comparison of protein sequences with artificial ones generated under the assumption of pairwise independence of amino acids. Subsequently, we identified triples, each consisting of two amino acids and the distance between them, whose occurrence differs significantly from that in the artificial set. Based on this analysis, we sorted the triples into the following categories: overestimated, random, and underestimated. Depending on the dataset, observed pairs with an abnormal frequency at a given distance can be interpreted as a fingerprint of a secondary structure, motif, or domain, or as some other, as yet unknown, identifier of disordered proteins. A simple example of a sequence fingerprint was observed in the PDB dataset; the abundance of histidines in...
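The comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: instead of generating artificial sequences, it computes the expected pair counts analytically from the overall amino acid composition, and it applies no significance test or category thresholds, since the abstract does not specify them. The function names `pair_counts`, `expected_counts`, and `observed_to_expected` are hypothetical.

```python
from collections import Counter
from itertools import product


def pair_counts(sequences, distance):
    """Count ordered pairs (a, b) where b occurs `distance` positions after a."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - distance):
            counts[(seq[i], seq[i + distance])] += 1
    return counts


def expected_counts(sequences, distance):
    """Expected pair counts if residues were independent draws
    from the overall composition of the dataset (an assumption of this sketch)."""
    composition = Counter("".join(sequences))
    total = sum(composition.values())
    # Number of positions at which a pair at this distance can be observed.
    n_pairs = sum(max(len(seq) - distance, 0) for seq in sequences)
    return {
        (a, b): n_pairs * (composition[a] / total) * (composition[b] / total)
        for a, b in product(composition, repeat=2)
    }


def observed_to_expected(sequences, distance):
    """Ratio of observed to expected count for every (a, b) pair at `distance`."""
    observed = pair_counts(sequences, distance)
    expected = expected_counts(sequences, distance)
    return {
        pair: observed.get(pair, 0) / exp
        for pair, exp in expected.items()
        if exp > 0
    }
```

Calling `observed_to_expected(sequences, d)` for a range of distances `d` yields one ratio per (amino acid, amino acid, distance) triple. Ratios well above or below 1 mark candidate triples; deciding which of them are significant would still require a statistical test of the kind the thesis describes.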

