National Repository of Grey Literature: 23 records found (showing 1-10)
Textual Ciphers as a Tool for Better Understanding the Transformers
Provazník, Jan ; Libovický, Jindřich (advisor) ; Kasner, Zdeněk (referee)
The Transformer architecture is very popular, so it is potentially impactful to interpret what influences its performance. We test the hypothesis that the model relies on the linguistic properties of a text when working with it. We remove interference from cultural aspects of meaning by using a character-level task with the ByT5 Transformer model. We train ByT5 to decipher sentences encrypted with text ciphers (Vigenère, Enigma). We annotate a sentence dataset with linguistic properties using published NLP tools. On this dataset, we study the relationships between the linguistic properties and the fine-tuned ByT5's decipherment error rate. We analyze correlations, train ML models to predict error rates from the properties, and interpret them with SHAP. We find small but significant correlations, yet cannot predict error rates from the properties. We conclude that the properties we identified do not give much insight into the performance of the Transformer.
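To make the character-level task concrete, here is a minimal Python sketch of Vigenère encryption and decryption over the Latin alphabet; the exact alphabet, preprocessing, and Enigma setup used in the thesis are assumptions not covered here.

```python
def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Encrypt or decrypt text with a Vigenère cipher.

    Each letter is shifted by the corresponding key letter; non-letters
    pass through unchanged and do not advance the key position.
    """
    sign = -1 if decrypt else 1
    out, pos = [], 0
    for ch in text:
        if ch.isalpha():
            shift = ord(key[pos % len(key)].lower()) - ord("a")
            out.append(chr((ord(ch.lower()) - ord("a") + sign * shift) % 26 + ord("a")))
            pos += 1
        else:
            out.append(ch)
    return "".join(out)

# Round trip: the ciphertext deciphers back to the plaintext.
cipher = vigenere("attack at dawn", "lemon")
assert vigenere(cipher, "lemon", decrypt=True) == "attack at dawn"
```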
Writing assistant based on large language models
Klement, David ; Helcl, Jindřich (advisor) ; Libovický, Jindřich (referee)
A standard approach to many natural language processing tasks is to take an existing, pre-trained large language model and fine-tune it for the given task. Such an approach leads to a separate model for each task; furthermore, the fine-tuning must be repeated when upgrading to a new pre-trained model. This thesis explores the possibility of using a single off-the-shelf model for three different tasks without fine-tuning. We present Preditor, a writing assistant that supports rewriting a sentence after replacing one of its words, suggesting continuations, and suggesting words that fit into a sentence. We design the system in a model-agnostic way, making it possible to upgrade to a new model with little effort. We also provide an extension that integrates the assistant into a text editor.
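As a hedged illustration of the "words that fit into a sentence" task with an off-the-shelf causal LM, the sketch below ranks candidate words by their log-probability in context; the model name and scoring strategy are assumptions, not Preditor's actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def infill_score(prefix: str, candidate: str, suffix: str) -> float:
    """Average log-probability of candidate + suffix given prefix.

    Note: subword boundaries at the prefix/candidate seam make this
    approximate; it suffices for ranking candidate words.
    """
    prefix_len = len(tokenizer(prefix)["input_ids"])
    ids = tokenizer(prefix + candidate + suffix, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    scores = log_probs[torch.arange(len(targets)), targets]
    return scores[prefix_len - 1:].mean().item()  # score only the infill

candidates = ["quick", "lazy", "purple"]
best = max(candidates, key=lambda w: infill_score("The", " " + w, " fox jumps."))
```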
Document embedding using Transformers
Burian, David ; Libovický, Jindřich (advisor) ; Variš, Dušan (referee)
We develop a method to train a document embedding model with an unlabeled dataset and low computational resources. Using teacher-student training, we distill SBERT's capacity to capture text structure and Paragraph Vector's ability to encode extended context into the resulting embedding model. We test our method on Longformer, a Transformer model with sparse attention that can process up to 4096 tokens. We explore several loss functions for the distillation of knowledge from the two teachers (SBERT and Paragraph Vector) to our student model (Longformer). Through our experiments, we show that despite SBERT's short maximum context, its distillation is more critical to the student's performance; however, the student model can benefit from both teachers. Our method improves Longformer's performance on eight downstream tasks, including citation prediction, plagiarism detection, and similarity search. It shows exceptional performance when little fine-tuning data is available, where the trained student model outperforms both teacher models. By showing consistent performance of differently configured student models, we demonstrate our method's robustness to various changes and suggest areas for future work.
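A minimal sketch of the teacher-student objective described above, assuming a simple MSE loss against each teacher; the thesis explores several loss variants, so the fixed weighting here is purely illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      sbert_emb: torch.Tensor,
                      pv_emb: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Pull the student (Longformer) document embedding toward both
    teachers: SBERT (text structure) and Paragraph Vector (extended
    context). Assumes all embeddings share one dimensionality."""
    return (alpha * F.mse_loss(student_emb, sbert_emb)
            + (1 - alpha) * F.mse_loss(student_emb, pv_emb))

# Illustrative shapes: a batch of 8 documents with 768-dim embeddings.
student = torch.randn(8, 768, requires_grad=True)
loss = distillation_loss(student, torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```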
Evolution of Gender Forms and Bias in Multilingual Corpora
Jurášová, Daniela ; Limisiewicz, Tomasz (advisor) ; Libovický, Jindřich (referee)
Although state-of-the-art machine translation models achieve high translation quality, they often exhibit bias. The imbalance of gender forms in the training data has been identified as the key source of gender bias. The aim of this work is to study the evolution of gender forms in the data and subsequently mitigate the gender bias of machine translation models. We focus on languages with morphological gender (Czech, German, Spanish, and Polish). We thoroughly analyze how the frequency of gendered occupations in the data develops over time and report a slow but steady increase in the frequency of female occupation forms. We then curate the available natural data based on temporal and topic analysis to obtain a gender-balanced portion, and perform fine-tuning experiments on such data. We report a reduction in the gender bias of the models and increased accuracy of translating into the correct gender, with a slight decrease in translation quality. This confirms the benefit of debiasing techniques based on fine-tuning models on balanced data. We contribute a novel method for obtaining gender-balanced data from available natural data, and we emphasize the significant presence of stereotypes in the data and the need to minimize them.
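As a hedged sketch of the temporal analysis of gendered occupation frequencies, the snippet below computes the per-year share of female occupation forms; the word lists and corpus format are placeholders, not the thesis's curated lexicons.

```python
from collections import Counter, defaultdict

# Placeholder lexicons; the thesis uses curated occupation word lists.
FEMALE_FORMS = {"učitelka", "lékařka", "ředitelka"}
MALE_FORMS = {"učitel", "lékař", "ředitel"}

def female_share_by_year(corpus):
    """corpus: iterable of (year, tokens) pairs.
    Returns the yearly share of female forms among gendered occupations."""
    counts = defaultdict(Counter)
    for year, tokens in corpus:
        for tok in tokens:
            t = tok.lower()
            if t in FEMALE_FORMS:
                counts[year]["f"] += 1
            elif t in MALE_FORMS:
                counts[year]["m"] += 1
    return {y: c["f"] / (c["f"] + c["m"])
            for y, c in sorted(counts.items()) if c["f"] + c["m"] > 0}

corpus = [(2001, ["přišla", "učitelka"]), (2001, ["učitel", "učí"])]
print(female_share_by_year(corpus))  # {2001: 0.5}
```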
Sentence representations with similarity interpretation
Svobodová, Zuzana ; Hudeček, Vojtěch (advisor) ; Libovický, Jindřich (referee)
Sentence representations - embeddings - obtained from neural network models are at the core of many applications in both academia and industry. Although embeddings correlate well with the human sense of sentence similarity, there is often a lack of explanation for why a model judges two sentences to be similar. In this thesis, we strive to increase the interpretability of model embeddings by incorporating different sentence-level semantic annotations into the learning process. We introduce a model called SBERTslice that produces embeddings able to distinguish nuanced semantic variations in text, including elements like negation, sentiment, named entities, emotional tone, and verb-oriented relations between words. We evaluated SBERTslice embeddings on various text classification and semantic similarity tasks, and on a majority of them SBERTslice outperformed the original SBERT.
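One plausible reading of the annotation-incorporation idea is an encoder with auxiliary heads that predict semantic labels from dedicated slices of the embedding; the sketch below is written under that assumption and is not SBERTslice's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnnotatedEmbedder(nn.Module):
    """Sentence encoder whose embedding slices are trained to predict
    semantic annotations (negation, sentiment), encouraging
    interpretable structure in the representation."""

    def __init__(self, encoder: nn.Module, dim: int = 768, slice_dim: int = 128):
        super().__init__()
        self.encoder = encoder  # any module mapping inputs -> (B, dim)
        self.slice_dim = slice_dim
        self.negation_head = nn.Linear(slice_dim, 2)
        self.sentiment_head = nn.Linear(slice_dim, 3)

    def forward(self, inputs):
        emb = self.encoder(inputs)  # (B, dim)
        neg = self.negation_head(emb[:, :self.slice_dim])
        sent = self.sentiment_head(emb[:, self.slice_dim:2 * self.slice_dim])
        return emb, neg, sent

def multitask_loss(similarity_loss, neg_logits, neg_y, sent_logits, sent_y, w=0.3):
    """Combine the usual similarity objective with annotation prediction."""
    return similarity_loss + w * (F.cross_entropy(neg_logits, neg_y)
                                  + F.cross_entropy(sent_logits, sent_y))
```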
Understanding cross-lingual abilities in large multilingual language models
Del Valle Girón, José Jacobo ; Libovický, Jindřich (advisor) ; Limisiewicz, Tomasz (referee)
Cross-lingual abilities have been evident in large multilingual language models over the past few years. However, why and under what circumstances they work is not entirely clear. In this work, we move towards a better understanding of these aspects in a specific subset of multilingual models, namely modular multilingual models with cross-lingual transfer learning abilities. We try to quantify the claims of Pfeiffer et al. [2022] regarding their proposed model, X-MOD, as it was tested in a very specific setting which may not align with common low-resource settings. Specifically, we evaluate how the following factors affect downstream performance: the amount of available pre-training data, and hyperparameters such as the number of training steps, the checkpoint selection criteria, and the available overlapping lexicon. With the help of our findings, we also aim to provide guidelines on how best to use X-MOD, especially from a low-resource perspective.
Gender stereotypes in neural sentence representations
Al Ali, Adnan ; Libovický, Jindřich (advisor) ; Dušek, Ondřej (referee)
Neural networks have seen a spike in popularity in natural language processing in recent years. They consistently outperform traditional methods and require less human labor to perfect, as they are trained unsupervised on large text corpora. However, these corpora may contain unwanted elements such as biases. We inspect multiple language models, primarily focusing on a Czech monolingual model, RobeCzech. In the first part of this work, we present a dynamic benchmarking tool for identifying gender stereotypes in a language model. We present the tool to a group of annotators to create a dataset of biased sentences. In the second part, we introduce a method for measuring the model's perceived political values of men and women and compare them to real-world data. We argue that our proposed method provides significant advantages over other methods known to us. We find no strong systematic beliefs or gender biases in the measured political values. We include all the code and created datasets in the attachment.
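As a hedged sketch of template-based gender probing with a masked language model, the snippet below compares model scores for gendered target words; RobeCzech's model ID is real, but the template and targets are illustrative, not the thesis's benchmark items.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="ufal/robeczech-base")
mask = fill.tokenizer.mask_token

def compare_genders(template: str, male: str, female: str) -> dict:
    """Score a male vs. female word filled into a one-mask template.
    Targets should be single vocabulary tokens for exact scores."""
    results = fill(template.format(mask=mask), targets=[male, female])
    return {r["token_str"].strip(): r["score"] for r in results}

# "{mask} pracuje jako lékař." ~ "{mask} works as a doctor."
print(compare_genders("{mask} pracuje jako lékař.", "On", "Ona"))
```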
Neural Concept-to-text Generation with Knowledge Graphs
Szabová, Kristína ; Dušek, Ondřej (advisor) ; Libovický, Jindřich (referee)
Modern language models are strong at generating grammatically correct, natural language. However, they still struggle with commonsense reasoning - a task involving making inferences about common everyday situations without explicitly stated information. Prior research on the topic has shown that providing additional information from external sources helps language models generate better outputs. In this thesis, we explore methods of extracting information from knowledge graphs and using it as additional input for a pre-trained generative language model. We do this either by extracting a subgraph relevant to the context or by using graph neural networks to predict which information is relevant. Moreover, we experiment with a post-editing approach and with a model trained in a multi-task setup (generation and consistency classification). Our methods are evaluated on the CommonGen benchmark for generative commonsense reasoning using both automatic metrics and a detailed error analysis on a small sample of outputs. We show that the methods improve over a simple language-model fine-tuning baseline, although they do not set a new state of the art.
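A minimal sketch of the context-relevant subgraph extraction mentioned above, using networkx; the toy graph and hop limit are illustrative, and the GNN-based relevance prediction is not shown.

```python
import networkx as nx

def extract_subgraph(kg: nx.Graph, concepts: list, hops: int = 1) -> nx.Graph:
    """Return the subgraph induced by the input concepts plus their
    neighbors within `hops` steps, to be linearized as extra LM input."""
    nodes = {c for c in concepts if c in kg}
    frontier = set(nodes)
    for _ in range(hops):
        frontier = {nb for n in frontier for nb in kg.neighbors(n)}
        nodes |= frontier
    return kg.subgraph(nodes)

kg = nx.Graph([("dog", "animal"), ("dog", "bark"), ("animal", "cat")])
sub = extract_subgraph(kg, ["dog"], hops=1)
print(sorted(sub.edges()))  # the 1-hop neighborhood of "dog"
```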
Implicit information extraction from news stories
Kydlíček, Hynek ; Libovický, Jindřich (advisor) ; Helcl, Jindřich (referee)
This work deals with information extraction from Czech news stories. We focus on four tasks: publishing server, article category, author's textual gender, and publication day of the week. Due to the absence of a suitable dataset for these tasks, we present the CZEch NEws Classification dataset (CZE-NEC), one of the most extensive Czech classification datasets, composed of news articles from various sources and spanning over twenty years. The tasks are solved using logistic regression and pre-trained Transformer encoders. Emphasis is put on the fine-tuning methods of the Transformer models, which are evaluated in detail. The models are compared to human evaluators and show significant superiority over humans on all tasks. Furthermore, the models are pitted against the commercial large language model GPT-3, outperforming it on half of the tasks despite GPT-3 being significantly larger. Our work sets strong baseline results on CZE-NEC, allowing for further research in the field.
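As a hedged sketch of the logistic-regression side of the comparison, the snippet below builds a TF-IDF baseline classifier for one of the tasks; the features, labels, and hyperparameters are assumptions, not CZE-NEC's published setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples for the article-category task; CZE-NEC is far larger.
texts = ["Vláda schválila státní rozpočet.", "Sparta porazila Slavii 2:1."]
labels = ["domestic", "sport"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=100_000),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["Fotbalisté vyhráli derby."]))
```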
Automatic generation of medical reports from chest X-rays in Czech
Chaloupský, Lukáš ; Rosa, Rudolf (advisor) ; Libovický, Jindřich (referee)
This thesis deals with the automatic generation of medical reports in the Czech language from input chest X-ray images using deep neural networks. The first part analyzes the problem itself, including a comparison of existing solutions from several common points of view. To interpret medical images in the Czech language, we present a fine-tuned Czech GPT-2 model specialized in medical texts, based on the original pre-trained English GPT-2 model, along with its evaluation. In the second part, the created Czech GPT-2 is used to train a neural network model for generating medical reports. Training was conducted on freely available data, which we preprocessed and adapted for the Czech language. Furthermore, the model's results are discussed and evaluated using standard natural language processing metrics to determine its performance.
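A minimal sketch of the language-model adaptation step, fine-tuning a pre-trained GPT-2 on medical text with a causal-LM objective; the model name, data, and single-example loop are placeholders, and the thesis's image-conditioned report generator is more involved.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

reports = ["Skiagram hrudníku bez ložiskových změn.",
           "Srdeční stín nezvětšen, plicní pole čistá."]

model.train()
for text in reports:  # one step per report; real training batches data
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    out = model(**batch, labels=batch["input_ids"])  # shifted-token LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```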
