National Repository of Grey Literature: 38 records found (records 1 - 10 shown)
Improving Subword Tokenization Methods for Multilingual Models
Balhar, Jiří ; Limisiewicz, Tomasz (advisor) ; Popel, Martin (referee)
In this thesis, we explore the differences between tokenization methods for multilingual neural language models and investigate their impact on language model representation quality. We propose a set of metrics to evaluate the quality of tokenizations. We show that the metrics capture the differences between tokenizers and that they correlate with the downstream performance of multilingual language models. Then, using our metrics, we assess why standard tokenizer training on a multilingual corpus is reported to be ineffective for multilingual models. We investigate design choices such as data size, implementation, and alphabet size. We identify that the issue might be caused by data imbalance, and to solve it we propose sampling the tokenizer training data uniformly. We compare standard tokenizer training with three proposed methods, which we replicate, that aim to mitigate the same reported issues. We show that the principle behind the improvements of the proposed methods is the same as with uniform sampling. Our findings offer a deeper understanding of tokenization methods for multilingual models. We propose a methodology and guidelines for training multilingual tokenizers. Lastly, we show how to achieve improvements in tokenization without the need for more complex tokenization methods.
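The uniform sampling idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "uniform" means drawing the same number of lines from every language, with low-resource languages oversampled with replacement; the function name `sample_uniform` and its parameters are hypothetical and not taken from the thesis.

```python
import random


def sample_uniform(corpora, lines_per_language, seed=0):
    """Draw the same number of lines from each language (assumed reading of
    'uniform sampling'; not the thesis's actual implementation).

    corpora: dict mapping a language code to a non-empty list of text lines.
    Returns a shuffled list of (language, line) pairs.
    """
    rng = random.Random(seed)
    sampled = []
    for lang in sorted(corpora):
        lines = corpora[lang]
        if not lines:
            raise ValueError(f"no data for language {lang!r}")
        if len(lines) >= lines_per_language:
            # Enough data: sample without replacement.
            picked = rng.sample(lines, lines_per_language)
        else:
            # Low-resource language: oversample with replacement.
            picked = [rng.choice(lines) for _ in range(lines_per_language)]
        sampled.extend((lang, line) for line in picked)
    rng.shuffle(sampled)
    return sampled


if __name__ == "__main__":
    toy = {
        "en": [f"english line {i}" for i in range(100)],
        "cs": [f"czech line {i}" for i in range(10)],
        "sw": ["swahili line 0", "swahili line 1"],
    }
    for lang, line in sample_uniform(toy, lines_per_language=3):
        print(lang, line)
```

The balanced output could then be passed to any tokenizer trainer; which trainer the thesis uses is not stated in the abstract.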
Implementation of a software keyboard to input text into the machine translation application
Dvořák, Šimon ; Straňák, Pavel (advisor) ; Popel, Martin (referee)
A vast number of applications need to consume textual input from their users. Translation web applications are no exception. Unlike in other applications, the textual input is very diverse. Anything can occur: all kinds of characters, keyboard layouts, or users with little or no knowledge of the source language. In this thesis, we try to develop means of making input into translation web applications more comfortable. We developed a configurable software keyboard supporting multiple features. These features are: defining multiple keyboard layouts, remapping the physical keys to the active layout's keys, next-word prediction, and phonetic input correction. The software keyboard is easily extensible thanks to its straightforward architecture.
Mutual Relation of Machine Translation and Quality Estimation
Tryhubyshyn, Iryna ; Tamchyna, Aleš (advisor) ; Popel, Martin (referee)
Machine Translation Quality Estimation predicts quality scores for translations produced by Machine Translation systems based on source and output segments. Quality Estimation systems are usually trained in a supervised manner using training data that contains translations produced by one or more (other) Machine Translation systems. Therefore, the choice of training data for Machine Translation has an impact on how well the Quality Estimation system works. This thesis studies the relationship between Machine Translation systems and sentence-level Quality Estimation systems. Using our definitions of Machine Translation system power and Quality Estimation system power, we conducted experiments that involve training Machine Translation and Quality Estimation systems of varying power. We presented Quality Estimation systems evaluation results on test sets of different domains and translated by Machine Translation systems of different power. We find that (i) Quality Estimation systems trained on translations of lower quality outperform Quality Estimation systems trained on translations of higher quality; (ii) evaluating high-quality Machine Translation systems is challenging for Quality Estimation systems of all powers; (iii) high-power Quality Estimation systems work better for out-of-domain distribution...
Non-Autoregressive Neural Machine Translation
Helcl, Jindřich ; Hajič, Jan (advisor) ; Duh, Kevin (referee) ; Popel, Martin (referee)
In recent years, a number of methods for improving the decoding speed of neural machine translation systems have emerged. One approach that proposes fundamental changes to the model architecture is non-autoregressive modeling. In standard autoregressive models, the output token distributions are conditioned on the previously decoded outputs. The conditional dependence allows the model to keep track of the state of the decoding process, which improves the fluency of the output. On the other hand, it requires the neural network computation to be run sequentially, and thus it cannot be parallelized. Non-autoregressive models impose conditional independence on the output distributions, which means that the decoding process is parallelizable and hence the decoding speed improves. A major drawback of this approach is lower translation quality compared to the autoregressive models. The goal of the non-autoregressive translation research is to find methods that improve the translation quality, while retaining high decoding speed. In this thesis, we explore the research progress so far and identify flaws in the generally accepted evaluation methodology. We experiment with non-autoregressive models trained with connectionist temporal classification. We find that even though our models...
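The contrast between the two model families described in this abstract can be written out in standard notation. The symbols are assumptions for illustration and do not come from the abstract: x is the source sentence, y = (y_1, ..., y_T) is the target sequence, and y_{<t} denotes the tokens before position t. This is the basic conditional-independence form only; models trained with connectionist temporal classification additionally marginalize over alignments, which is not shown here.

```latex
% Autoregressive: each token is conditioned on the previously decoded tokens,
% so decoding must proceed sequentially.
p_{\mathrm{AR}}(y \mid x) = \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, x\right)

% Non-autoregressive: tokens are conditionally independent given the source,
% so all positions can be decoded in parallel.
p_{\mathrm{NAR}}(y \mid x) = \prod_{t=1}^{T} p\left(y_t \mid x\right)
```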
Machine Translation Using Syntactic Analysis
Popel, Martin ; Žabokrtský, Zdeněk (advisor) ; Ircing, Pavel (referee) ; Čmejrek, Martin (referee)
This thesis describes our improvement of machine translation (MT), with a special focus on the English-Czech language pair, but using techniques applicable also to other languages. First, we present multiple improvements of the deep-syntactic system TectoMT. For instance, we implemented a novel context-sensitive translation model, comparing several machine learning approaches. We also adapted TectoMT to other domains and languages. Second, we present Transformer - a state-of-the-art end-to-end neural MT system. We analyzed in detail the effect of several training hyper-parameters. With our optimized training, the system outperformed the best result on the WMT2017 test set by +1.0 BLEU. We further extended this system by utilization of monolingual training data and by a new type of backtranslation (+2.8 BLEU compared to the baseline system). In addition, we leveraged domain adaptation and the effect of "translationese" (i.e., which language in parallel data is the original and which is the translation) to optimize MT systems for original-language and translated-language data (gaining further +0.2 BLEU). Our improved neural MT system significantly (p<0.05) outperformed all other systems in English-Czech and Czech-English WMT2018 shared tasks,...
Possibilities for Improving Machine Translation from English to Czech (Možnosti zlepšení strojového překladu z angličtiny do češtiny)
Popel, Martin ; Žabokrtský, Zdeněk (advisor) ; Bojar, Ondřej (referee)
This thesis describes English-Czech Machine Translation as it is implemented in the TectoMT system. The transfer uses deep-syntactic dependency (tectogrammatical) trees and exploits the annotation scheme of the Prague Dependency Treebank. The primary goal of the thesis is to improve the translation quality using both rule-based and statistical methods. First, we present a manual annotation of translation errors in 250 sentences and subsequent identification of frequent errors, their types and sources. The main part of the thesis describes the design and implementation of modifications in the three transfer phases: analysis, transfer and synthesis. The most prominent modification is a novel approach to the transfer phase based on Hidden Markov Tree Models (a tree modification of Hidden Markov Models). The improvements are evaluated in terms of BLEU and NIST scores.
Tool for comparison and evaluation of machine translation
Klejch, Ondřej ; Popel, Martin (advisor) ; Tamchyna, Aleš (referee)
This bachelor thesis describes the development of MT-ComparEval, a tool for comparison and evaluation of machine translation. With this tool it is possible to compare translations according to several criteria, such as automatic metrics of machine translation quality computed on whole documents or single sentences, quality comparison of single-sentence translations with highlighting of confirmed, improving and worsening n-grams, or summaries of the most improving and worsening n-grams for the whole document. When comparing two translations, MT-ComparEval also plots a chart with absolute differences of metrics computed on single sentences and a chart with values obtained from paired bootstrap resampling.
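Paired bootstrap resampling, mentioned at the end of this abstract, can be sketched as follows. This is a simplified illustration and not MT-ComparEval's code: it assumes each system already has one score per sentence and compares the sums over resampled sentence sets, whereas corpus-level metrics such as BLEU would need to be recomputed on each resample. The function name `paired_bootstrap` is hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A beats system B.

    scores_a, scores_b: per-sentence scores for the same test sentences,
    in the same order (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    if not scores_a:
        raise ValueError("at least one sentence is required")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement; the same indices are
        # used for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins_a += 1
    return wins_a / n_resamples


if __name__ == "__main__":
    a = [0.6, 0.7, 0.4, 0.9, 0.5]
    b = [0.5, 0.6, 0.5, 0.7, 0.5]
    print(paired_bootstrap(a, b))
```

A value close to 1.0 suggests that A's advantage is stable across resamples, while a value near 0.5 suggests the difference may be due to the particular test set.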
Word prediction using language models
Koutný, Michal ; Popel, Martin (advisor) ; Novák, Michal (referee)
The thesis uses n-gram language models to improve text entry on a QWERTY keyboard by means of word prediction. Related solutions are briefly introduced, followed by the theoretical background for the work. The analysis in the next part divides the problem into four tasks: language model training, incorporating the model for word prediction, a GUI component, and an evaluation framework. The realization combines Python and C++. The corpora used come from Czech (19 M words) and English (84 M words) Wikipedia articles. A small corpus of Czech educational texts was used to test domain adaptation. Quality metrics are defined and various configurations are measured. The best solutions reduced keystrokes per character to 0.44 for English and 0.55 for Czech on the test data.
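The keystrokes-per-character metric reported in this abstract can be illustrated with a toy sketch. The abstract does not define the metric precisely, so the definition here is an assumption: the user types letters until the target word appears among the top-k suggestions, then accepts it with one keystroke that also inserts the following space. A unigram frequency model stands in for the thesis's n-gram models, and all function names are hypothetical.

```python
from collections import Counter


def train_unigram(words):
    """Count word frequencies (a stand-in for a real n-gram model)."""
    return Counter(words)


def predict(model, prefix, k=1):
    """Return the k most frequent known words starting with the prefix."""
    candidates = [w for w in model if w.startswith(prefix)]
    candidates.sort(key=lambda w: (-model[w], w))
    return candidates[:k]


def keystrokes_per_character(model, text_words, k=1):
    """Simulate typing with prediction and return keystrokes / characters."""
    if not text_words:
        raise ValueError("text_words must not be empty")
    keystrokes = 0
    characters = 0
    for word in text_words:
        characters += len(word) + 1  # the word plus the following space
        typed = 0
        # Type letter by letter until the word is suggested.
        while typed < len(word) and word not in predict(model, word[:typed], k):
            typed += 1
        if typed < len(word):
            keystrokes += typed + 1  # letters typed + one key to accept
        else:
            keystrokes += len(word) + 1  # typed in full + space
    return keystrokes / characters


if __name__ == "__main__":
    model = train_unigram("the cat sat on the mat the end".split())
    print(keystrokes_per_character(model, "the cat sat".split()))
```

Under this definition, a value of 1.0 means prediction saved nothing, and lower values mean fewer keystrokes were needed per character of text.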
