National Repository of Grey Literature: 4 records found
Automatic inflection in Czech language
Sourada, Tomáš ; Rosa, Rudolf (advisor) ; Vidra, Jonáš (referee)
This thesis focuses on the task of automatic morphological inflection of Czech nouns, specifically in out-of-vocabulary (OOV) conditions (inflecting previously unseen words). We automatically extracted a large dataset suitable for training and evaluation in the OOV conditions. We also manually built a real-world OOV dataset of neologisms. We developed three different systems: a retrograde model performing a variation of the kNN algorithm, and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. Compared to an available rule-based inflection system sklonuj.cz and standard SIGMORPHON shared task baselines, our seq2seq model reaches the best results in the standard OOV conditions. Moreover, it achieves state-of-the-art results for 6 out of 16 development languages from the SIGMORPHON 2022 shared task data in the OOV evaluation (feature overlap) in the large data condition. On the real-world OOV dataset, the retrograde model outperforms all neural models and is competitive with a non-neural SIGMORPHON baseline. We release the inflection system with the seq2seq model as a ready-to-use Python library. It could serve as a complement to the state-of-the-art dictionary-based inflection system MorphoDiTa as a back-off for OOV words, especially once extended to other parts of speech.
Automatic correction of errors in the CUBBITT translator outputs
Švandelík, Vojtěch ; Popel, Martin (advisor) ; Vidra, Jonáš (referee)
The thesis deals with post-processing of the outputs of the Czech-English and English-Czech translator CUBBITT. The aim of the work was to develop a tool that would be able to search for mistranslated phrases using a rule-based system and subsequently correct such phrases. We focus on a few specific phenomena, mainly the correction of numbers with units whose original meaning has been changed by the translation and the correction of thousands and decimal separators, which are not always adapted to follow the target-language rules. In addition, we have dealt with correcting personal proper names, which the translator sometimes changes completely. For each of the phenomena, we have analyzed the frequency and the origin of the problem, proposed a solution, and implemented it in a Python package. We have also created a web interface where the package can be tested. Finally, we have evaluated the accuracy of our solution and suggested further extensions.
Morphological segmentation of Czech Words
Vidra, Jonáš ; Žabokrtský, Zdeněk (advisor) ; Mareček, David (referee)
In linguistics, words are usually considered to be composed of morphemes: units that carry meaning and are not further subdivisible. The task of this thesis is to create an automatic method for segmenting Czech words into morphemes, usable within the network of Czech derivational relations DeriNet. We created two different methods. The first one finds morpheme boundaries by differentiating words against their derivational parents, and transitively against their whole derivational family. It explicitly models morphophonological alternations and finds the best boundaries using maximum likelihood estimation. At worst, the results are slightly worse than those of the state-of-the-art method Morfessor FlatCat, and they are significantly better in some settings. The second method is a neural network made to jointly predict segmentation and derivational parents, trained using the output of the first method and the derivational pairs from DeriNet. Our hypothesis that such joint training would increase the quality of the segmentation over training purely on the segmentation task seems to hold in some cases, but not in others. The neural model performs worse than the first one, possibly because it is trained on data that already contains some errors, which it then multiplies.
Extending the Lexical Network DeriNet
Vidra, Jonáš ; Žabokrtský, Zdeněk (advisor) ; Hlaváčová, Jaroslava (referee)
DeriNet is a database of Czech lexical derivates. It is a wordnet in which nodes represent lemmas sampled from the Czech National Corpus and edges represent derivational relations between them (such as work → workable → unworkable). Sourcing the lemmas from a corpus brings two problems: errors and missing lemmas that could link together currently unconnected clusters. Therefore, a more reliable and more complete source of lemmas is needed. The goal of this thesis is to extend the lexicon of DeriNet using lemmas sourced from MorfFlex CZ, a Czech morphological dictionary, and to correct the derivational rules that produce errors with the new lexicon. Error rate is measured by comparing the relations in the database with manually annotated data created as part of the thesis.
