National Repository of Grey Literature: 78 records found (showing records 69-78)
Probabilistic Translation Dictionary
Rouš, Jan ; Pecina, Pavel (referee) ; Žabokrtský, Zdeněk (advisor)
In this work we present a method for semi-automatic training of a probabilistic translation dictionary using large automatically annotated parallel corpora. Based on a study of translation errors and of the role the translation dictionary plays within the TectoMT translation system, we propose models of varying complexity. These basic models were combined into hierarchical models designed to reduce the impact of the sparse-data problem. Various extensions were implemented to deal with common lexical errors. The dictionary with its extensions was compared to the former approach on test data, and the results show improved translation quality.
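As a rough illustration of the hierarchical idea described above, the sketch below implements a toy probabilistic dictionary that backs off from a fine-grained (lemma, POS)-conditioned model to a coarser lemma-only model when the fine conditioning event is unseen. The class and method names are invented for this example and do not come from the thesis.

```python
from collections import Counter, defaultdict

class BackoffTranslationDict:
    """Toy translation dictionary with one level of back-off."""

    def __init__(self):
        self.fine = defaultdict(Counter)    # (src_lemma, src_pos) -> target counts
        self.coarse = defaultdict(Counter)  # src_lemma -> target counts

    def train(self, aligned_pairs):
        # aligned_pairs: iterable of ((src_lemma, src_pos), tgt_lemma)
        for (lemma, pos), tgt in aligned_pairs:
            self.fine[(lemma, pos)][tgt] += 1
            self.coarse[lemma][tgt] += 1

    def translations(self, lemma, pos):
        # Use the fine model if it has seen (lemma, pos); otherwise back off.
        counts = self.fine.get((lemma, pos)) or self.coarse.get(lemma)
        if not counts:
            return []
        total = sum(counts.values())
        return sorted(((t, c / total) for t, c in counts.items()),
                      key=lambda x: -x[1])
```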
Document Keyword Extraction
Klíč, Radoslav ; Schlesinger, Pavel (referee) ; Pecina, Pavel (advisor)
In the present work, the problem of keyword extraction is studied. The work contains a brief introduction to the problem and a description of several approaches to its solution. As part of the work, some of these approaches are implemented and their efficiency is evaluated on a collection of documents. Two software tools are created. The first one's purpose is keyword extraction. The other is a web-based interface for the first tool with one additional function: it can be used to manually assign keywords to texts.
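For context, a common baseline that such a comparison would typically include is TF-IDF ranking. The following sketch is a generic illustration of that baseline, not the tools produced in the thesis.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus_token_lists, k=10):
    """Rank tokens of one document by TF-IDF against a background corpus."""
    n_docs = len(corpus_token_lists)
    tf = Counter(doc_tokens)                 # term frequency in the document
    df = Counter()                           # document frequency in the corpus
    for toks in corpus_token_lists:
        df.update(set(toks))
    scores = {t: tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```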
Lexical Association Measures: Collocation Extraction
Pecina, Pavel ; Hajič, Jan (advisor)
This thesis is devoted to an empirical study of lexical association measures and their application to collocation extraction. We focus on two-word (bigram) collocations only. We compiled a comprehensive inventory of 82 lexical association measures and present their empirical evaluation on four reference data sets: dependency bigrams from the manually annotated Prague Dependency Treebank, surface bigrams from the same source, instances of the latter from the Czech National Corpus provided with automatically assigned lemmas and part-of-speech tags, and distance verb-noun bigrams from the automatically part-of-speech tagged Swedish Parole Corpus. Collocation candidates in the reference data sets were manually annotated and identified as collocations and non-collocations. The evaluation scheme is based on measuring the quality of ranking collocation candidates according to their chance of forming collocations. The methods are compared by precision-recall curves and mean average precision scores adopted from the field of information retrieval. Tests of statistical significance were also performed. Further, we study the possibility of combining lexical association measures and present empirical results of several...
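One of the best-known measures in such an inventory is pointwise mutual information (PMI). The sketch below shows how it is typically computed for bigram candidates; it is a generic illustration, not code from the thesis.

```python
import math

def pmi_scores(bigram_counts, unigram_counts, n_bigrams, n_unigrams):
    """Compute PMI for each bigram type from corpus frequency counts."""
    scores = {}
    for (w1, w2), c in bigram_counts.items():
        p_xy = c / n_bigrams                     # joint probability of the pair
        p_x = unigram_counts[w1] / n_unigrams    # marginal of the first word
        p_y = unigram_counts[w2] / n_unigrams    # marginal of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

High PMI indicates that the two words co-occur far more often than chance would predict, which is the intuition behind treating them as a collocation candidate.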
Entity retrieval on Wikipedia in the scope of the gikiCLEF track
Duarte Torres, Sergio Raul ; Pecina, Pavel (advisor) ; Žabokrtský, Zdeněk (referee)
This thesis presents a system to retrieve entities specified by a question or description given in natural language; the description indicates the entity type and the properties that the entities need to satisfy. This task is analogous to the one proposed in the GikiCLEF 2009 track. The system is fed with the Spanish Wikipedia collection of 2008, and every entity is represented by a Wikipage. We propose three novel methods to perform query expansion for the problem of entity retrieval. We also introduce a novel method that employs the English YAGO and DBpedia semantic resources to determine the target named-entity type; this method improves on previous approaches in which the target NE type was based solely on Wikipedia categories. We show that our system obtains promising results when we evaluate its performance on the GikiCLEF 2009 topic list and compare the results with those of the other participants of the track.
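The thesis's three expansion methods are not reproduced here; as a generic point of reference, the sketch below shows the common pseudo-relevance-feedback style of query expansion, with all names invented for the example.

```python
from collections import Counter

def expand_query(query_terms, top_doc_token_lists, n_new=5):
    """Add the most frequent terms from top-ranked documents to the query."""
    counts = Counter()
    for toks in top_doc_token_lists:
        counts.update(toks)
    candidates = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + candidates[:n_new]
```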
Automatic Word Alignment
Kravalová, Jana ; Novák, Václav (referee) ; Pecina, Pavel (advisor)
Word alignment is a crucial component of modern machine translation systems. Given a sentence in two languages, the task is to determine which words from one language are the most likely translations of words from the other language. As an alternative to the classical generative approach (IBM models), new methods based on discriminative training and maximum-weight bipartite matching algorithms for complete bipartite graphs have been proposed in recent years. The graph vertices represent words in the source and target language. The edges are weighted by measures of association estimated from parallel training data. This work focuses on an effective implementation of the maximum-weight bipartite matching algorithm, the implementation of scoring procedures for graph vertices, and basic experiments and their evaluation.
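A minimal sketch of the matching step follows: given an association-score matrix between source and target words, a maximum-weight bipartite matching can be obtained with the Hungarian algorithm, here via SciPy. The thesis describes its own implementation, so this is only an illustration of the technique, not the thesis's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(scores):
    """scores[i][j] = association between source word i and target word j.

    linear_sum_assignment minimizes cost, so negate to maximize weight.
    """
    rows, cols = linear_sum_assignment(-np.asarray(scores))
    return list(zip(rows.tolist(), cols.tolist()))

# Example: align([[0.9, 0.1], [0.2, 0.8]]) -> [(0, 0), (1, 1)]
```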
Syntax in methods for information retrieval
Kravalová, Jana ; Holub, Martin (referee) ; Pecina, Pavel (advisor)
In recent years, the application of language modeling to information retrieval has been studied quite extensively. Although language models of any type can be used with this approach, only traditional n-gram models based on surface word order have been employed and described in published experiments (often only unigram language models). The goal of this thesis is to design, implement, and evaluate (on Czech data) a method which extends a language model with syntactic information automatically obtained from documents and queries. We incorporate syntactic information into language models and experimentally compare this approach with unigram and bigram models based on surface word order. We also empirically compare methods for smoothing, stemming, and lemmatization, as well as the effectiveness of using stopwords and pseudo-relevance feedback. We conclude with a detailed analysis of these retrieval methods and their performance.
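As a point of reference for the baselines mentioned above, the sketch below scores a query under a Dirichlet-smoothed unigram language model, a standard configuration in language-model IR; it is a generic illustration, not the thesis's implementation.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, mu=2000.0):
    """Score log P(query | doc) under a Dirichlet-smoothed unigram model."""
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for q in query:
        p_coll = coll_counts[q] / coll_len        # collection LM probability
        p = (tf[q] + mu * p_coll) / (dlen + mu)   # Dirichlet smoothing
        score += math.log(p) if p > 0 else float("-inf")
    return score
```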
Automatic Evaluation of Parallel Bilingual Data Quality
Kolovratník, David ; Kuboň, Vladislav (advisor) ; Pecina, Pavel (referee)
Statistical machine translation is an approach particularly dependent on huge amounts of parallel bilingual data, which are used to train a translation model. The translation model takes the place of a rule-based transfer component, and in some systems even of the lexicon. It is believed that translation quality improves with more training data. I tried the opposite: giving less data and observing how the translation score changes. I selected the sentence pairs that remain in the corpus by several criteria: first randomly, then according to the sentence-length ratio, and finally according to the number of word pairs that a dictionary recognizes as translation pairs. I show that selection by a suitable criterion slows the decline of the NIST and BLEU scores as the corpus shrinks, and in some cases may even yield a better score. Decreasing the corpus size also leads to faster evaluation and lower space requirements. This may be useful when implementing a machine translation system on small devices with limited system resources.
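A rough sketch of the two informed selection criteria (sentence-length ratio and dictionary-confirmed word pairs) might look as follows; the function names and the keep-fraction interface are invented for illustration.

```python
def length_ratio_score(src, tgt):
    """Closer to 1.0 means the two sentences have comparable lengths."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def dictionary_score(src, tgt, bilingual_dict):
    """bilingual_dict: set of (src_word, tgt_word) translation pairs."""
    tgt_words = set(tgt.split())
    hits = sum(1 for s in src.split() for t in tgt_words
               if (s, t) in bilingual_dict)
    return hits / max(len(src.split()), 1)

def select(pairs, score_fn, keep_fraction=0.5):
    """Keep the best-scoring fraction of (src, tgt) sentence pairs."""
    ranked = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    return ranked[:int(len(ranked) * keep_fraction)]

# e.g. select(pairs, length_ratio_score)
#      select(pairs, lambda s, t: dictionary_score(s, t, my_dict))
```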
Webcrawler
Lessner, Daniel ; Podveský, Petr (referee) ; Pecina, Pavel (advisor)
This thesis deals with the construction of a web crawler. Its task is to recursively download Czech web pages from the internet and clean them down to plain text (no HTML tags, styles, or scripts), which will then be used to build a very large language corpus useful for further research. Key properties of the crawler are unobtrusive operation, avoiding load on third-party resources, and full compliance with the non-binding Robots Exclusion Standard recommendation. The crawler is written in Python and makes intensive use of its standard library and fast string handling. Given the nature of the task, we opted for a parallel implementation that should fully utilize the available bandwidth, and this aim was achieved. The result of the thesis is thus a crawler ready to collect enough text for the corpus. It can of course also be used for other purposes, especially where consideration for third-party resources is required. Besides its contribution to linguistics, it also provides interesting information about the content of the Czech internet.
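A minimal sketch of the politeness check such a crawler performs before each download, using only the Python standard library, is shown below; the thesis's parallel implementation is considerably more elaborate.

```python
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

def allowed(url, user_agent="ResearchCrawler"):
    """Check the site's robots.txt before fetching a page."""
    base = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch(url):
    """Download a page only if robots.txt permits it."""
    if allowed(url):
        with urlopen(url, timeout=10) as resp:
            return resp.read()
    return None
```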
Text segmentation
Češka, Pavel ; Pecina, Pavel (advisor) ; Podveský, Petr (referee)
The bachelor thesis focuses on the basic pre-processing (tokenization and segmentation) of Czech texts, mainly for the purposes of a Czech internet corpus. The texts for this corpus will be automatically obtained from the World Wide Web; therefore, segmentation is preceded by character-encoding recognition, cleaning, and language identification. We performed experiments with two methods of language identification and present their results. The first method is based on comparing the most frequent n-grams (substrings of length n) extracted from an unknown document and from a large Czech corpus. The second one employs a model estimating word probabilities by conditional probabilities of trigrams estimated on the same corpus. For wider usage, we developed a module for tokenization and identification of sentence boundaries by decision-tree analysis of the nearest context of potential sentence boundaries, utilizing extensive lists of Czech abbreviations. The decision tree was trained on a set of manually processed data. Its evaluation was based on independent human judgements, and the results are presented in the thesis.
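A minimal sketch of the first language-identification method (comparing the most frequent character n-grams of a document with a reference profile) is given below; the simple overlap score used here stands in for whatever comparison metric the thesis actually applies.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the `top` most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def similarity(doc_text, ref_profile, n=3, top=300):
    """Fraction of the document's top n-grams shared with the reference."""
    doc_profile = set(ngram_profile(doc_text, n, top))
    return len(doc_profile & set(ref_profile)) / top

# A document is classified as Czech if its similarity to the profile
# built from a large Czech corpus exceeds a chosen threshold.
```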

See also: similar author names
Pecina, Petr (3 records)