National Repository of Grey Literature: 78 records found (records 11 - 20)
Universal morphological tagger
Long, Duong Thanh ; Pecina, Pavel (advisor) ; Žabokrtský, Zdeněk (referee)
Part-of-speech (POS) tagging is one of the most basic and crucial tasks in Natural Language Processing (NLP). Supervised POS taggers perform well on many resource-rich languages, e.g. English, French, or Portuguese, where manually annotated data are available. However, it is impossible to use a supervised approach for the vast number of resource-poor languages. In this thesis, we apply a multilingual unsupervised method for building taggers for resource-poor languages based on parallel data (Universal Tagger); that is, we use parallel data as a bridge to transfer tag information from resource-rich to resource-poor languages. On average, our tagger performs on par with the state of the art on the same test set of eight languages. However, we use less data and a less sophisticated method, which also results in a significant difference in speed. In an effort to further improve performance, we investigate the choice of source language. We found that English is rarely the best source language. We successfully built a model that can predict the best source language based only on monolingual data. However, even better predictions can be made if we additionally use parallel data. Finally, we show that, if multiple source languages are available, it is possible to get further improvement by incorporating...
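The core of the tag-transfer idea described above can be sketched as follows. This is a minimal illustration, not the thesis's actual system: the function name, the tags, and the toy alignment are all made up for the example; real systems aggregate projections over many parallel sentences.

```python
# Hypothetical sketch of POS tag projection through word alignment:
# tags known on the resource-rich (source) side are copied to the
# aligned tokens of the resource-poor (target) side.

def project_tags(source_tags, alignment):
    """Project POS tags across a word alignment.

    source_tags: list of tags for the source-sentence tokens.
    alignment:   list of (source_index, target_index) alignment pairs.
    Returns a dict mapping target token indices to projected tags.
    """
    projected = {}
    for src_i, tgt_i in alignment:
        # Keep the first projection per target token; a real tagger
        # would instead vote over a large parallel corpus.
        projected.setdefault(tgt_i, source_tags[src_i])
    return projected

# Source "the dog sleeps" aligned one-to-one with a target sentence:
tags = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)])
```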
Near duplicate detection in large document collections
Benčík, Daniel ; Pecina, Pavel (advisor) ; Kopecký, Michal (referee)
This thesis deals with the problem of detecting documents that are so similar to one another that we can consider them (nearly) identical, in collections of up to millions of documents. The main aim of this thesis is a comparison of new, fast algorithms designed to solve this task with current algorithms, which due to their complexity cannot be used for large collections. The thesis contains an implementation of both the new and the current methods, together with applications designed to compare these methods experimentally.
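A common baseline for this task is to compare documents as sets of word shingles under Jaccard similarity; the sketch below shows that baseline only, not the faster algorithms the thesis evaluates. The shingle size and the 0.4 threshold are illustrative choices.

```python
# Near-duplicate check via word k-shingles and Jaccard similarity.

def shingles(text, k=3):
    """Set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))
near_duplicate = sim >= 0.4  # illustrative threshold
```

For millions of documents, computing all pairwise similarities is exactly the quadratic cost the thesis's faster algorithms avoid (e.g. by hashing shingle sets into short signatures).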
Syntax in methods for information retrieval
Kravalová, Jana ; Pecina, Pavel (advisor) ; Holub, Martin (referee)
In recent years, the application of language modeling in information retrieval has been studied quite extensively. Although language models of any type can be used with this approach, only traditional n-gram models based on surface word order have been employed in published experiments (often only unigram language models). The goal of this thesis is to design, implement, and evaluate (on Czech data) a method which extends a language model with syntactic information automatically obtained from documents and queries. We incorporate syntactic information into language models and experimentally compare this approach with unigram and bigram models based on surface word order. We also empirically compare methods for smoothing, stemming, and lemmatization, as well as the effectiveness of using stopwords and pseudo-relevance feedback. We conclude with a detailed analysis of these retrieval methods and their performance.
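The unigram baseline that the thesis extends can be sketched as query-likelihood retrieval with Jelinek-Mercer smoothing. This is a generic textbook sketch, not the thesis's implementation; the documents and the lambda value are illustrative.

```python
import math
from collections import Counter

# Query-likelihood scoring with a unigram language model:
# each document is scored by log P(query | document), where the
# document model is smoothed with the collection model.

def score(query, doc, collection, lam=0.5):
    d, c = Counter(doc), Counter(collection)
    s = 0.0
    for w in query:
        p_doc = d[w] / len(doc)                # document model
        p_col = c[w] / len(collection)         # collection model
        s += math.log(lam * p_doc + (1 - lam) * p_col)
    return s

docs = [["czech", "language", "model"], ["surface", "word", "order"]]
collection = [w for d in docs for w in d]
query = ["language", "model"]
best = max(range(len(docs)), key=lambda i: score(query, docs[i], collection))
```

A syntax-aware variant would, for instance, add dependency-based "bigrams" to the model instead of relying only on surface word order.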
Webcrawler
Lessner, Daniel ; Pecina, Pavel (advisor) ; Podveský, Petr (referee)
This thesis deals with the construction of a web crawler. Its task is to recursively download Czech web pages and clean them down to plain text (no HTML tags, styles, or scripts), which will then be used to build a very large language corpus useful for further research. Key properties of the crawler are unobtrusive operation, avoiding load on third-party resources, and full compliance with the non-binding Robots Exclusion Standard recommendation. The crawler is written in Python and makes heavy use of its standard library and its fast string handling. Given the nature of the task, we opted for a parallel implementation that should fully utilize the available bandwidth, and we succeeded in this aim. The result of the thesis is thus a crawler ready to collect enough text for the corpus. It is, of course, usable for other purposes as well, especially where consideration for third-party resources is required. Besides its contribution to linguistics, it also provides interesting information about the content of the Czech web.
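Respecting the Robots Exclusion Standard, as this crawler does, is directly supported by Python's standard library. The robots.txt content and URLs below are illustrative; a real crawler would fetch robots.txt from each host before crawling it.

```python
from urllib import robotparser

# Check the Robots Exclusion Standard before fetching a page.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch("MyCrawler", "http://example.cz/index.html")
blocked = rp.can_fetch("MyCrawler", "http://example.cz/private/data.html")
```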
Graph-based dependency parsing
Wimberský, Antonín ; Pecina, Pavel (advisor) ; Schlesinger, Pavel (referee)
In the present work we study a practical solution to the dependency parsing problem using a graph algorithm for finding the maximum spanning tree in a directed graph (multigraph). An advantage of this approach is that non-projective constructions are parsed very easily. We represent the parsed sentence as a directed multigraph whose vertices are the words of the sentence and whose edges represent (potential) relations between individual pairs of words. Edge weights are obtained from training data; they can be computed, for example, as the probability of a relation between the two given words, possibly in combination with other, more advanced methods. The resulting maximum spanning tree then yields the dependency tree of the sentence.
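The greedy first step of the Chu-Liu/Edmonds maximum spanning tree algorithm commonly used in graph-based parsing can be sketched as follows: every word picks its highest-scoring head. This is only a sketch under that simplification; the cycle contraction the full algorithm requires is omitted, and the edge scores are made-up numbers, not trained weights.

```python
# Greedy head selection, the first step of Chu-Liu/Edmonds.
# Node 0 is the artificial root; words are numbered 1..n_words.

def greedy_heads(n_words, scores):
    """scores maps (head, dependent) pairs to real-valued weights.
    Returns the best head for each word (cycles not yet resolved)."""
    heads = {}
    for dep in range(1, n_words + 1):
        heads[dep] = max(
            (h for h in range(n_words + 1) if h != dep and (h, dep) in scores),
            key=lambda h: scores[(h, dep)],
        )
    return heads

# "John saw Mary": root->saw, saw->John, saw->Mary should win.
scores = {(0, 2): 9, (0, 1): 1, (0, 3): 1,
          (2, 1): 8, (3, 1): 2, (2, 3): 7, (1, 3): 2}
heads = greedy_heads(3, scores)
```

If the selected edges form a cycle, the full algorithm contracts the cycle into a single node and recurses, which is what makes the method handle non-projective trees so naturally.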
Matching Images to Texts
Hajič, Jan ; Pecina, Pavel (advisor) ; Průša, Daniel (referee)
We build a joint multimodal model of text and images for automatically assigning illustrative images to journalistic articles. We approach the task as an unsupervised representation learning problem of finding a common representation that abstracts from the individual modalities, inspired by the multimodal Deep Boltzmann Machine of Srivastava and Salakhutdinov. We use state-of-the-art image content classification features obtained from the Convolutional Neural Network of Krizhevsky et al. as input "images" and entire documents instead of keywords as input texts. A deep learning and experiment management library, Safire, has been developed. We have not been able to create a successful retrieval system because of difficulties with training neural networks on the very sparse word observations. However, we have gained substantial understanding of the nature of these difficulties and are thus confident that we will be able to improve in future work.
Automatic cleaning of HTML documents
Marek, Michal ; Pecina, Pavel (advisor) ; Straňák, Pavel (referee)
This paper describes a system for automatic cleaning of HTML documents, which was used for the participation of Charles University in CLEANEVAL 2007. CLEANEVAL is a shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We treat this task as a sequence-labeling problem, and our experimental system is based on Conditional Random Fields, exploiting a set of features extracted from the textual content and HTML structure of the analyzed web pages for each block of text.
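In such a sequence-labeling setup, each text block is turned into a feature vector before the CRF sees it. The features below are generic illustrations of the kind of textual and structural cues such a system might use; the CLEANEVAL system's actual feature set is not reproduced here.

```python
# Illustrative per-block feature extraction for a block-labeling CRF.

def block_features(text, html_tag):
    """Map one text block (and its enclosing HTML tag) to features."""
    words = text.split()
    return {
        "tag=" + html_tag: 1.0,                      # structural cue
        "n_words": float(len(words)),                # boilerplate is short
        "avg_word_len": (sum(map(len, words)) / len(words)) if words else 0.0,
        "ends_with_period": 1.0 if text.rstrip().endswith(".") else 0.0,
    }

feats = block_features("This is a normal paragraph of running text.", "p")
```

A CRF then labels the sequence of blocks jointly (e.g. keep vs. discard), so that the decision for one block can depend on its neighbors.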
Automatic construction of semantic networks
Kirschner, Martin ; Pecina, Pavel (advisor) ; Holub, Martin (referee)
The presented work explores the possibilities of automatic construction and expansion of semantic networks using machine learning methods. The main focus is on the feature retrieval procedure for the data set. The work presents a method of semantic relation retrieval based on the distributional hypothesis and trained on data from the Czech WordNet. We also show the first results for the Czech language in this area of research. Part of the thesis is a set of software tools for processing and evaluating the input data, together with an overview and discussion of their results on real-world data. The resulting tools can process data on the order of hundreds of millions of words. The research part of the thesis used Czech morphologically and syntactically annotated data, but the methods are not language dependent.
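The distributional hypothesis underlying the method says that words occurring in similar contexts tend to be semantically related. A minimal sketch of that idea, with a made-up two-sentence "corpus" and plain co-occurrence counts rather than the thesis's feature set:

```python
import math
from collections import Counter

# Bag-of-words context vectors and their cosine similarity.

def context_vector(target, corpus, window=2):
    """Count words co-occurring with `target` within a +/-window span."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [["the", "dog", "barks", "loudly"], ["the", "cat", "barks", "loudly"]]
sim = cosine(context_vector("dog", corpus), context_vector("cat", corpus))
```

Here "dog" and "cat" share identical contexts, so their similarity is maximal; at corpus scale, such similarities supply candidate relations for the semantic network.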
Methods of multiword expression extraction from text
Przywara, Česlav ; Pecina, Pavel (advisor) ; Schlesinger, Pavel (referee)
The goal of this thesis is an efficient implementation of methods for multiword expression extraction from text, such that the resulting program is capable of processing large textual corpora containing up to billions of words. An additional function of the program is context tracing of extracted N-grams. For the purposes of the thesis, the implementation is specially adjusted for collocation extraction from the Prague Dependency Treebank, but the program is designed in a manner that allows easy future extension.
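The core operation of such a tool is counting N-grams in a single streaming pass over the corpus. The sketch below shows just that operation on a toy sentence; the tool's own pipeline, corpus format, and context-tracing machinery are not reproduced.

```python
from collections import Counter

# Streaming N-gram extraction and counting.

def ngrams(tokens, n):
    """Lazily yield all n-grams of a token sequence as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))
top = bigram_counts.most_common(1)[0]
```

At billion-word scale the counting itself stays linear, but the counts no longer fit in memory, which is where the engineering effort of such an implementation goes (sharded or on-disk count tables).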
Lexical Association Measures: Collocation Extraction
Pecina, Pavel ; Hajič, Jan (advisor) ; Semecký, Jiří (referee) ; Baldwin, Timothy (referee)
This thesis is devoted to an empirical study of lexical association measures and their application to collocation extraction. We focus on two-word (bigram) collocations only. We compiled a comprehensive inventory of 82 lexical association measures and present their empirical evaluation on four reference data sets: dependency bigrams from the manually annotated Prague Dependency Treebank, surface bigrams from the same source, instances of surface bigrams from the Czech National Corpus provided with automatically assigned lemmas and part-of-speech tags, and distance verb-noun bigrams from the automatically part-of-speech-tagged Swedish Parole corpus. Collocation candidates in the reference data sets were manually annotated and labeled as collocations or non-collocations. The evaluation scheme is based on measuring the quality of ranking collocation candidates according to their chance of forming collocations. The methods are compared by precision-recall curves and mean average precision scores adopted from the field of information retrieval. Tests of statistical significance were also performed. Further, we study the possibility of combining lexical association measures and present empirical results of several combination methods that significantly improved the performance in this task. We also propose a model...
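One classic member of such an inventory of association measures is pointwise mutual information (PMI), which compares a bigram's observed frequency with what independence would predict. The toy corpus and counts below are purely illustrative; the thesis's reference data and full measure inventory are not reproduced.

```python
import math
from collections import Counter

# Pointwise mutual information of a bigram:
#   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )

def pmi(bigram, unigrams, bigrams, n_tokens, n_bigrams):
    p_xy = bigrams[bigram] / n_bigrams
    p_x = unigrams[bigram[0]] / n_tokens
    p_y = unigrams[bigram[1]] / n_tokens
    return math.log2(p_xy / (p_x * p_y))

tokens = "new york is big and new york is busy".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
score = pmi(("new", "york"), unigrams, bigrams, len(tokens), len(tokens) - 1)
```

Ranking all candidate bigrams by such a score, and then measuring the ranking with precision-recall curves against manual annotation, is exactly the evaluation scheme the abstract describes.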
