National Repository of Grey Literature: 70 records found (records 51 - 60)
Semantic Network - Manual Annotation and its Evaluation
Novák, Václav ; Hajič, Jan (advisor) ; Peregrin, Jaroslav (referee) ; Štěpánek, Jan (referee)
The Prague Dependency Treebank (PDT) is a valuable resource of linguistic information annotated on several layers. These layers range from shallow to deep and they should contain all the linguistic information about the text. The natural extension is to add a semantic layer suitable as a knowledge base for tasks like question answering, information extraction etc. In this thesis I set up criteria for this representation, explore the possible formalisms for this task and discuss their properties. One of them, Multilayered Extended Semantic Networks (MultiNet), is chosen for further investigation. Its properties are described and an annotation process set up. I discuss some practical modifications of MultiNet for the purpose of manual annotation. MultiNet elements are compared to the elements of the deep linguistic layer of PDT. The tools and problems of the annotation process are presented and initial annotation data evaluated.
Lexical Association Measures: Collocation Extraction
Pecina, Pavel ; Hajič, Jan (advisor) ; Semecký, Jiří (referee) ; Baldwin, Timothy (referee)
This thesis is devoted to an empirical study of lexical association measures and their application to collocation extraction. We focus on two-word (bigram) collocations only. We compiled a comprehensive inventory of 82 lexical association measures and present their empirical evaluation on four reference data sets: dependency bigrams from the manually annotated Prague Dependency Treebank, surface bigrams from the same source, instances of surface bigrams from the Czech National Corpus provided with automatically assigned lemmas and part-of-speech tags, and distance verb-noun bigrams from the automatically part-of-speech tagged Swedish Parole corpus. Collocation candidates in the reference data sets were manually annotated and labeled as collocations and non-collocations. The evaluation scheme is based on measuring the quality of ranking collocation candidates according to their chance to form collocations. The methods are compared by precision-recall curves and mean average precision scores adopted from the field of information retrieval. Tests of statistical significance were also performed. Further, we study the possibility of combining lexical association measures and present empirical results of several combination methods that significantly improved the performance in this task. We also propose a model...
Netgraph - A Tool for Searching in the Prague Dependency Treebank 2.0
Mírovský, Jiří ; Hajič, Jan (advisor) ; Rosen, Alexandr (referee) ; Ondruška, Roman (referee)
This thesis connects three things that already existed. The first was the Prague Dependency Treebank 2.0, one of the most advanced treebanks in the linguistic world. The second was a very limited but extremely intuitive search tool, Netgraph 1.0. The third was a group of users longing for a simple and intuitive tool that would be powerful enough to search the Prague Dependency Treebank. In the thesis, we study the annotation of the Prague Dependency Treebank 2.0, especially on the tectogrammatical layer, which is by far the most complex layer of the treebank, and assemble a list of requirements for a query language that would allow searching for and studying all linguistic phenomena annotated in the treebank. We propose an extension to the query language of the existing search tool Netgraph 1.0 and show that the extended query language satisfies the list of requirements. We also show how all principal linguistic phenomena annotated in the treebank can be searched for with the query language. The proposed query language has also been implemented: we present the search tool and discuss its data format. An attached CD-ROM contains the installation of the tool.
Error detection in speech recognition
Tobolíková, Petra ; Peterek, Nino (referee) ; Hajič, Jan (advisor)
This thesis tackles the problem of error detection in speech recognition. First, principles of recent approaches to automatic speech recognition are introduced. Various deficiencies of speech recognition that cause imperfect recognition results are outlined. Currently known methods of "confidence score" computation are then described. The next chapter introduces three machine learning algorithms which were employed in the error detection methods implemented in this thesis: logistic regression, artificial neural networks and decision trees. These machine learning methods use certain attributes of the recognized words as input variables and predict an estimated confidence score value. The open source software "R" has been used throughout, showing the usage of the aforementioned methods. These methods have been tested on Czech radio and TV broadcasts. The results obtained by those methods are compared using ROC curves, standard errors and possible (oracle) WER reduction. Programming documentation of the code used in the implementation is enclosed as well. Finally, efficient word attributes for error detection are summarized.
Verb Valency Frames Disambiguation
Semecký, Jiří ; Hajič, Jan (advisor) ; Krbec, Pavel (referee) ; Lopatková, Markéta (referee)
Semantic analysis has become a bottleneck of many natural language applications. Machine translation, automatic question answering, dialog management, and others rely on high quality semantic analysis. Verbs are central elements of clauses with strong influence on the realization of whole sentences. Therefore the semantic analysis of verbs plays a key role in the analysis of natural language. We believe that solid disambiguation of verb senses can boost the performance of many real-life applications. In this thesis, we investigate the potential of statistical disambiguation of verb senses. Each verb occurrence can be described by diverse types of information. We investigate which information is worth considering when determining the sense of verbs. Different types of classification methods are tested with regard to the topic. In particular, we compared the Naive Bayes classifier, decision trees, rule-based method, maximum entropy, and support vector machines. The proposed methods are thoroughly evaluated on two different Czech corpora, VALEVAL and the Prague Dependency Treebank. Significant improvement over the baseline is observed.
Functional Arabic Morphology: Formal System and Implementation
Smrž, Otakar ; Vidová Hladká, Barbora (advisor) ; Hajič, Jan (referee) ; Habash, Nizar Y. (referee)
Functional Arabic Morphology is a formulation of the Arabic inflectional system seeking the working interface between morphology and syntax. ElixirFM is its high-level implementation that reuses and extends the Functional Morphology library for Haskell. Inflection and derivation are modeled in terms of paradigms, grammatical categories, lexemes and word classes. The computation of analysis or generation is conceptually distinguished from the general-purpose linguistic model. The lexicon of ElixirFM is designed with respect to abstraction, yet is no more complicated than printed dictionaries. It is derived from the open-source Buckwalter lexicon and is enhanced with information sourcing from the syntactic annotations of the Prague Arabic Dependency Treebank. MorphoTrees is the idea of building effective and intuitive hierarchies over the information provided by computational morphological systems. MorphoTrees are implemented for Arabic as an extension to the TrEd annotation environment based on Perl. Encode Arabic libraries for Haskell and Perl serve for processing the non-trivial and multi-purpose ArabTEX notation that encodes Arabic orthographies and phonetic transcriptions in parallel.
Natural Language Interface for online webcasts
Macošek, Jan ; Vidová Hladká, Barbora (referee) ; Hajič, Jan (advisor)
This text describes the development of a natural language interface for online webcasts. These webcasts are transformed from text to speech and then played by the electronic rabbit Nabaztag. Its user can control it by voice commands, so the text also focuses on training acoustic models with the HTK Toolkit and on using these models to recognize speech with the Julius speech recognizer. Searching for the webcasts and their processing is also described, along with some problems that occurred during speech synthesis of sport-oriented texts.
Lexical Association Measures: Collocation Extraction
Pecina, Pavel ; Hajič, Jan (advisor)
Abstract of doctoral thesis. This thesis is devoted to an empirical study of lexical association measures and their application for collocation extraction. We focus on two-word (bigram) collocations only. We compiled a comprehensive inventory of 82 lexical association measures and present their empirical evaluation on four reference data sets: dependency bigrams from the manually annotated Prague Dependency Treebank, surface bigrams from the same source, instances of the previous from the Czech National Corpus provided with automatically assigned lemmas and part-of-speech tags, and distance verb-noun bigrams from the automatically part-of-speech tagged Swedish Parole Corpus. Collocation candidates in the reference data sets were manually annotated and identified as collocations and non-collocations. The evaluation scheme is based on measuring the quality of ranking collocation candidates according to their chance to form collocations. The methods are compared by precision-recall curves and mean average precision scores adopted from the field of information retrieval. Tests of statistical significance were also performed. Further, we study the possibility of combining lexical association measures and present empirical results of several...
Automatic annotation of English on the tectogrammatical level
Toman, Josef ; Hajič, Jan (advisor) ; Žabokrtský, Zdeněk (referee)
The tectogrammatical layer is very complex and its annotation is difficult and expensive. Unlike other corpora, the Prague English Dependency Treebank (PEDT) is based on data for which a syntactic annotation already exists, even though a fundamentally different one. The goal of this work is to propose and implement methods of automatic annotation that use the available data and (preferably) minimize the effort needed for manual annotation. A high-quality evaluation is important so that the contribution of the methods used can be verified. Tens of modules, which focus on various aspects of annotation, were created. The analysis of their activity is complicated and required a complex system to be created. The analyses created with it are very detailed. The outcome is positive and encourages continuing the work and extending it further.
