National Repository of Grey Literature : 72 records found (records 21 - 30)
A Multilingual Database of Collocations
Helcl, Jindřich ; Hajič, Jan (advisor) ; Mareček, David (referee)
Collocations are groups of words that co-occur more often than would be expected by chance; they also include phrases whose combined meaning differs from the meanings of the individual words. This thesis aims to find collocations in large data sets and to create a database that allows their retrieval. Pointwise Mutual Information (PMI), a score based on word frequencies, is computed to identify collocations; word pairs with the highest PMI values are considered good collocation candidates. The chosen collocations are stored in a database in a format that allows searching with Apache Lucene. Part of the thesis is a web user interface providing a quick and easy way to search for collocations. If the service is fast enough and the collocations are of good quality, translators will be able to use it to find proper equivalents in the target language, and students of a foreign language will be able to use it to extend their vocabulary. Such a database will be created independently for several languages, including Czech and English.
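The PMI scoring described in the abstract can be sketched roughly as follows; the function name, the toy corpus, and the minimum-count cutoff are illustrative assumptions, not part of the thesis:

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    Pairs rarer than min_count are skipped as unreliable."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# toy corpus: "new york" co-occurs far more often than chance
tokens = "the new york times in new york reports on new york".split()
scores = pmi_scores(tokens)
best = max(scores, key=scores.get)  # -> ('new', 'york')
```

A real pipeline would compute these counts over a large corpus and index the top-scoring pairs for retrieval (the thesis uses Apache Lucene for that step).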
Natural Language Interface for online webcasts
Macošek, Jan ; Hajič, Jan (advisor) ; Vidová Hladká, Barbora (referee)
This text describes the development of a natural language interface for online webcasts. The webcasts are transformed from text to speech and then played by the electronic rabbit Nabaztag. Users can control it by voice commands, so the text also focuses on training acoustic models with the HTK Toolkit and on using these models to recognize speech with the Julius speech recognizer. Searching for the webcasts and their processing is also described, along with some problems that occurred during speech synthesis of sport-oriented texts.
Lexical Association Measures: Collocation Extraction
Pecina, Pavel ; Hajič, Jan (advisor) ; Semecký, Jiří (referee) ; Baldwin, Timothy (referee)
This thesis is devoted to an empirical study of lexical association measures and their application to collocation extraction. We focus on two-word (bigram) collocations only. We compiled a comprehensive inventory of 82 lexical association measures and present their empirical evaluation on four reference data sets: dependency bigrams from the manually annotated Prague Dependency Treebank, surface bigrams from the same source, instances of surface bigrams from the Czech National Corpus provided with automatically assigned lemmas and part-of-speech tags, and distance verb-noun bigrams from the automatically part-of-speech tagged Swedish Parole corpus. Collocation candidates in the reference data sets were manually annotated and labeled as collocations and non-collocations. The evaluation scheme is based on measuring the quality of ranking collocation candidates according to their chance to form collocations. The methods are compared by precision-recall curves and mean average precision scores adopted from the field of information retrieval. Tests of statistical significance were also performed. Further, we study the possibility of combining lexical association measures and present empirical results of several combination methods that significantly improved the performance in this task. We also propose a model...
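The mean average precision evaluation mentioned above can be illustrated with a short sketch; the function and the example labels are hypothetical, not taken from the thesis data:

```python
def average_precision(ranked_labels):
    """Average precision for candidates ranked by an association
    measure, given binary gold labels in rank order
    (1 = true collocation, 0 = non-collocation)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / hits if hits else 0.0

# a measure that ranks three true collocations at ranks 1, 2, and 4
ap = average_precision([1, 1, 0, 1, 0])  # (1/1 + 2/2 + 3/4) / 3
```

Mean average precision is then just this value averaged over queries (here, over reference data sets), which is why it rewards measures that push true collocations toward the top of the ranking.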
Error detection in speech recognition
Tobolíková, Petra ; Hajič, Jan (advisor) ; Peterek, Nino (referee)
This thesis tackles the problem of error detection in speech recognition. First, principles of recent approaches to automatic speech recognition are introduced. Various deficiencies of speech recognition that cause imperfect recognition results are outlined. Currently known methods of "confidence score" computation are then described. The next chapter introduces three machine learning algorithms which were employed in the error detection methods implemented in this thesis: logistic regression, artificial neural networks and decision trees. These machine learning methods use certain attributes of the recognized words as input variables and predict an estimated confidence score value. The open source software "R" has been used throughout, showing the usage of the aforementioned methods. These methods have been tested on Czech radio and TV broadcasts. The results obtained by those methods are compared using ROC curves, standard errors and possible (oracle) WER reduction. Programming documentation of the code used in the implementation is enclosed as well. Finally, efficient word attributes for error detection are summarized.
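The logistic-regression approach to confidence scoring can be sketched as below; the weights and the three word attributes are invented placeholders (the thesis trains real models in R on broadcast data):

```python
import math

def confidence(features, weights, bias):
    """Logistic-regression confidence score: the sigmoid of a weighted
    sum of word attributes (e.g. acoustic score, LM score, duration)
    estimates the probability that the recognized word is correct."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights a trained model might produce
weights, bias = [2.0, 1.5, -0.5], 0.1
good = confidence([0.9, 0.8, 0.2], weights, bias)  # plausible word
bad = confidence([0.1, 0.2, 0.9], weights, bias)   # likely error
```

Sweeping a threshold over such scores and counting correctly and incorrectly flagged words at each setting is exactly what produces the ROC curves used for comparison in the thesis.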
Functional Arabic Morphology: Formal System and Implementation
Smrž, Otakar ; Vidová Hladká, Barbora (advisor) ; Hajič, Jan (referee) ; Habash, Nizar Y. (referee)
Functional Arabic Morphology is a formulation of the Arabic inflectional system seeking the working interface between morphology and syntax. ElixirFM is its high-level implementation that reuses and extends the Functional Morphology library for Haskell. Inflection and derivation are modeled in terms of paradigms, grammatical categories, lexemes and word classes. The computation of analysis or generation is conceptually distinguished from the general-purpose linguistic model. The lexicon of ElixirFM is designed with respect to abstraction, yet is no more complicated than printed dictionaries. It is derived from the open-source Buckwalter lexicon and is enhanced with information drawn from the syntactic annotations of the Prague Arabic Dependency Treebank. MorphoTrees is the idea of building effective and intuitive hierarchies over the information provided by computational morphological systems. MorphoTrees are implemented for Arabic as an extension to the TrEd annotation environment based on Perl. The Encode Arabic libraries for Haskell and Perl serve for processing the non-trivial and multi-purpose ArabTEX notation that encodes Arabic orthographies and phonetic transcriptions in parallel.
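The paradigm-based view of Arabic inflection and derivation can be illustrated with a toy root-and-pattern sketch (ElixirFM itself is written in Haskell; this function and its notation are a simplified assumption, not its API):

```python
def interdigitate(root, pattern):
    """Insert the three root consonants (radicals) into a vocalic
    pattern. 'F', 'C', 'L' mark the radical slots, following a common
    Arabic-grammar convention; other characters are copied verbatim."""
    slots = iter(root)
    return "".join(next(slots) if ch in "FCL" else ch for ch in pattern)

verb = interdigitate("ktb", "FaCaLa")   # -> "kataba" ('he wrote')
noun = interdigitate("ktb", "maFCuL")   # -> "maktub" ('written')
```

The point of the paradigm abstraction is that one pattern applies uniformly across lexemes: swapping in a different root yields the corresponding form of another word class member.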
Popularity Meter
Hajič, Jan ; Bojar, Ondřej (advisor) ; Popel, Martin (referee)
Having the possibility of automatically tracking a person's popularity in the newspapers is an idea appealing not just to those in the media spotlight. While sentiment (subjectivity) analysis is a rapidly growing subfield of computational linguistics, no data from the news domain are yet available for Czech. We have therefore started building a manually annotated polarity corpus of sentences from Czech news texts; however, these texts have proven rather unwieldy for such processing. We have also designed a classifier which should be able to track popularity based on this corpus; the classifier has been tested on a corpus of product reviews of domestic appliances, and some introductory testing has been done on the nascent news corpus. As a model, we simply extract a unigram polarity lexicon from the data. We then use three related methods for identifying lemma polarity and a number of simple filters for feature selection. On the domestic appliance data, our simplest model has achieved results comparable to the state of the art; however, the properties of Czech news texts and preliminary results hint that a more linguistically oriented approach might be preferable.
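Extracting a unigram polarity lexicon from sentence-labelled data can be sketched as follows; the averaging scheme and the two toy Czech sentences are illustrative assumptions, not the thesis's exact method:

```python
from collections import defaultdict

def polarity_lexicon(labelled_sentences):
    """Build a unigram polarity lexicon: each lemma is assigned the
    average polarity (-1, 0, +1) of the sentences it occurs in."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lemmas, polarity in labelled_sentences:
        for lemma in set(lemmas):  # count each lemma once per sentence
            sums[lemma] += polarity
            counts[lemma] += 1
    return {lemma: sums[lemma] / counts[lemma] for lemma in sums}

data = [
    (["pračka", "skvělý", "být"], 1),   # positive review sentence
    (["pračka", "rozbitý", "být"], -1), # negative review sentence
]
lex = polarity_lexicon(data)
```

Lemmas that occur in both positive and negative contexts ("pračka", "být") average out to zero, while discriminative lemmas keep a strong score, which is what makes such a lexicon usable for sentence-level polarity prediction.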
Rules for analyzing anaphora in Czech
Nguy, Giang Linh ; Hajič, Jan (advisor) ; Hajičová, Eva (referee)
With the increasing importance of natural language processing, there is a growing body of research on automatic anaphora resolution. This thesis is a contribution to that research. The aim of the work is to propose a set of rules for anaphora resolution in Czech. The created rule set consists of handwritten rules as well as rules developed with the aid of the C4.5 machine learning system. Annotated data from the Prague Dependency Treebank were used for training and testing the rules; the following types of anaphora are captured there: pronominal anaphora, control, reciprocity, and dependency relations of adjuncts. Our work focuses on these types of anaphora. The rules are evaluated with the standard recall and precision measures.
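A handwritten rule of the kind the thesis proposes might look like the following agreement-filter sketch; the data layout and the rule itself are simplified assumptions for illustration:

```python
def antecedent_candidates(pronoun, mentions):
    """Hand-written rule sketch for pronominal anaphora: keep only
    mentions preceding the pronoun that agree with it in gender and
    number, ordered with the closest candidate first."""
    agreeing = [m for m in mentions
                if m["pos"] < pronoun["pos"]
                and m["gender"] == pronoun["gender"]
                and m["number"] == pronoun["number"]]
    return sorted(agreeing, key=lambda m: pronoun["pos"] - m["pos"])

mentions = [
    {"form": "Marie", "gender": "F", "number": "S", "pos": 0},
    {"form": "Petr",  "gender": "M", "number": "S", "pos": 2},
]
pronoun = {"form": "ona", "gender": "F", "number": "S", "pos": 5}
cands = antecedent_candidates(pronoun, mentions)  # only "Marie" agrees
```

A learned system such as C4.5 induces comparable decision rules automatically from features like these, which is why the thesis can combine both kinds of rules in one set.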
Analytical tools for Gregorian chant
Szabová, Kristína ; Hajič, Jan (advisor) ; Pecina, Pavel (referee)
One of the most interesting problems regarding Gregorian chant is its evolution across centuries. Discovering related chants, and, conversely, unrelated ones, is a necessary step in handling the problem, after expert selection of the set of chants to compare. Computational methods may help with this step, as it requires aligning large numbers of chants. While there exist large databases of digitized chants, digital musicology lacks the software necessary to perform this step. This thesis presents a software tool that can help in the discovery of related chants using multiple sequence alignment (MSA) algorithms, methods borrowed from bioinformatics. It enables researchers to align arbitrary sets of related (and unrelated) chants, thus revealing clusters of related melodies. Additionally, it facilitates the discovery of contrafacta and transpositions. Nevertheless, the tool has some limitations: it is run locally, and some of its interactive functionality becomes slow when processing hundreds of chants. Further development is planned as part of an ongoing collaboration with digital musicology researchers from the Czech Academy of Sciences and the Faculty of Arts of Charles University.
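The pairwise core underlying the MSA methods mentioned above can be sketched with the classic Needleman-Wunsch algorithm; the melody encoding as pitch-letter strings and the scoring parameters are illustrative assumptions:

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score for two melodies
    encoded as strings of pitch letters. Multiple sequence alignment
    tools build on this pairwise dynamic-programming core."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap      # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap      # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]

# two melodies differing by one omitted note still align well
similarity = align("gabag", "gabg")  # 4 matches, 1 gap -> 3
```

Running such alignments over all pairs (or jointly, as MSA) and clustering by score is what lets related melodies, contrafacta, and transpositions surface as groups.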
Czech NLP with Contextualized Embeddings
Vysušilová, Petra ; Straka, Milan (advisor) ; Hajič, Jan (referee)
With the increasing amount of digital data in the form of unstructured text, the importance of natural language processing (NLP) increases. The most successful technologies of recent years are deep neural networks. This work applies state-of-the-art methods, namely transfer learning with Bidirectional Encoder Representations from Transformers (BERT), to three Czech NLP tasks: part-of-speech tagging, lemmatization and sentiment analysis. We applied a BERT model with a simple classification head to three Czech sentiment datasets: mall, facebook, and csfd, and we achieved state-of-the-art results. We also explored several possible architectures for tagging and lemmatization and obtained new state-of-the-art results in both tagging and lemmatization with a fine-tuning approach on data from the Prague Dependency Treebank. Specifically, we achieved an accuracy of 98.57% for tagging, 99.00% for lemmatization, and 98.19% for the joint accuracy of both tasks. The best models for all tasks are publicly available.
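The joint accuracy metric quoted above (98.19%) counts a token as correct only when both subtasks agree with the gold annotation; a minimal sketch with invented toy tokens:

```python
def joint_accuracy(gold, pred):
    """Joint tag+lemma accuracy: a token is correct only if both its
    predicted tag and its predicted lemma match the gold annotation,
    so the joint score is at most the minimum of the two per-task
    accuracies."""
    correct = sum(1 for (g_tag, g_lemma), (p_tag, p_lemma)
                  in zip(gold, pred)
                  if g_tag == p_tag and g_lemma == p_lemma)
    return correct / len(gold)

gold = [("NOUN", "pes"), ("VERB", "běžet"), ("NOUN", "park")]
pred = [("NOUN", "pes"), ("VERB", "být"), ("NOUN", "park")]
acc = joint_accuracy(gold, pred)  # second token fails on the lemma
```

This explains why the reported joint accuracy (98.19%) sits below both the tagging (98.57%) and lemmatization (99.00%) accuracies.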

See also: similar author names
2 Hajič, Jakub