National Repository of Grey Literature: 23 records found (records 14 - 23)
English grammar checker and corrector: the determiners
Auersperger, Michal ; Pecina, Pavel (advisor) ; Straňák, Pavel (referee)
Correction of articles in English texts is approached as an article generation task, i.e. each noun phrase is assigned a class corresponding to the definite, indefinite or zero article. Supervised machine learning methods are used to first replicate and then improve upon the best reported result in the literature known to the author. Through feature engineering and a different choice of learning method, a drop in error of about 34% is achieved. The resulting model is further compared to the performance of expert annotators. Although the comparison is not straightforward due to differences in the data, the results indicate that the performance of the trained model is comparable to human-level performance when measured on in-domain data. On the other hand, the model does not generalize well to different types of data. Using a large-scale language model to predict an article (or no article) for each word of the text has not proved successful.
Natural Language Correction
Náplava, Jakub ; Straka, Milan (advisor) ; Straňák, Pavel (referee)
The goal of this thesis is to explore the area of natural language correction and to design and implement neural network models for a range of tasks, from general grammar correction to the specific task of diacritization. The thesis opens with a description of existing approaches to natural language correction. Existing datasets are reviewed and two new datasets are introduced: a manually annotated dataset for grammatical error correction based on CzeSL (Czech as a Second Language) and an automatically created spelling correction dataset. The main part of the thesis then presents the design and implementation of three models and evaluates them on several natural language correction datasets. In contrast to existing statistical systems, the proposed models learn all knowledge from training data; therefore, they do not require an error model or a candidate generation mechanism to be set manually, nor do they need any additional language information such as part-of-speech tags. Our models significantly outperform existing systems on the diacritization task. Considering the spelling and basic grammar correction tasks for Czech, our models achieve the best results for two out of the three datasets. Finally, considering the general grammatical correction for English, our models achieve results which are...
Creating a Bilingual Dictionary using Wikipedia
Ivanova, Angelina ; Zeman, Daniel (advisor) ; Straňák, Pavel (referee)
Title: Creating a Bilingual Dictionary using Wikipedia Author: Angelina Ivanova Department/Institute: Institute of Formal and Applied Linguistics (32-ÚFAL) Supervisor of the master thesis: RNDr. Daniel Zeman Ph.D. Abstract: Machine-readable dictionaries play an important role in the research area of computational linguistics. They have gained popularity in such fields as machine translation and cross-language information extraction. In this thesis we investigate the quality and content of bilingual English-Russian dictionaries generated from the Wikipedia link structure. Wiki-dictionaries differ dramatically from traditional dictionaries: the recall of the basic terminology on Mueller's dictionary was 7.42%. Machine translation experiments with the Wiki-dictionary incorporated into the training set resulted in a rather small but statistically significant drop in translation quality compared to the experiment without the Wiki-dictionary. We supposed that the main reason was the domain difference between the dictionary and the corpus, and we found some evidence that the model with the incorporated dictionary performed better on a test set collected from Wikipedia articles. In this work we show how big the difference between the dictionaries developed from the Wikipedia link structure and the traditional...
Voice command for a TV set
Černý, Patrik ; Straňák, Pavel (advisor) ; Peterek, Nino (referee)
Title: Voice command for a TV set Author: Patrik Černý Department: Institute of Formal and Applied Linguistics Supervisor: Mgr. Pavel Straňák, Ph.D. Abstract: The goal of this thesis is to create television voice control intended for people with speech and movement disorders. This is achieved by interconnecting a computer and a television. Voice control is based on the well-known dynamic time warping algorithm. It has been shown that, due to large and frequent changes in sound intensity, voice control of a television is quite a complex task. The word recognition success rate of the final application is not very high, but it is sufficient for the purpose. Thanks to the application design, the program can easily be extended with techniques that can improve recognition effectiveness. Keywords: voice control, word recognition, dynamic time warping, television
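To illustrate the dynamic time warping idea named in the abstract, here is a minimal sketch. It is not the thesis's implementation: it assumes each utterance has already been reduced to a one-dimensional sequence of numbers (for example, per-frame intensity), and the names `dtw_distance`, `recognize` and the `templates` dictionary are hypothetical.

```python
from math import inf

def dtw_distance(a, b):
    """Return the dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two frames
            # extend the cheapest of: repeat b[j-1], repeat a[i-1], advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(sample, templates):
    """Return the label of the stored template closest to the sample (assumed setup)."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))
```

Because the alignment may stretch or compress either sequence, a command spoken more slowly than its stored template can still match it closely. A real recognizer would typically compare multi-dimensional acoustic feature vectors rather than single numbers.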
Today's news
Jankovský, Petr ; Holan, Tomáš (advisor) ; Straňák, Pavel (referee)
The project deals with the design and implementation of a program based on frequency analysis of text. The results should provide a quick overview of articles currently published in newspapers. The program downloads current articles from newspaper web sites. For each defined section and each article, it can list the most frequent n-tuples of words. There is an option to define a dictionary of uninteresting (banned) words and a dictionary of phrases. The implementation addresses several problems with downloading articles from servers with differing structure, such as character encoding and distinguishing articles from advertisements. The work reveals that simple frequency analysis can bring interesting results.
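The n-tuple frequency analysis described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name `top_ngrams` is hypothetical, tokenization is a simple lowercase word split, and n-tuples containing a banned word are skipped (the abstract does not say how the banned-word dictionary is applied).

```python
import re
from collections import Counter

def top_ngrams(text, n=2, banned=frozenset(), k=3):
    """Return the k most frequent word n-tuples in text, with their counts."""
    # \w+ matches Unicode word characters, so accented letters are kept
    words = re.findall(r"\w+", text.lower())
    # zip over n shifted copies of the word list yields consecutive n-tuples
    counts = Counter(
        gram
        for gram in zip(*(words[i:] for i in range(n)))
        if not any(word in banned for word in gram)  # assumption: skip banned words
    )
    return counts.most_common(k)
```

For example, calling `top_ngrams(article_text, n=2, banned={"the", "a"})` on a downloaded article would list its most frequent word pairs that contain no banned word.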
Annotation of Multiword Expressions in the Prague Dependency Treebank
Straňák, Pavel ; Hajič, Jan (advisor) ; Pala, Karel (referee) ; Pecina, Pavel (referee)
This thesis explores annotation of multiword expressions in the Prague Dependency Treebank 2.0. We explain what we understand as multiword expressions (MWEs), review the state of PDT 2.0 with respect to MWEs and present our annotation. We describe the data format developed for the annotation, the annotation tool, and other software developed to allow for visualisation and searching of the data. We also present the annotation lexicon SemLex and an analysis of the annotation.
Advanced Czech spell-checker
Richter, Michal ; Straňák, Pavel (advisor) ; Žabokrtský, Zdeněk (referee)
The aim of this work is to implement a Czech spell-checker using several language models and a lexical morphological analyser in order to offer proper correction suggestions and also to find real-word spelling errors (spelling errors that happen to be in the lexicon). The system should also be able to add diacritics to Czech text. Mac OS X was chosen as the target platform for the application. During the implementation, emphasis was put especially on memory-efficient representation of the above-mentioned statistical models. In the beginning, a gentle introduction to Hidden Markov Models, language models and the Viterbi algorithm is given. The actual system implementation and the training of the statistical models are discussed next. In the final part of the work, the achieved results are evaluated and discussed in depth.
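As a rough illustration of how the Viterbi algorithm can restore diacritics, the sketch below picks, for each undiacritized token, the variant that maximizes the total bigram log-probability of the sentence. It is not the system described in the thesis: `viterbi_diacritics`, the `candidates` mapping and the `bigram_logp` scoring function are all hypothetical, and a real system would use trained language models.

```python
def viterbi_diacritics(tokens, candidates, bigram_logp):
    """Choose the most probable diacritized variant of each token.

    candidates maps a plain token to its possible variants (assumed input);
    bigram_logp(prev, word) returns a log-probability (assumed scoring function).
    """
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {"<s>": (0.0, [])}
    for token in tokens:
        step = {}
        for cand in candidates.get(token, [token]):
            # keep only the best-scoring predecessor for this candidate
            score, path = max(
                ((prev_score + bigram_logp(prev, cand), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    # the overall best path is the best-scoring entry of the last step
    return max(best.values(), key=lambda item: item[0])[1]
```

Keeping one best path per candidate at each step avoids enumerating every combination of variants, so the cost grows linearly with sentence length rather than exponentially.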
N-gram language model for a Czech spellchecker
Richter, Michal ; Bojar, Ondřej (referee) ; Straňák, Pavel (advisor)
The aim of this thesis is to explore the possibilities of using n-gram language models for spellchecking Czech texts and to implement an extension to the spellchecker that can find misspelled words which are themselves valid Czech words. A further aim was to implement a simple web application to present the extended spellchecker. The influence of lemmatization and morphological analysis on the hit rate of finding misspelled words was also examined. The language modelling methods used in the thesis are described first. This is followed by a description of how the spellchecking program uses the language models. The next part shows how the data for language model training were obtained. The following part presents an evaluation of the language models created. The final part shows the results achieved for each spellchecking option.
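A minimal sketch of the general idea, detecting a misspelling that is itself a valid word by asking a language model whether a similar word fits the context better, is given below. It is a simplification and not the thesis's method: it uses a bigram model with add-one smoothing, looks only at the preceding word, and relies on hand-supplied confusion sets. All function names and the `confusion_sets` structure are hypothetical.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def flag_real_word_errors(sentence, confusion_sets, unigrams, bigrams):
    """Return (position, word, suggestion) for words a confusable word fits better."""
    tokens = ["<s>"] + sentence.split()
    suggestions = []
    for i in range(1, len(tokens)):
        prev, word = tokens[i - 1], tokens[i]
        best, best_p = word, bigram_prob(prev, word, unigrams, bigrams)
        # only a strictly more probable alternative replaces the original word
        for cand in sorted(confusion_sets.get(word, ())):
            p = bigram_prob(prev, cand, unigrams, bigrams)
            if p > best_p:
                best, best_p = cand, p
        if best != word:
            suggestions.append((i - 1, word, best))
    return suggestions
```

The example is language-independent; for Czech, the confusion sets could be generated from words within a small edit distance of each other, and higher-order n-grams or lemma-based models could replace the bigram counts.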
Automatic cleaning of HTML documents
Marek, Michal ; Straňák, Pavel (referee) ; Pecina, Pavel (advisor)
This paper describes a system for automatic cleaning of HTML documents, which was used in Charles University's participation in CLEANEVAL 2007. CLEANEVAL is a shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We approach this task as a sequence-labeling problem; our experimental system is based on Conditional Random Fields and exploits a set of features extracted, for each block of text, from the textual content and HTML structure of the analyzed web pages.
Software for a Czech-Chinese and Chinese-Czech dictionary
Hudeček, Jan ; Straňák, Pavel (referee) ; Homola, Petr (advisor)
The Czech-Chinese and Chinese-Czech dictionary is an electronic dictionary which can be used by both a beginner and a seasoned translator. It allows searching in both directions and a full-text search for a given expression. Data access is hybrid: the program checks whether it can access the database, and if that fails, it reads the data files. Moreover, users can change the data source at run-time. The program builds indexes on the data file, speeding up searches considerably. Indexes can be hash tables or binary trees. Asynchronous multithreaded IO was implemented to make the GUI more comfortable to use. The .NET framework and MS SQL Server as a platform guarantee rapid development, deployment and scalability; for example, adding a web application to the project would be quite easy. At the same time, the design of the system allows for future improvements, for instance editing the dictionary from the GUI.
