National Repository of Grey Literature: 23 records found
Implementation of a software keyboard to input text into the machine translation application
Dvořák, Šimon ; Straňák, Pavel (advisor) ; Popel, Martin (referee)
A vast number of applications need to consume textual input from their users, and translation web applications are no exception. Unlike in other applications, their textual input is very diverse: all kinds of characters, keyboard layouts, and users with little or no knowledge of the source language can occur. In this thesis, we try to develop means of making input into translation web applications more comfortable. We developed a configurable software keyboard supporting multiple features: defining multiple keyboard layouts, remapping the physical keys to the active layout's keys, next-word prediction, and phonetic input correction. Thanks to its straightforward architecture, the software keyboard is easily extensible.
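The remapping feature described above can be pictured as a small lookup from physical key codes to the active layout's characters. This is only an illustrative sketch; the layout tables and key codes below are invented, not taken from the thesis.

```python
# Toy layout tables mapping physical keys to layout-specific characters.
# These mappings are invented for illustration only.
LAYOUTS = {
    "cs": {"2": "ě", "3": "š", "4": "č"},
    "ru": {"q": "й", "w": "ц"},
}

def remap(physical_key, active_layout):
    """Return the character the active layout assigns to a physical key,
    falling back to the key itself when the layout does not remap it."""
    layout = LAYOUTS.get(active_layout, {})
    return layout.get(physical_key, physical_key)
```

A keyboard handler would call `remap` on every keystroke, so switching the active layout changes the produced text without touching the physical-event code.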
Methodology for preparing data from digital libraries for use in digital humanities
Lehečka, B. ; Novák, D. ; Kersch, Filip ; Hladík, Radim ; Bíšková, J. ; Sekyrová, K. ; Válek, F. ; Vozár, Z. ; Bodnár, N. ; Sekan, P. ; Bežová, M. ; Žabička, P. ; Lhoták, Martin ; Straňák, Pavel
This methodology aims to offer libraries and other memory institutions in the Czech Republic a recommended procedure for making large volumes of data available for research purposes. A critical mass of documents from library collections has already been digitized, and the results of that digitization are presented in a variety of digital library systems. When making them available, it is always necessary to proceed from the current version of the copyright law, but it is already possible to prepare for its significant forthcoming amendment, which implements Directive 2019/790 of the European Parliament and of the Council and concerns, among other things, the extraction of texts and data for scientific purposes. The architecture of the newly developed system for digital libraries recommended by the methodology will ensure scalability, easy management, and the development of related services. The presented methods of data processing and enrichment, as well as the output formats, are based on the requirements of specialists from across the humanities.
Software for a Czech-Chinese and Chinese-Czech dictionary
Hudeček, Jan ; Homola, Petr (advisor) ; Straňák, Pavel (referee)
The Czech-Chinese and Chinese-Czech dictionary is an electronic dictionary that can be used both by beginners and by seasoned translators. It allows searching in both directions and full-text search for a given expression. Data access is hybrid: the program first checks whether it can access the database, and if that fails, it reads the data files. Moreover, users can change the data source at run time. The program builds indexes over the data file, speeding searches up considerably; the indexes can be hash tables or binary trees. Asynchronous multithreaded I/O was implemented to keep the GUI responsive. The .NET Framework and MS SQL Server as a platform guarantee rapid development, deployment, and scalability; adding a web application to the project, for example, would be quite easy. At the same time, the design of the system allows for future improvements, such as editing the dictionary from the GUI.
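The index-over-a-data-file idea can be sketched in a few lines: instead of scanning the whole file for each query, the program maps each headword to the positions of its entries once, then answers lookups from the map. This is a minimal sketch assuming a hypothetical tab-separated "headword<TAB>translation" file format, not the thesis's actual data layout.

```python
from collections import defaultdict

def build_index(entries):
    """Map each headword to the entry numbers where it occurs,
    so a lookup avoids a linear scan of the whole data file."""
    index = defaultdict(list)
    for pos, line in enumerate(entries):
        headword, _, _ = line.partition("\t")  # text before the first tab
        index[headword].append(pos)
    return index

# Toy data: one entry per line, headword and translation separated by a tab.
entries = ["kniha\tbook", "slovník\tdictionary", "kniha\tvolume"]
idx = build_index(entries)
```

A hash-table index like this gives expected constant-time lookup; a binary (search) tree, the thesis's other option, trades that for ordered traversal, which suits prefix searches.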
N-gram language model for a Czech spellchecker
Richter, Michal ; Straňák, Pavel (advisor) ; Bojar, Ondřej (referee)
The aim of this thesis is to explore the possibilities of using n-gram language models for spellchecking Czech texts and to implement an extension to the spellchecker that is able to find misspelled words that are themselves valid Czech words. A further aim was to implement a simple web application presenting the extended spellchecker. The influence of lemmatization and morphological analysis of words on the hit rate of finding misspelled words was also examined. The methods of language modelling used in the thesis are described first. What follows is a description of how the spellchecking program uses the language models. The next part shows how the data for language-model training were obtained. The following part presents the evaluation of the language models created, and the final part shows the results achieved for each spellchecking option.
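The core use of an n-gram model for "real-word" errors is to rank correction candidates by how well they fit the context. The sketch below shows the idea with a Laplace-smoothed bigram model; the counts, smoothing choice, and candidate set are toy assumptions, not the thesis's actual model.

```python
def bigram_score(prev_word, word, bigram_counts, unigram_counts, vocab_size):
    """Laplace-smoothed estimate of P(word | prev_word)."""
    return (bigram_counts.get((prev_word, word), 0) + 1) / \
           (unigram_counts.get(prev_word, 0) + vocab_size)

def best_candidate(prev_word, candidates, bigram_counts, unigram_counts, vocab_size):
    """Pick the candidate the language model finds most likely in context."""
    return max(candidates,
               key=lambda w: bigram_score(w if False else prev_word and w or w,
                                          w, bigram_counts, unigram_counts,
                                          vocab_size)
               if False else
               bigram_score(prev_word, w, bigram_counts, unigram_counts,
                            vocab_size))

# Toy counts: "the cat" was seen 3 times, "the" 10 times, vocabulary of 5 words.
bigrams = {("the", "cat"): 3}
unigrams = {"the": 10}
choice = best_candidate("the", ["cat", "cot"], bigrams, unigrams, 5)
```

Given the context word "the", the valid-but-wrong word "cot" scores lower than "cat", which is exactly how a real-word error can be flagged even though both words are in the lexicon.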
Automatic Cleaning of HTML Documents
Marek, Michal ; Pecina, Pavel (advisor) ; Straňák, Pavel (referee)
This paper describes a system for automatic cleaning of HTML documents, which was used in Charles University's participation in CLEANEVAL 2007. CLEANEVAL is a shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages, with the goal of preparing web data for use as corpora in computational linguistics and natural language processing. We treat the task as a sequence-labeling problem; our experimental system is based on Conditional Random Fields and exploits a set of features extracted from the textual content and HTML structure of each block of text in the analyzed web pages.
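In a sequence-labeling formulation, each block of a page is turned into a feature dictionary before the CRF assigns it a label such as content or boilerplate. The feature set below is an illustrative assumption of the kind of textual and structural cues such a system might use, not the paper's actual feature set.

```python
import re

def block_features(text, tag):
    """Features for one text block, drawing on both the text itself
    and the HTML element that encloses it (illustrative only)."""
    words = text.split()
    return {
        "tag": tag,                                   # enclosing HTML element
        "n_words": len(words),                        # boilerplate is often short
        "avg_word_len": (sum(map(len, words)) / len(words)) if words else 0.0,
        "has_boilerplate_cue": bool(
            re.search(r"copyright|all rights reserved", text.lower())),
        "ends_sentence": text.rstrip().endswith((".", "!", "?")),
    }

features = block_features("This is a full sentence.", "p")
```

A CRF then scores label sequences over consecutive blocks, so a navigation block surrounded by other navigation blocks is more likely to be labeled boilerplate than the same text in isolation.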
Advanced Czech Spell-Checker
Richter, Michal ; Straňák, Pavel (advisor) ; Žabokrtský, Zdeněk (referee)
The aim of this work is to implement a Czech spell-checker using several language models and a lexical morphological analyser in order to offer proper correction suggestions and also to find real-word spelling errors (spelling errors that happen to be in the lexicon). The system should also be able to complete diacritics in Czech text. Mac OS X was chosen as the target platform for the application. During the implementation, emphasis was put especially on a memory-efficient representation of the above-mentioned statistical models. A gentle introduction to Hidden Markov Models, language models, and the Viterbi algorithm is given first. The actual system implementation and the training of the statistical models are discussed next. In the final part of the work, the achieved results are evaluated and discussed in depth.
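Tasks like diacritics completion are naturally framed as HMM decoding: the observed words are the diacritic-stripped forms, the hidden states are the diacritized variants, and the Viterbi algorithm finds the most probable state sequence. Below is a compact, generic Viterbi decoder; the probability tables in the usage are toy values, not trained Czech models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for `obs`.
    start_p[s], trans_p[s1][s2], emit_p[s][o] are probabilities."""
    # Initialization with the first observation.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    path = {s: [s] for s in states}
    # Recursion: extend the best path into each state.
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(o, 0.0), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy two-state model for illustration.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
decoded = viterbi(["x", "y"], states, start, trans, emit)
```

A production system would work in log-space to avoid underflow on long sentences, which is one place the thesis's memory- and numerics-conscious representation matters.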
Today's news
Jankovský, Petr ; Holan, Tomáš (advisor) ; Straňák, Pavel (referee)
The project deals with the design and implementation of a program based on frequency analysis of text. The results should provide a quick overview of articles currently published in the newspapers. The program downloads current articles from newspaper web sites and, for each defined section and each article, can list the most frequent n-tuples of words. There is an option to define a dictionary of uninteresting (banned) words and a dictionary of phrases. The implementation solves several problems with downloading articles from servers with varying structure, such as problems with encoding and with distinguishing articles from advertisements. The work shows that simple frequency analysis can bring interesting results.
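The frequency step, counting the most common word n-tuples while skipping a banned-word list, can be sketched as follows; the article texts and the choice of word pairs (2-tuples) here are toy assumptions for illustration.

```python
from collections import Counter

def top_bigrams(texts, banned, n=3):
    """Count word pairs across all texts, skipping banned words,
    and return the n most frequent pairs with their counts."""
    counts = Counter()
    for text in texts:
        words = [w.lower() for w in text.split() if w.lower() not in banned]
        counts.update(zip(words, words[1:]))  # consecutive word pairs
    return counts.most_common(n)

# Toy "articles" standing in for downloaded newspaper text.
articles = ["Prime minister visits Prague", "Prime minister resigns"]
top = top_bigrams(articles, banned=set(), n=1)
```

The same `Counter.update(zip(...))` pattern generalizes to longer n-tuples by zipping more shifted copies of the word list.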
Creating a Bilingual Dictionary using Wikipedia
Ivanova, Angelina ; Zeman, Daniel (advisor) ; Straňák, Pavel (referee)
Title: Creating a Bilingual Dictionary using Wikipedia Author: Angelina Ivanova Department/Institute: Institute of Formal and Applied Linguistics (32-ÚFAL) Supervisor of the master thesis: RNDr. Daniel Zeman Ph.D. Abstract: Machine-readable dictionaries play an important role in computational linguistics research. They have gained popularity in fields such as machine translation and cross-language information extraction. In this thesis we investigate the quality and content of bilingual English-Russian dictionaries generated from the Wikipedia link structure. Wiki-dictionaries differ dramatically from traditional dictionaries: the recall of the basic terminology in Mueller's dictionary was 7.42%. Machine translation experiments with the Wiki-dictionary incorporated into the training set resulted in a rather small, but statistically significant, drop in translation quality compared to the experiment without the Wiki-dictionary. We supposed that the main reason was the domain difference between the dictionary and the corpus, and obtained some evidence that on a test set collected from Wikipedia articles the model with the incorporated dictionary performed better. In this work we show how big the difference between the dictionaries developed from the Wikipedia link structure and the traditional...
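The core of building such a Wiki-dictionary is that each article's interlanguage links pair its English title with a Russian one, yielding translation entries. The sketch below assumes a simplified input structure of (title, language-links) pairs; the real pipeline would read these from a Wikipedia dump or API, and the thesis's actual extraction may differ.

```python
def wiki_dictionary(articles):
    """Build an English-to-Russian dictionary from interlanguage links.
    `articles` is an iterable of (english_title, {lang_code: foreign_title})."""
    pairs = {}
    for en_title, langlinks in articles:
        ru_title = langlinks.get("ru")
        if ru_title:  # keep only articles that link to a Russian counterpart
            pairs[en_title.lower()] = ru_title
    return pairs

# Toy input standing in for parsed Wikipedia data.
dictionary = wiki_dictionary([
    ("Dog", {"ru": "Собака", "de": "Haushund"}),
    ("Cat", {}),  # no Russian link, so no entry
])
```

The low recall against a traditional dictionary reported above follows naturally from this construction: article titles are dominated by named entities and encyclopedic terms, not everyday basic vocabulary.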
Orthography Standardization in Arabic Dialects
Cayralat, Christian ; Zeman, Daniel (advisor) ; Straňák, Pavel (referee)
Orthography Standardization in Arabic Dialects Abstract Christian Cayralat, Charles University. Spontaneous orthography in Arabic dialects poses one of the biggest obstacles in the way of Dialectal Arabic NLP applications. As the Arab world enjoys a wide array of these widely spoken and recently written, non-standard, low-resource varieties, this thesis presents a detailed account of this relatively overlooked phenomenon. It sets out to show that continuously creating additional noise-free, manually standardized corpora of Dialectal Arabic does not free us from the shackles of non-standard (spontaneous) orthography. Because real-world data will most often come in a noisy format, it also investigates ways to reduce the amount of noise in textual data. As a proof of concept, we restrict ourselves to one of the dialectal varieties, namely Lebanese Arabic. The thesis also strives to gain a better understanding of the nature of the noise and its distribution. All of this is done by leveraging various spelling-correction and morphological-tagging neural architectures in a multi-task setting, and by annotating a Lebanese Arabic corpus for spontaneous orthography standardization, and morphological segmentation and tagging, among other features. Additionally, a detailed taxonomy of spelling inconsistencies for...
Adaptive Handwritten Text Recognition
Procházka, Štěpán ; Straka, Milan (advisor) ; Straňák, Pavel (referee)
The need to preserve and exchange written information is central to human society, with handwriting having satisfied this need for several past millennia. Unlike optical character recognition of typeset fonts, which has been thoroughly studied in the last few decades, the considerably harder task of handwritten text recognition lacks such attention. In this work, we study the capabilities of deep convolutional and recurrent neural networks to solve handwritten text extraction. To mitigate the need for a large quantity of real ground-truth data, we propose a suitable synthetic data generator for model pre-training, and carry out an extensive set of experiments to devise a self-training strategy that adapts the model to unannotated real handwritten letterings. The proposed approach is compared to supervised approaches and state-of-the-art results on both established and novel datasets, achieving satisfactory performance.

See also: similar author names
3 Straňák, Peter
4 Straňák, Petr