National Repository of Grey Literature
Text summarization
Majliš, Martin ; Pecina, Pavel (advisor) ; Schlesinger, Pavel (referee)
The present work explains the basic principles of automatic summarization and evaluation, and the fundamental concepts used in this field. It also describes a system for automatic text summarization and evaluation, CSummaK (Czech Summarization Kit). The system includes basic algorithms for creating sentence-extract summaries (Centroid, Lead, Position, Random, Relevance Measure, etc.) and for their evaluation (Precision, Recall, F-Measure, etc.), all of which are described in this work. The system was used to produce automatic extracts from news articles. Another system was developed for obtaining reference extracts; it allows users to create extracts from news articles on-line. The work also evaluates the quality of the individual algorithms and their combinations under different parameters, together with a discussion of possible practical applications.
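The abstract names standard extraction-based metrics. As a rough illustration only (not the CSummaK implementation), sentence-level Precision, Recall, and F-Measure against a reference extract, plus the Lead baseline mentioned above, can be sketched in Python like this:

    def extract_scores(system_sents, reference_sents):
        # Sentence-level Precision/Recall/F-Measure for an extractive
        # summary, treating sentences as set members. Illustrative
        # sketch only; not the thesis's actual code.
        system = set(system_sents)
        reference = set(reference_sents)
        overlap = len(system & reference)
        precision = overlap / len(system) if system else 0.0
        recall = overlap / len(reference) if reference else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    def lead_summary(sentences, k=3):
        # The Lead baseline simply takes the first k sentences.
        return sentences[:k]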
Large Multilingual Corpus
Majliš, Martin ; Žabokrtský, Zdeněk (advisor) ; Spousta, Miroslav (referee)
This thesis introduces the W2C Corpus, which covers 97 languages with more than 10 million words for each language and a total size of 10.5 billion words. The corpus was built by crawling the Internet. The work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing webpages, and finally removing duplicates. A comparative analysis of the texts of Wikipedia and the Internet is provided at the end of the thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.
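For the statistics named in the abstract, a minimal sketch follows, assuming whitespace tokenization, period-based sentence splitting, and character-bigram conditional entropy on non-empty text; the thesis's exact definitions and tooling may differ:

    import math
    from collections import Counter

    def corpus_stats(text):
        # Average word and sentence length under naive splitting.
        words = text.split()
        sentences = [s for s in text.split('.') if s.strip()]
        avg_word_len = sum(len(w) for w in words) / len(words)
        avg_sent_len = len(words) / len(sentences)

        # Conditional entropy H(c_i | c_{i-1}) estimated from
        # character bigram counts, and the derived perplexity 2^H.
        bigrams = Counter(zip(text, text[1:]))
        contexts = Counter(text[:-1])
        total = sum(bigrams.values())
        h = 0.0
        for (prev, cur), n in bigrams.items():
            p_xy = n / total            # joint probability p(prev, cur)
            p_cond = n / contexts[prev] # conditional p(cur | prev)
            h -= p_xy * math.log2(p_cond)
        return avg_word_len, avg_sent_len, h, 2 ** h

Lower conditional entropy (and perplexity) indicates more predictable text, which is the kind of contrast the thesis draws between Wikipedia and general web text.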