National Repository of Grey Literature 4 records found
Collecting XML data and meta-data from the Internet
Sochna, Jan ; Bednárek, David (advisor) ; Žemlička, Michal (referee)
This diploma thesis designs and implements a system for collecting XML-family data from the Internet. The aim is to automate the data collection process and to download complete structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution, and the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis in order to process the data. The main outcome of the thesis is a usable system that collects XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents and improve the ratio of targeted XML-family documents.
Application for manual word alignment
Sochna, Jan ; Pecina, Pavel (advisor) ; Raab, Jan (referee)
The aim of this work was to design and implement a platform-independent, fast, flexible, and user-friendly interface for manual word alignment of bilingual texts. The new interface avoids the shortcomings of existing similar tools and improves the performance of the manual alignment process. It provides, e.g., semi-automatic alignment of simple texts, group operations on alignments, and alignment of phrases, and it allows one of the sentences to be shifted along the line to make the alignment clearer when the lengths of the aligned sentences differ substantially. The preceding and succeeding context of the currently aligned sentences is shown in both languages. Last but not least, the tool provides alignment performance statistics. Along with the usual "row view", where the two sentences are shown in parallel in two rows, one above the other, and aligned by connecting corresponding words, a "matrix view" was also introduced, where the words in one language serve as the row labels of a matrix, the words in the other language serve as the column labels, and the alignment of two corresponding words is expressed by highlighting the intersection of the corresponding row and column. It is possible to switch between the two views at any time during the alignment process.
Application for manual word alignment
Sochna, Jan ; Raab, Jan (referee) ; Pecina, Pavel (advisor)
The aim of this work was to design and implement a platform-independent, fast, flexible, and user-friendly interface for manual word alignment of bilingual texts. The new interface avoids the shortcomings of existing similar tools and improves the performance of the manual alignment process. It provides, e.g., semi-automatic alignment of simple texts, group operations on alignments, and alignment of phrases, and it allows one of the sentences to be shifted along the line to make the alignment clearer when the lengths of the aligned sentences differ substantially. The preceding and succeeding context of the currently aligned sentences is shown in both languages. Last but not least, the tool provides alignment performance statistics. Along with the usual "row view", where the two sentences are shown in parallel in two rows, one above the other, and aligned by connecting corresponding words, a "matrix view" was also introduced, where the words in one language serve as the row labels of a matrix, the words in the other language serve as the column labels, and the alignment of two corresponding words is expressed by highlighting the intersection of the corresponding row and column. It is possible to switch between the two views at any time during the alignment process.
Collecting XML data and meta-data from the Internet
Sochna, Jan ; Žemlička, Michal (referee) ; Bednárek, David (advisor)
This diploma thesis designs and implements a system for collecting XML-family data from the Internet. The aim is to automate the data collection process and to download complete structures of XML documents. Four existing data collection systems were first compared in order to choose one as the basis of the solution, and the open-source web crawler Apache Nutch was identified as the most suitable. The necessary extensions and modifications of the crawler were then designed and implemented to make it efficient at downloading XML-family documents. The downloaded XML-family data were analyzed and evaluated using the Analyzer application, which was enhanced within this thesis in order to process the data. The main outcome of the thesis is a usable system that collects XML-family documents from the Internet. The implemented modifications and extensions eliminate the download of "useless" documents and improve the ratio of targeted XML-family documents.
