National Repository of Grey Literature
Information Extraction from Web Pages Using Extraction Ontologies (Extrakce informací z webových stránek pomoci extrakčních ontologií)
Labský, Martin ; Berka, Petr (advisor) ; Strossa, Petr (referee) ; Vojtáš, Peter (referee) ; Snášel, Václav (referee)
Automatic information extraction (IE) from various types of text has become very popular over the last decade. Owing to information overload, many practical applications can make use of semantically labelled data extracted from textual sources such as the Internet, emails and intranet documents, and even from conventional sources such as newspapers and magazines. Applications of IE exist in many areas of computer science, including information retrieval systems, question answering and website quality assessment. This work focuses on developing IE methods and tools that are particularly suited to extraction from semi-structured documents such as web pages, and to situations where the available training data is limited. The main contribution of this thesis is the proposed approach of extended extraction ontologies. It attempts to combine extraction evidence from three distinct sources: (1) manually specified extraction knowledge, (2) existing training data and (3) formatting regularities that are often present in online documents. The underlying hypothesis is that an extraction algorithm which uses evidence of all three types can achieve better extraction accuracy and robustness. The motivation for this work has been the lack of described methods and tools that exploit these types of extraction evidence at the same time. This thesis first describes a statistically trained approach to IE based on Hidden Markov Models, which is integrated with an image classification algorithm in order to extract product offers from the Internet, including textual items as well as images. This approach is evaluated on a bicycle sale domain. Several methods of image classification using various feature sets are also described and evaluated. These trained approaches are then integrated into the proposed novel approach of extended extraction ontologies, which builds on the work of Embley [21] by exploiting manual, trained and formatting types of extraction evidence at the same time.
The intended benefits of using extraction ontologies are the rapid development of a functional IE prototype, a smooth transition from that prototype to a deployed IE application, and the ability to draw on each of the three types of extraction evidence. In addition, since extraction ontologies are typically developed by adapting suitable domain ontologies, and the ontology remains at the center of the extraction process, the work needed to convert extracted results back to a domain ontology or schema is minimized. The described approach is evaluated on several distinct real-world datasets.
