National Repository of Grey Literature 33 records found  beginprevious14 - 23next  jump to record: Search took 0.01 seconds. 
Automatic Webpage Content Categorisation and Extraction
Rein, Michal ; Koutenský, Michal (referee) ; Dolejška, Daniel (advisor)
Tato práce popisuje vývoj flexibilního systému pro automatickou kategorizaci a extrakci obsahu z webových stránek, se zaměřením na prostředí darknetu. Navrhli jsme vysoce přizpůsobitelný a škálovatelný systém, který dokáže zpracovávat různorodý typ obsahu, přičemž jsme dbali na kvalitu návrhu celkové architektury, struktury databáze a samotného algoritmu pro zpracování dat. Použitím nejmodernějšího jazykového modelu trénovaného na úkolu inference přirozeného jazyka demonstrujeme potenciál modelu efektivně kategorizovat obsah v zcela neznámém prostředí, přičemž jsme provedli analýzu výkonu daného modelu za použití různých hypotetických šablon. Dále jsme do systému integrovali model pro rozpoznávání pojmenovaných entit a metodologii šablonování pro extrakci obsahu, přičemž jsme navrhli automatizovaný přístup k segmentaci obsahu webových stránek za pomocí modelu ChatGPT od společnosti OpenAI. V neposlední řadě jsme vyvinuli uživatelsky přívětivou webovou aplikaci pro zlepšení dostupnosti a snadné použití systému, zhodnotili dosažené výsledky a navrhli možnosti pro další výzkum a vývoj v dané oblasti.
Important Entity Recognition in Web Page Text
Svítková, Veronika ; Hynek, Jiří (referee) ; Burget, Radek (advisor)
The aim of this thesis is training named entity recognition model on a dataset created using structured data. Datasets were created from the names of products and books extracted from structured data in JSON-LD and Microdata format. Structured data were extracted from e-shop and social cataloging websites by web scraping. Names were used as a dataset by themselves as well as webpage text with automatically annotated matches of the names. In total eight models in Czech language were trained for recognizing names of products and books using spaCy library. F-score results are up to 89.94 for products and up to 84.26 for books evaluated on a created testing dataset.
Call Sign Detection and Recognition in VHF Communication
Dedič, Juraj ; Kocour, Martin (referee) ; Szőke, Igor (advisor)
This work explores the processing of data from air traffic communication in order to detect and recognize the~call signs it contains. Particularly it involves recognizing these call signs in human made and automated text transcripts of the communication between pilots and air traffic controllers. The thesis compares various ways of solving this and describes their problems. It implements a system for the identification of these call signs using a suitable technology based on large language models. One of the outputs of this work is a service that is able to distinguish the call signs, which enables indexation and sorting of this data in an efficient way.
Named Entity Recognition Exploiting Sub Word Information
Dobrovodský, Patrik ; Egorova, Ekaterina (referee) ; Kesiraju, Santosh (advisor)
Cieľom tejto bakalárskej práce je zhotovenie systému rozpoznania názvoslovnej entity zhotovenej na základe modelu, ktorý bol nedávno považovaný za jeden z najmodernejších a popri tom skúma aký vplyv majú podslovné informácie na nahradenie slov mimo slovnej zásoby. Vytvorený systém vedľa anglického jazyka podporuje aj dva Indo-Európske jazyky konkrétne nemčinu a maďarčinu. Bakalárska práca predstavuje systém využívajúci hlboké učenie pre rozpoznávanie názvoslovných entít, ktorý používa predtrénované a samotrénované slovné vnorenia, zriedkavé vnorenia a charakterové vnorenia vyzdvihnuté konvolučnou neurónovou sieťou. Tieto vnorenia najprv spracujeme sekvenčnou (dlhodobá-krátkodobá pamäť) a potom charakteristickou (podmienené náhodné pole) metódou. Cieľom je dosiahnuť podobnú F1-mieru akú má inšpiračný model s možnosťou porovnania s ostatnými modernými systémami. Výsledkom našej práce je systém, ktorý na anglickej testovacej sade CoNLL 2003 dosiahol 90.98%-né F1-mieru používajúci predtrénované vnorenia a približuje sa k inšpiračnej práci s hodnotou 91.26%. V prípade ďalších jazykov používajúcich samotrénované slovné vnorenia dosiahol systém na testovacej sade WikiAnn pre nemčinu 89.34%-nú a pre maďarčinu 93.04%-nú F1-mieru.
Named Entity Disambiguation in Slovak
Križan, Samuel ; Otrusina, Lubomír (referee) ; Smrž, Pavel (advisor)
Thesis deals with the topic of named entity recognition and disambiguation. A basic system was created which includes all prequisitions necessary for named entity disambiguation in Slovak language. Part of the system is building of a knowledge base out of an export from Slovak Wikipedia. This was subsequently compared to knowledge base obtained from Wikidata, which revealed that the main contribution of Wikipedia knowledge base for Slovak language is greater coverage of entities with link to Slovak Wikipedia and better determination of entity classes. Apart from that, morfological dictionary of KNOT@FIT research group was updated, which yielded an improvement by 33-39 %. This work presumes possible utilization in relation to system extention by a disambiguation modul and enhancement of alternative names coverage.
Neural Network Based Named Entity Recognition
Straková, Jana ; Hajič, Jan (advisor) ; Černocký, Jan (referee) ; Konopík, Miloslav (referee)
Title: Neural Network Based Named Entity Recognition Author: Jana Straková Institute: Institute of Formal and Applied Linguistics Supervisor of the doctoral thesis: prof. RNDr. Jan Hajič, Dr., Institute of Formal and Applied Linguistics Abstract: Czech named entity recognition (the task of automatic identification and classification of proper names in text, such as names of people, locations and organizations) has become a well-established field since the publication of the Czech Named Entity Corpus (CNEC). This doctoral thesis presents the author's research of named entity recognition, mainly in the Czech language. It presents work and research carried out during CNEC publication and its evaluation. It fur- ther envelops the author's research results, which improved Czech state-of-the-art results in named entity recognition in recent years, with special focus on artificial neural network based solutions. Starting with a simple feed-forward neural net- work with softmax output layer, with a standard set of classification features for the task, the thesis presents methodology and results, which were later used in open-source software solution for named entity recognition, NameTag. The thesis finalizes with a recurrent neural network based recognizer with word embeddings and character-level word embeddings,...
Brno Communication Agent
Jurkovič, Juraj ; Fajčík, Martin (referee) ; Smrž, Pavel (advisor)
The goal of this thesis is explore and subsequently apply techniques and technical solutions in development of information agents. Thesis primarily focuses on solving individual sub tasks using state of the art systems, interconnecting these systems, their adoption for specific domain and implementation of individual modules of communication agent system. User interface is based on multi-platform chat application Telegram. Information extraction from user input is executed by Dialogflow. Several external services are used for user request fulfillment. Elasticsearch is used for searching structured data. For answering open domain questions from free text we use R-net implementation. The resulting can have both ,its knowledge base and range of requests it can fulfill, easily extended and can be deployed to chat platform of choice.
Deep contextualized word embeddings from character language models for neural sequence labeling
Lief, Eric ; Pecina, Pavel (advisor) ; Kocmi, Tom (referee)
A family of Natural Language Processing (NLP) tasks such as part-of- speech (PoS) tagging, Named Entity Recognition (NER), and Multiword Expression (MWE) identification all involve assigning labels to sequences of words in text (sequence labeling). Most modern machine learning approaches to sequence labeling utilize word embeddings, learned representations of text, in which words with similar meanings have similar representations. Quite recently, contextualized word embeddings have garnered much attention because, unlike pretrained context- insensitive embeddings such as word2vec, they are able to capture word meaning in context. In this thesis, I evaluate the performance of different embedding setups (context-sensitive, context-insensitive word, as well as task-specific word, character, lemma, and PoS) on the three abovementioned sequence labeling tasks using a deep learning model (BiLSTM) and Portuguese datasets. v
Semantic Enrichment Component
Doležal, Jan ; Otrusina, Lubomír (referee) ; Dytrych, Jaroslav (advisor)
This master's thesis describes Semantic Enrichment Component (SEC), that searches entities (e.g., persons or places) in the input text document and returns information about them. The goals of this component are to create a single interface for named entity recognition tools, to enable parallel document processing, to save memory while using the knowledge base, and to speed up access to its content. To achieve these goals, the output of the named entity recognition tools in the text was specified, the tool for storing the preprocessed knowledge base into the shared memory was implemented, and the client-server scheme was used to create the component.
Analysis and Data Extraction from a Set of Documents Merged Together
Jarolím, Jordán ; Bartík, Vladimír (referee) ; Kreslíková, Jitka (advisor)
This thesis deals with mining of relevant information from documents and automatic splitting of multiple documents merged together. Moreover, it describes the design and implementation of software for data mining from documents and for automatic splitting of multiple documents. Methods for acquiring textual data from scanned documents, named entity recognition, document clustering, their supportive algorithms and metrics for automatic splitting of documents are described in this thesis. Furthermore, an algorithm of implemented software is explained and tools and techniques used by this software are described. Lastly, the success rate of the implemented software is evaluated. In conclusion, possible extensions and further development of this thesis are discussed at the end.

National Repository of Grey Literature : 33 records found   beginprevious14 - 23next  jump to record:
Interested in being notified about new results for this query?
Subscribe to the RSS feed.