National Repository of Grey Literature: 13 records found, showing records 1 - 10.
Automatic speech transcription with code-switching support
Bílek, Štěpán ; Karafiát, Martin (reviewer) ; Szőke, Igor (supervisor)
This thesis deals with automatic speech recognition. It focuses on recognizing audio that contains multilingual utterances, i.e. code-switching. The lack of multilingual training data is addressed by combining English and German recordings. To approximate real bilingual speech as closely as possible, part of the datasets is created by joining recordings of similar speakers. The Whisper model is trained and evaluated on the created data. In its original, non-adapted version it reaches an error rate of up to 70 %, whereas the best models trained on the combined datasets achieve an error rate of just over 7 %. The results of this work show how to train such models so that they achieve the best possible performance.
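For illustration only: a minimal sketch of how an off-the-shelf Whisper model can be scored on a single recording, using the Hugging Face transformers library and jiwer for word error rate. The model size, audio path, and reference transcript are placeholders, not data from the thesis.

```python
# Minimal WER check for an unadapted Whisper model on one recording.
# Assumes: pip install transformers librosa jiwer; all paths and texts are placeholders.
import librosa
from jiwer import wer
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "openai/whisper-small"           # hypothetical choice of model size
AUDIO_PATH = "mixed_en_de_utterance.wav"      # placeholder bilingual recording
REFERENCE = "i think das ist eine gute idee"  # placeholder reference transcript

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Whisper expects 16 kHz mono audio converted to log-Mel features.
audio, _ = librosa.load(AUDIO_PATH, sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate a transcript; the language is left unset, which is exactly where
# unadapted models struggle on code-switched speech.
predicted_ids = model.generate(inputs.input_features)
hypothesis = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("hypothesis:", hypothesis)
print("WER:", wer(REFERENCE, hypothesis.lower()))
```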
Aligning pre-trained models for spoken language translation
Sedláček, Šimon ; Beneš, Karel (reviewer) ; Kesiraju, Santosh (supervisor)
In this work, we investigate a novel approach to end-to-end speech translation (ST) by leveraging pre-trained models for automatic speech recognition (ASR) and machine translation (MT) and connecting them with a small connector module (Q-Former, STE). The connector bridges the gap between the speech and text modalities, transforming the ASR encoder embeddings into the latent representation space of the MT encoder. During training, the foundation ASR and MT models are frozen, and only the connector parameters are tuned, optimizing for the ST objective. We train and evaluate our models on the How2 English to Portuguese ST dataset. In our experiments, aligned systems outperform our cascade ST baseline while utilizing the same foundation models. Additionally, while keeping the size of the connector module constant and small in comparison (10M parameters), increasing the size and capability of the ASR encoder and MT decoder universally improves translation results. We find that the connectors can also serve as domain adapters for the foundation models, significantly improving translation performance in the aligned ST setting, compared even to the base MT scenario. Lastly, we propose a pre-training procedure for the connector, with the potential for reducing the amount of ST data required for training similar aligned systems.
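To make the alignment idea concrete, here is a minimal sketch of a trainable connector that maps frozen ASR encoder states into the embedding space of a frozen MT model; it is not the thesis' Q-Former or STE code, and all dimensions and module choices are assumptions.

```python
# A minimal "connector" sketch: length-reducing convolution plus a projection
# from the ASR encoder's hidden size to the MT model's hidden size.
import torch
import torch.nn as nn

class Connector(nn.Module):
    def __init__(self, asr_dim=768, mt_dim=1024, stride=4):
        super().__init__()
        # Subsample the speech frame rate and project to the MT hidden size.
        self.subsample = nn.Conv1d(asr_dim, mt_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.GELU(), nn.Linear(mt_dim, mt_dim), nn.LayerNorm(mt_dim))

    def forward(self, asr_states):             # (batch, frames, asr_dim)
        x = self.subsample(asr_states.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                    # (batch, frames // stride, mt_dim)

# Toy forward pass with random tensors standing in for frozen ASR encoder outputs.
connector = Connector()
fake_asr_states = torch.randn(2, 200, 768)     # 2 utterances, 200 frames each
aligned = connector(fake_asr_states)
print(aligned.shape)                           # torch.Size([2, 50, 1024])
```

In training, only connector.parameters() would be handed to the optimizer, while the ASR encoder and MT model stay frozen (requires_grad_(False)), mirroring the setup described in the abstract.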
Semi-Supervised Speech-to-Text Recognition with Text-to-Speech Critic
Baskar, Murali Karthick ; Manohar, Vimal (reviewer) ; Trmal, Jan (reviewer) ; Burget, Lukáš (supervisor)
Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of training data to attain good performance. For this reason, unsupervised and semi-supervised training of seq2seq models has recently witnessed a surge in interest. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with text-to-speech (TTS) models. This thesis first proposes a new semi-supervised modelling framework combining an end-to-end differentiable ASR->TTS loss with a TTS->ASR loss. The method is able to leverage unpaired speech and text data to outperform recently proposed related techniques in terms of word error rate (WER). We provide extensive results analysing the impact of data quantity as well as the contribution of the speech and text modalities in recovering errors, and show consistent gains across the WSJ and LibriSpeech corpora. The thesis also discusses the limitations of the ASR<->TTS model under out-of-domain data conditions. We propose an enhanced ASR<->TTS (EAT) model incorporating two main features: 1) the ASR->TTS pipeline is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS; and 2) a speech regularizer trained in an unsupervised fashion is introduced into TTS->ASR to correct the synthesized speech before sending it to the ASR model. Training strategies and the effectiveness of the EAT model are explored and compared with augmentation approaches. The results show that EAT reduces the performance gap between supervised and semi-supervised training by an absolute WER improvement of 2.6% and 2.7% on LibriSpeech and BABEL, respectively.
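The cycle-consistency idea can be sketched as two loss terms over unpaired data. The function below uses hypothetical asr, tts, feature_loss, and ce_loss interfaces purely to show where each loss comes from; it is not the thesis' actual training code.

```python
# High-level sketch of one semi-supervised step combining an ASR->TTS loss
# (unpaired speech) with a TTS->ASR loss (unpaired text).
# `asr`, `tts`, `feature_loss`, and `ce_loss` are hypothetical interfaces.

def cycle_step(unpaired_speech, unpaired_text, asr, tts, feature_loss, ce_loss, alpha=1.0):
    # ASR -> TTS: transcribe speech that has no reference text, re-synthesize it,
    # and require the synthesized features to match the original speech features.
    hyp_tokens = asr.transcribe(unpaired_speech.features)
    resynth = tts.synthesize(hyp_tokens, speaker=unpaired_speech.speaker_embedding)
    loss_asr_tts = feature_loss(resynth, unpaired_speech.features)

    # TTS -> ASR: synthesize speech for text that has no audio,
    # then train ASR to recover exactly that text from the synthetic speech.
    synth_speech = tts.synthesize(unpaired_text.tokens)
    loss_tts_asr = ce_loss(asr.logits(synth_speech), unpaired_text.tokens)

    return loss_asr_tts + alpha * loss_tts_asr
```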
Integration of speech technologies into mobile platforms
Černičko, Sergij ; Černocký, Jan (reviewer) ; Schwarz, Petr (supervisor)
The goal of this thesis is to become familiar with the methods and techniques used in speech processing, to describe the current state of research and development of speech technologies, to design and implement a server-based speech recognizer built on BSAPI, and to integrate a client that uses this server for speech recognition into the mobile dictionaries of Lingea.
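BSAPI is a proprietary SDK, so only the client side of such a setup is sketched below: a thin HTTP client that uploads audio to a recognition server and reads back the transcript. The endpoint URL and response fields are assumptions, not the actual BSAPI or Lingea interface.

```python
# Hypothetical thin client for a server-based recognizer: upload a WAV file,
# get a transcript back. The endpoint and JSON fields are placeholders.
import requests

SERVER_URL = "http://localhost:8080/recognize"   # placeholder server address

def recognize(wav_path, language="cs"):
    with open(wav_path, "rb") as f:
        response = requests.post(
            SERVER_URL,
            params={"lang": language},
            data=f,
            headers={"Content-Type": "audio/wav"},
            timeout=30,
        )
    response.raise_for_status()
    return response.json().get("transcript", "")

if __name__ == "__main__":
    print(recognize("query.wav"))
```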
Finite State Grammars and Language Models for Automatic Speech Recognition
Beneš, Karel ; Glembek, Ondřej (reviewer) ; Hannemann, Mirko (supervisor)
This thesis deals with the transformation of Context-Free Grammars (CFGs) into Weighted Finite-State Transducers (WFSTs). A subset of CFGs that can be transformed exactly is identified; both a test of whether a given CFG satisfies this condition and an algorithm for the subsequent transformation are presented. A tool performing both tasks has been implemented, and its input and output processing is described. Using this tool, a speech recognition system for aircraft cockpit control has been built. The presented results show that the system based on the transformed grammar outperforms a system based on a general-purpose language model.
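For the regular subset of such grammars, the target representation can be illustrated with the pynini bindings for OpenFst: a tiny command grammar is built by union and concatenation, then determinized and minimized. This only shows the finite-state end product, not the thesis' CFG-to-WFST transformation algorithm, and the example commands are invented.

```python
# Compile a tiny cockpit-style command grammar into an optimized finite-state
# acceptor with pynini (OpenFst bindings). Commands are made up for illustration.
import pynini

verb = pynini.accep("select") | pynini.accep("tune")
target = pynini.accep("radio one") | pynini.accep("radio two")
space = pynini.accep(" ")

# The grammar generates a finite language, so it maps directly onto an acceptor.
grammar = (verb + space + target).optimize()   # epsilon-removal, determinize, minimize

print("states:", grammar.num_states())

# Membership test: compose a candidate string with the grammar and check
# whether any accepting path survives.
candidate = pynini.accep("tune radio two")
print("accepted:", (candidate @ grammar).num_states() > 0)
```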
Finite-state based recognition networks for forward-backward speech decoding
Hannemann, Mirko ; Schlüter, Ralf (reviewer) ; Novák, Miroslav (reviewer) ; Burget, Lukáš (supervisor)
Many tasks can be formulated in the mathematical framework of weighted finite-state transducers (WFSTs). This is also the case for automatic speech recognition (ASR). Nowadays, ASR makes extensive use of composed probabilistic models -- called decoding graphs or recognition networks. They are constructed from the individual components via WFST operations such as composition. Each component is a probabilistic knowledge source that constrains the search for the best path through the composed graph -- called decoding. The use of a coherent framework guarantees that the resulting automata will be optimal in a well-defined sense. WFSTs can be optimized with the help of determinization and minimization in a given semiring. The application of these algorithms results in the optimal structure for search, and the optimal distribution of weights is achieved by applying a weight-pushing algorithm. The goal of this thesis is to further develop the recipes and algorithms for the construction of optimal recognition networks. We introduce an alternative weight-pushing algorithm that is suitable for an important class of models -- language model transducers, or more generally cyclic WFSTs and WFSTs with failure (back-off) transitions. We also present a recipe to construct recognition networks that are suitable for decoding backwards in time and that, at the same time, are guaranteed to give exactly the same probabilities as the forward recognition network. For that purpose, we develop an algorithm for the exact reversal of back-off language models and their corresponding language model transducers. We apply these backward recognition networks in an optimization technique: in a static network decoder, we use them in a two-pass decoding setup (forward search and backward search). This approach, called tracked decoding, allows the first decoding pass to be incorporated into the second pass by tracking hypotheses from the first-pass lattice. This technique results in significant speed-ups, since it allows decoding with a variable beam width that is most of the time much smaller than the baseline beam. We also show that it is possible to apply the algorithms in a dynamic network decoder by using an incrementally refining recognition setup, which additionally leads to a partial parallelization of the decoding.
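To make the weight-pushing step concrete, here is a small self-contained sketch of the classical algorithm in the tropical (min, +) semiring: each state gets its shortest distance to a final state, and every arc is reweighted so that path weights are preserved while weight moves toward the start state. The alternative algorithm the thesis proposes for cyclic, back-off LM transducers is more involved; this only shows the textbook baseline on a toy graph.

```python
# Standard weight pushing in the tropical semiring on a toy acyclic WFST.
# ARCS maps state -> list of (next_state, weight); FINALS maps final states to final weights.
ARCS = {
    0: [(1, 3.0), (2, 1.0)],
    1: [(3, 0.5)],
    2: [(3, 4.0)],
    3: [],
}
FINALS = {3: 0.0}

def shortest_to_final(arcs, finals):
    # Bellman-Ford style relaxation of d[q] = min over outgoing arcs of (w + d[next]).
    dist = {q: float("inf") for q in arcs}
    dist.update(finals)
    for _ in range(len(arcs)):
        for q, out in arcs.items():
            for nxt, w in out:
                dist[q] = min(dist[q], w + dist[nxt])
    return dist

def push_weights(arcs, finals, start=0):
    d = shortest_to_final(arcs, finals)
    # Reweight each arc: w'(q -> r) = w + d[r] - d[q]; path weights are preserved
    # up to the constant d[start], which becomes the initial weight.
    pushed = {q: [(nxt, w + d[nxt] - d[q]) for nxt, w in out] for q, out in arcs.items()}
    return pushed, d[start]

pushed_arcs, initial_weight = push_weights(ARCS, FINALS)
print(pushed_arcs)      # arc 0->1 becomes 0.0 (on the best path), arc 0->2 becomes 1.5
print(initial_weight)   # 3.5, the weight of the overall shortest path
```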
Analysis of video recordings of financial market news
Mikula, Michal
This thesis deals with the analysis of video recordings of financial market news. Many media outlets in the financial sphere publish information via video with increasing frequency, and in some cases even prefer this format. Manual analysis of these recordings is very time-consuming, so the thesis focuses on building a tool that enables their automatic analysis. Two main areas are addressed: automatic speech recognition for obtaining video transcripts, and natural language processing for the textual analysis of a given video. The textual analysis of a transcript covers sentiment analysis, text summarization, and key-phrase extraction.
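One lightweight way to prototype the text-analysis half of such a pipeline on an already obtained transcript is sketched below with Hugging Face pipeline objects and a naive frequency-based key-phrase picker. The default (English) models and the example transcript are placeholders; the thesis' tool may use entirely different components.

```python
# Sketch: text analysis of an ASR transcript - sentiment, summary, key phrases.
# The pipeline defaults are English models and serve only as placeholders.
from collections import Counter
from transformers import pipeline

transcript = (
    "Markets rallied today as the central bank kept rates unchanged. "
    "Tech stocks led the gains while energy shares lagged behind."
)

sentiment = pipeline("sentiment-analysis")
summarizer = pipeline("summarization")

print(sentiment(transcript)[0])                                         # label + score
print(summarizer(transcript, max_length=30, min_length=5)[0]["summary_text"])

# Naive key-phrase extraction: most frequent non-stopword tokens.
STOPWORDS = {"the", "as", "while", "today", "behind", "led"}
tokens = [t.strip(".,").lower() for t in transcript.split()]
keyphrases = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 3)
print(keyphrases.most_common(5))
```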
Out-of-Vocabulary Words Detection and Recovery
Egorova, Ekaterina ; Hannemann, Mirko (reviewer) ; Schaaf, Thomas (reviewer) ; Černocký, Jan (supervisor)
The thesis explores the field of out-of-vocabulary (OOV) word processing within the task of automatic speech recognition (ASR). It defines two separate OOV processing tasks, detection and recovery, and proposes success metrics for both. Different approaches to OOV detection and recovery are presented within the frameworks of hybrid and end-to-end (E2E) ASR, and these approaches are compared on the open-access LibriSpeech database to facilitate replicability. The hybrid approach uses a modified decoding graph with phoneme substrings and utilizes full lattice representations for the detection and recovery of recurrent OOVs. Recovered OOVs are added to the dictionary and the language model (LM) to improve ASR system performance. The second approach employs inner representations of a word-predicting Listen, Attend and Spell (LAS) E2E system to perform the OOV detection task; detection recall and precision improved drastically in comparison with the hybrid approach. Recurrent OOV recovery is performed on a separate character-predicting system with the use of the detected time frames and probabilistic clustering. Finally, we propose a new speller architecture capable of learning OOV representations together with the word-predicting network (WPN) training. The speller forces word embeddings to be spelling-aware during training and thus not only provides OOV recovery but also improves WPN performance.
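A heavily reduced version of the detection idea, independent of any particular toolkit, is to scan a confidence-annotated hypothesis for words that fall outside the vocabulary or are recognized with low confidence. The sketch below does exactly that with made-up data and thresholds, whereas the thesis works on full lattices and on E2E model internals.

```python
# Toy OOV-candidate detection over a confidence-annotated 1-best hypothesis.
# The lexicon, hypothesis, and threshold are invented for illustration.
LEXICON = {"the", "meeting", "is", "in", "room", "at", "noon"}

def detect_oov_candidates(hypothesis, lexicon, min_confidence=0.6):
    """hypothesis: list of (word, start_time, end_time, confidence) tuples."""
    candidates = []
    for word, start, end, conf in hypothesis:
        if word.lower() not in lexicon or conf < min_confidence:
            candidates.append({"word": word, "span": (start, end), "confidence": conf})
    return candidates

hyp = [
    ("the", 0.00, 0.12, 0.98),
    ("meeting", 0.12, 0.55, 0.95),
    ("is", 0.55, 0.70, 0.97),
    ("in", 0.70, 0.80, 0.96),
    ("kubrickroom", 0.80, 1.40, 0.41),   # likely an OOV region
    ("at", 1.40, 1.52, 0.93),
    ("noon", 1.52, 1.90, 0.94),
]
print(detect_oov_candidates(hyp, LEXICON))
```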
Neural networks for automatic speaker, language, and sex identification
Do, Ngoc ; Jurčíček, Filip (supervisor) ; Peterek, Nino (reviewer)
Title: Neural networks for automatic speaker, language, and sex identification Author: Bich-Ngoc Do Department: Institute of Formal and Applied Linguistics Supervisors: Ing. Mgr. Filip Jurčíček, Ph.D., Institute of Formal and Applied Linguistics, and Dr. Marco Wiering, Institute of Artificial Intelligence and Cognitive Sciences, Faculty of Mathematics and Natural Sciences, University of Groningen Abstract: Speaker recognition is a challenging task with applications in many areas, for example authentication or forensic science. In recent years, deep learning, and deep neural networks in particular, have become widespread; they have proved to be a capable machine learning technique and have achieved excellent results in many areas of natural language processing and speech processing research. This thesis aims to explore the potential of deep neural network models, specifically recurrent neural networks, for the speaker recognition task. The proposed systems were evaluated on the TIMIT corpus for the speaker identification task. Compared with other systems under the same test conditions, our system did not reach the reference results due to a lack of validation data. Our experiments showed that the best system configuration is...
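As a rough illustration of the kind of model explored (not the thesis' exact architecture), the sketch below defines a small LSTM classifier in PyTorch that maps a sequence of acoustic feature frames to one of N speaker identities; the 39-dimensional features and the 630-speaker output layer are assumptions.

```python
# Minimal recurrent speaker-identification model: feature frames in, speaker id out.
import torch
import torch.nn as nn

class SpeakerRNN(nn.Module):
    def __init__(self, feat_dim=39, hidden=256, num_speakers=630):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_speakers)

    def forward(self, frames):                 # (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)
        return self.classifier(h_n[-1])        # logits over speaker identities

model = SpeakerRNN()
utterances = torch.randn(4, 300, 39)           # 4 utterances, 300 frames each
logits = model(utterances)
print(logits.shape, logits.argmax(dim=1))      # (4, 630) and the predicted speakers
```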
