Extrakce informaci ze strukturovanych dokumentu pomoci metod uceni a podobnosti

Holeček, Martin

The automation of document processing is gaining recent attention due to the great potential to reduce manual work through improved methods and hardware. In this area, neural networks have been applied before - even though they have been trained only on relatively small datasets with hundreds of documents so far. To successfully explore deep learning techniques and improve the information ex- traction results, a dataset with more than twenty-five thousand documents (pro forma invoices, invoices and debit note documents) has been compiled, anonymized and is published as a complement of this work. In the first part of the research, we will examine the documents from the point of view of table detection, present a survey on table detection methods and ultimately rephrase the table detection as a text box labelling problem to optimize micro F1 score of per-word classification. We will show that we can extract specific information from structurally different tables or table-like structures with one trainable model that features a comprehen- sive representation of a page using graph over word-boxes, positional embeddings and trainable textual features. The first part is concluded with a novel neural network model that beats multiple baselines and achieves strong, practical results on the presented dataset....

host :: přihlásit Digitální repozitář
		Hledej		Nový záznam		Nápověda		O repozitáři