Miquel Gómez Corral

TFG: Visual Document Understanding

Mar 2025

Finished

Bachelor's thesis on Visual Document Understanding (VDU): extracting structured data (JSON) from document images. It implements and compares three approaches: (1) an Azure OCR + LLM pipeline for prompt-guided extraction, (2) custom OCR models fine-tuned for the specific task, and (3) DONUT (Document Understanding Transformer, by Clova AI), an end-to-end encoder-decoder transformer that processes the document image directly and generates structured output. DONUT is trained with PyTorch Lightning, using a custom tokenizer and image processor. The project also includes validation against ground truth, edit-distance metrics, and scripts for syncing with a private company repository. Everything is containerized with Docker Compose.
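As an illustration of how an edit-distance metric can score extracted JSON against ground truth, here is a minimal sketch. It is not the thesis code: the function names (`levenshtein`, `normalized_similarity`, `score_fields`) and the per-field normalization are assumptions made for this example, and it only handles flat dictionaries.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(pred: str, truth: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means an exact match."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(pred, truth) / longest


def score_fields(pred: dict, truth: dict) -> dict:
    """Score each ground-truth field; a missing prediction counts as ''.

    Hypothetical helper: assumes flat dicts with string-like values.
    """
    return {
        key: normalized_similarity(str(pred.get(key, "")), str(value))
        for key, value in truth.items()
    }


if __name__ == "__main__":
    truth = {"total": "12.50", "date": "2025-03-01"}
    pred = {"total": "12.50"}
    print(score_fields(pred, truth))
```

Scoring per field rather than on the whole serialized JSON makes it easier to see which fields a given approach tends to get wrong, though either choice is reasonable.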

Technologies

Docker

HuggingFace

Jupyter

Matplotlib

NumPy

Pandas

Plotly

Python

PyTorch

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image.png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (1).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (2).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (3).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (4).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (5).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (6).png

https://raw.githubusercontent.com/MiquelGomezCorral/TFG_Miquel/main/readme-images/image (7).png