TFG (Bachelor's Thesis): Visual Document Understanding
Mar 2025
Finished
Bachelor's thesis on Visual Document Understanding (VDU): extracting structured data (JSON) from document images. It implements and compares three approaches: (1) an Azure OCR + LLM pipeline for prompt-guided extraction, (2) custom OCR models fine-tuned for the specific task, and (3) DONUT (Document Understanding Transformer, by Clova AI), an end-to-end encoder-decoder transformer that processes the document image directly and generates structured output. DONUT is trained with PyTorch Lightning, using a custom tokenizer and image processor. The project also includes ground-truth validation, edit-distance metrics, and scripts to sync with a private company repository. Everything is containerized with Docker Compose.
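As an illustration of the edit-distance validation mentioned above, here is a minimal sketch in plain Python. It is not the thesis code: the function names (`levenshtein`, `canonical`, `normalized_edit_distance`) and the choice to compare canonically serialized JSON strings are assumptions made for the example.

```python
import json


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def canonical(doc: dict) -> str:
    """Serialize with sorted keys so key order does not affect the score (assumed convention)."""
    return json.dumps(doc, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


def normalized_edit_distance(prediction: dict, ground_truth: dict) -> float:
    """Edit distance divided by the longer string length: 0.0 means identical."""
    pred_text, truth_text = canonical(prediction), canonical(ground_truth)
    longest = max(len(pred_text), len(truth_text))
    return levenshtein(pred_text, truth_text) / longest if longest else 0.0
```

Because the score is normalized to the range 0 to 1, it can be averaged across documents of different lengths, which makes the three approaches comparable on the same validation set.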
AI
Docker
HuggingFace
Jupyter
Matplotlib
NumPy
Pandas
Plotly
Python
PyTorch
