
document-segmentation

Tool for identifying document boundaries and categories from VOC inventories.


Description

Document Segmentation


Overview

This repository provides tooling for processing VOC inventories to

  1. extract document boundaries and
  2. classify documents

For each of these tasks, two scripts exist:

  1. one to train a new model
  2. one to apply a trained model

Definitions and Workflow

  • An inventory is a collection of pages.
  • A document is a subset of such pages; a document can start and end on the same page, or stretch over hundreds of pages.
  • A document falls into a TANAP category, as defined in the Tanap class in label.py (see the sketch below).
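
For orientation, such a category label is typically implemented as an enumeration. The snippet below is a minimal, hypothetical sketch only; the member names are placeholders, and the actual TANAP categories are those defined in label.py.

from enum import Enum

class Tanap(Enum):
    # Placeholder members for illustration only;
    # the real category names are defined in label.py.
    CATEGORY_A = 1
    CATEGORY_B = 2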

There are two separate tasks defined in this repository:

  • segmenting inventories: identify the boundaries of individual documents inside an inventory
  • classifying documents: identify the category of a document

Training and Applying Models

For each task, there is a script to train a model; see below for instructions on installing the prerequisites and running the scripts.

Both training scripts produce a model file; run either script with the --help argument to see its specific arguments.

To apply a model produced by the respective training script, call:

  • extract_docs.py for extracting documents from an inventory
  • predict_inventories.py is a variation of the above; it applies the document segmentation model to multiple inventories and can optionally generate a CSV file with thumbnails for human evaluation.
  • TODO: classify_documents.py for classifying documents

As above, run any of the scripts with the --help argument to get the specific usage.
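
For instance, assuming extract_docs.py lives in the scripts directory like the other scripts in this repository:

poetry run python scripts/extract_docs.py --help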

Prerequisites

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Or:

pipx install poetry

Also see the Poetry documentation.

Install the dependencies

poetry install

Usage

To train a model, run the scripts/train_model.py script. It downloads the necessary data from the HUC server into a local temporary directory.

Set your HUC credentials in the HUC_USER and HUC_PASSWORD environment variables or in settings.py, and run the script.

HUC_USER=... HUC_PASSWORD=... poetry run python scripts/train_model.py

Without the credentials, the script is not able to download the inventories, but can proceed with previously downloaded ones. Add the --help flag to see all available options.
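
For reference, reading these credentials from the environment boils down to something like the following. This is only a sketch under the assumption that settings.py falls back to environment variables; its actual structure may differ.

import os

# Hypothetical sketch: read the credentials from the environment,
# falling back to empty strings when the variables are not set.
HUC_USER = os.environ.get("HUC_USER", "")
HUC_PASSWORD = os.environ.get("HUC_PASSWORD", "")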

To extract the documents of one or more inventories using a previously trained model, use the scripts/predict_inventories.py script, for instance:

poetry run python scripts/predict_inventories.py --model model.pt --inventory 1547,1548 --output 1547_1548.csv

Missing inventories are downloaded from the HUC server if the HUC_USER and HUC_PASSWORD environment variables are provided.

Add the --help flag to see all available options.
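
To get a quick look at the resulting CSV, something like the following works, assuming pandas is available in your environment; the exact columns written by the script are not documented here.

import pandas as pd

# Load the predictions written by predict_inventories.py and print a preview.
predictions = pd.read_csv("1547_1548.csv")
print(predictions.head())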

Development Instructions

This project uses

  • Python version >= 3.9 and <= 3.12
  • Poetry for package management
  • PyTest for unit testing
  • Ruff for linting and formatting
  • Pre-commit for managing pre-commit hooks

Install Development Dependencies

poetry install --with=dev

Set up pre-commit hooks

poetry run pre-commit install

Run Tests

poetry run pytest
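
Linting and formatting can be run through Poetry as well, for instance with the standard Ruff invocations below; the exact checks enforced by the pre-commit hooks and CI may differ.

poetry run ruff check .
poetry run ruff format .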

Architecture

Both document segmentation and classification are based on page embeddings (defined in the PageEmbedding class) and region embeddings (defined in the RegionEmbedding class). The models are implemented in the PageSequenceTagger class, used for document boundary detection, and the DocumentClassifier class, used for document type classification; both are subclasses of the AbstractPageLearner class (see the diagram below).

The Inventory class is the main data class. It holds sequences of pages and labels; the Document class inherits from it in order to use a different set of labels.
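
In outline, the inheritance relations described above look roughly as follows. This is a simplified sketch with methods and constructor signatures omitted; see the class diagram below for the actual structure.

# Simplified sketch of the class relations described above, not the actual definitions.
class AbstractPageLearner: ...

class PageSequenceTagger(AbstractPageLearner): ...  # document boundary detection

class DocumentClassifier(AbstractPageLearner): ...  # document type classification

class Inventory: ...  # holds sequences of pages and labels

class Document(Inventory): ...  # same page sequences, different labels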

The Sheet class and its subclasses are used for reading and processing the annotated data from CSV/Excel sheets stored in the annotations directory.

(Hyper-)parameters such as the layer sizes and the language model are defined in settings.py.

Class Diagram

(class diagram image: classes.svg)

Run this command to update the class diagram:

poetry run pyreverse --output svg --colorized document_segmentation
Programming languages
  • Jupyter Notebook 86%
  • Python 14%
License
Not specified
Source code

Participating organisations

Netherlands eScience Center
KNAW Humanities Cluster
Huygens Instituut

Contributors

Carsten Schnober
Lead RSE
Netherlands eScience Center

Related projects

LAHTeR

Leveraging AI for HTR post-correction

Updated 12 months ago
Finished

Related software

htr-quality-classifier


A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).

Updated 24 months ago