Description

doc2sentences

The script can convert PDF, RTF and TXT files to CSV by splitting the files into sentences.

Install CLI

To install and use from the command line:

git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
pip install -r requirements.txt

To use the script

The following command splits a PDF file into sentences and writes a CSV file. By default the CSV file has two columns: the sentence and the length of the sentence.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv
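To make the default output shape concrete, here is a minimal sketch of a sentence-and-length CSV. It is not the d2s implementation: the regex splitter, the helper names `split_sentences`, `to_rows` and `rows_to_csv`, and the choice to measure length in characters are all assumptions made for illustration.

```python
import csv
import io
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Assumption: a stand-in for whatever sentence splitter d2s really uses.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_rows(sentences):
    """Pair each sentence with its length (assumed here to be characters)."""
    return [(s, len(s)) for s in sentences]


def rows_to_csv(rows):
    """Serialize rows to CSV text in memory."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = "Democracy needs courts. Courts need independence!"
csv_text = rows_to_csv(to_rows(split_sentences(sample)))
print(csv_text)
```

A regex splitter like this breaks on abbreviations such as "e.g.", so treat it only as a picture of the two-column layout.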

Extra columns, e.g. labels, can be added with the --columns command line argument, which takes a comma-separated list of values. In the following command the resulting CSV file will have two extra columns with the values label1 and label2. The --overwrite flag overwrites the output file if it already exists.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --columns label1,label2 --overwrite
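The effect of the extra columns can be pictured as appending the same values to every row. The sketch below assumes exactly that; `add_columns` is a hypothetical helper, not a function of the package.

```python
def add_columns(rows, columns):
    """Append the comma-separated values in `columns` to every row.

    Assumption: mirrors how the extra-columns argument is described,
    with each value repeated unchanged on every row.
    """
    extras = [c.strip() for c in columns.split(",") if c.strip()]
    return [list(row) + extras for row in rows]


rows = [("First sentence.", 15), ("Second one.", 11)]
labelled = add_columns(rows, "label1,label2")
for row in labelled:
    print(row)
```

Constant label columns like these are convenient when sentence files from several documents are later concatenated into one labelled dataset.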

Using OCR

Some PDF documents are difficult to extract text from. For these documents you can try Optical Character Recognition (OCR) by passing the --ocr flag. OCR is only used if normal text extraction from the PDF does not work. N.B. This process is very error prone: some characters are not recognized, which results in wrong words.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --ocr
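The "OCR only as a fallback" behaviour can be sketched as follows. This is an illustration, not the d2s code: `extract_with_fallback` and both extractor callables are hypothetical, and treating an empty result or an exception as "does not work" is an assumption.

```python
def extract_with_fallback(path, extract_text, extract_ocr, use_ocr=False):
    """Return text from `extract_text`; fall back to OCR only when it yields nothing.

    `extract_text` and `extract_ocr` are hypothetical callables that take a
    path and return a string; they stand in for real PDF and OCR backends.
    """
    try:
        text = extract_text(path)
    except Exception:
        text = ""  # treat a failed extraction like an empty result
    if text.strip() or not use_ocr:
        return text
    return extract_ocr(path)


# Stub backends so the sketch runs without any PDF or OCR library.
def empty_extractor(path):
    return "   "


def fake_ocr(path):
    return "Text recovered by OCR."


result = extract_with_fallback("scan.pdf", empty_extractor, fake_ocr, use_ocr=True)
print(result)
```

Passing the extractors in as arguments keeps the fallback rule separate from whichever PDF and OCR libraries are actually used.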

Semantically grouping sentences

In some scenarios single sentences do not capture enough semantics. d2s can use GPT models to group consecutive sentences together semantically. This is done with the --chunk option, with --maxtokens setting the size of a chunk.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --chunk --maxtokens 100
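To show what a token limit means for grouping, the sketch below packs consecutive sentences greedily under a token budget. It deliberately swaps in a simpler technique: there is no GPT model and no semantic grouping, tokens are approximated by whitespace-separated words, and `chunk_sentences` is a hypothetical name.

```python
def chunk_sentences(sentences, max_tokens):
    """Greedily join consecutive sentences while staying within max_tokens.

    Assumption: tokens are counted as whitespace-separated words, which only
    approximates a real model tokenizer. A sentence longer than the budget
    still gets a chunk of its own.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "The court ruled.",
    "The ruling was appealed.",
    "Elections followed in May.",
]
chunks = chunk_sentences(sentences, max_tokens=7)
for chunk in chunks:
    print(chunk)
```

A semantic grouping can split at different points than this budget-only packing, since it also considers whether neighbouring sentences belong together.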

Installation as package

To install doc2sentences from the GitHub repository, run:

git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
python -m pip install .

Then, from Python:

import doc2sentences.lib as d2s

# PATH_TO_PDF is a placeholder: set it to the PDF you want to process.
text = d2s.get_pdf_text(PATH_TO_PDF)
sentences = d2s.get_sentences(text)
print(sentences)

Contributing

If you want to contribute to the development of doc2sentences, have a look at the contribution guidelines.

Credits

This package was created with Cookiecutter and the NLeSC/python-template.

Programming languages

Rich Text Format 93%
Python 6%

License

Apache-2.0

Participating organisations

Social Sciences & Humanities

Related projects

BackDem

Assessing democratic backsliding in Europe and its neighborhood

Finished

Related software

Netherlands eScience Center Python Template

Generic template for Python packages, so you can spend less time setting up and configuring, and comply with the Netherlands eScience Center Software Development Guide from the start.

Contributors

Contact person

Reggie Cushing

Netherlands eScience Center

0000-0002-5967-7302
