doc2sentences

A Python library that splits documents in various formats into sentences and converts them to CSV.


What doc2sentences can do for you

doc2sentences

The script can convert PDF, RTF and TXT files to CSV by splitting the files into sentences.

Install cli

To install and use from the command line:

git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
pip install -r requirements.txt

To use the script

The following command splits a PDF file into sentences and writes a CSV file. By default the CSV file has two columns: the sentence and the length of the sentence.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv
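To make the default output layout concrete, the sketch below writes a two-column CSV (sentence, length) from a list of sentences using only the standard library. It is not the d2s implementation: the helper name `sentences_to_csv`, the absence of a header row, and measuring length in characters are all assumptions made for illustration.

```python
import csv
import io


def sentences_to_csv(sentences, out):
    """Write one row per sentence: the sentence text and its length.

    Hypothetical helper, not part of doc2sentences. Length is counted in
    characters here, which is an assumption about what d2s reports.
    """
    writer = csv.writer(out)
    for sentence in sentences:
        writer.writerow([sentence, len(sentence)])


# Write to an in-memory buffer so the example needs no files on disk.
buffer = io.StringIO()
sentences_to_csv(["Hello world.", "This is a test."], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing an open file object instead of the `StringIO` buffer would write the same rows to disk.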

Extra columns, e.g. labels, can be added with the --columns command line argument. In the following command the resulting CSV file will have two extra columns with the values label1 and label2. The --overwrite flag overwrites the output file if it already exists.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --columns label1,label2 --overwrite
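As a rough illustration of the effect of the extra columns, the sketch below appends each comma-separated value as a constant to every row, which is handy for tagging all sentences that come from one document. The helper `add_columns` is hypothetical and only mirrors the behaviour described above. It is not a d2s function.

```python
def add_columns(rows, columns):
    """Append each comma-separated value in `columns` to every row.

    Hypothetical helper for illustration; `columns` has the same shape as
    the value given on the command line, e.g. "label1,label2".
    """
    extra = [value.strip() for value in columns.split(",") if value.strip()]
    return [list(row) + extra for row in rows]


rows = [["Hello world.", 12], ["This is a test.", 15]]
labelled = add_columns(rows, "label1,label2")
print(labelled)
```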

Using OCR

Text extraction fails on some PDF documents. For these documents you can try Optical Character Recognition (OCR) by passing the --ocr flag. OCR is used only if normal text extraction from the PDF does not work. N.B. This process is very error prone: some characters are not recognized, which results in wrong words.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --ocr
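The fallback order described above (normal extraction first, OCR only when it fails) can be sketched as follows. This is a minimal illustration, not the d2s code: `extract_with_fallback` and the two stand-in extractors are hypothetical, and treating "does not work" as "raises an error or returns only whitespace" is an assumption.

```python
def extract_with_fallback(path, extract_text, extract_ocr, use_ocr=False):
    """Return text from `path`, using OCR only when plain extraction fails.

    `extract_text` and `extract_ocr` are caller-supplied callables
    (hypothetical stand-ins for a PDF text extractor and an OCR engine).
    """
    try:
        text = extract_text(path)
    except Exception:
        # Treat any extraction error as "no text found" (an assumption).
        text = ""
    if text.strip() or not use_ocr:
        return text
    return extract_ocr(path)


# Stand-in extractors for demonstration only; they never open the file.
def empty_extractor(path):
    return ""


def fake_ocr(path):
    return "text recovered by OCR"


result = extract_with_fallback("/tmp/test.pdf", empty_extractor, fake_ocr, use_ocr=True)
print(result)
```

Passing the extractors in as arguments keeps the sketch independent of any particular PDF or OCR library.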

Semantically grouping sentences

In some scenarios single sentences do not capture enough semantics. d2s can use GPT large language models to group consecutive sentences together semantically. This is done with the --chunk option, with --maxtokens indicating the size of the chunk.

python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --chunk --maxtokens 100
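To give a feel for how a token limit bounds chunk size, the sketch below greedily merges consecutive sentences until a token budget would be exceeded. It is deliberately not semantic and does not call any GPT model, so it illustrates only the size constraint, not how d2s chooses groups. The function `group_sentences` is hypothetical, and counting whitespace-separated words as tokens is a simplifying assumption.

```python
def group_sentences(sentences, max_tokens):
    """Greedily merge consecutive sentences without exceeding max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption;
    GPT tokenizers count differently). A single sentence longer than the
    budget becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        size = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and used + size > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += size
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = group_sentences(["One two.", "Three four five.", "Six."], max_tokens=5)
print(chunks)
```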

Installation as package

To install doc2sentences from the GitHub repository, run:

git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
python -m pip install .

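Then from Python:

import doc2sentences.lib as d2s

PATH_TO_PDF = "/tmp/test.pdf"  # path to the PDF file to process
text = d2s.get_pdf_text(PATH_TO_PDF)  # extract the raw text from the PDF
sentences = d2s.get_sentences(text)  # split the text into sentences
print(sentences)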

Documentation

Include a link to your project's full documentation here.

Contributing

If you want to contribute to the development of doc2sentences, have a look at the contribution guidelines.

Credits

This package was created with Cookiecutter and the NLeSC/python-template.

Programming languages
  • Rich Text Format 93%
  • Python 6%

Participating organisations

Erasmus University Rotterdam
Netherlands eScience Center
Social Sciences & Humanities

Contributors

Reggie Cushing
Asya Zhelyazkova

Related projects

BackDem

Assessing democratic backsliding in Europe and its neighborhood


Related software

Netherlands eScience Center Python Template


Generic template for Python packages, so you can spend less time setting up and configuring, and comply with the Netherlands eScience Center Software Development Guide from the start.
