Netherlands eScience Center Python Template
Generic template for Python packages, so you can spend less time setting up and configuring, and comply with the Netherlands eScience Center Software Development Guide from the start.
A Python library that splits documents in various formats into sentences and writes the result to CSV.
The script converts PDF, RTF and TXT files to CSV by splitting their text into sentences.
To install and use from the command line:
git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
pip install -r requirements.txt
The following command splits a PDF file into sentences and writes a CSV file. By default the CSV file has two columns: the sentence and the length of the sentence.
python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv
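As an illustration of how the resulting file could be consumed, here is a minimal sketch that reads the two default columns back into Python with the standard library. The function name `read_sentence_rows` is hypothetical and not part of doc2sentences. The sketch assumes the sentence is in the first column and its length in the second, and it skips any row whose second field is not a number, since this README does not say whether a header row is written.

```python
import csv


def read_sentence_rows(csv_path):
    """Return (sentence, length) pairs from a d2s-style CSV file."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and any row whose second field is not a number
            # (for example a header row, if the file has one).
            if len(row) < 2 or not row[1].strip().isdigit():
                continue
            rows.append((row[0], int(row[1])))
    return rows
```

Calling `read_sentence_rows("/tmp/test.csv")` after the command above would give a list of tuples that can be filtered or passed on to other tools.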
Extra columns, for example labels, can be added with the --columns command-line argument. In the following command the resulting CSV file has two extra columns with the values label1 and label2. The --overwrite flag overwrites the output file if it already exists.
python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --columns label1,label2 --overwrite
Text is hard to extract from some PDF documents. For these documents you can try Optical Character Recognition (OCR) by passing the --ocr flag. OCR is only used if normal text extraction from the PDF does not work. N.B. This process is very error-prone: some characters are not recognized, which results in wrongly recognized words.
python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --ocr
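The fallback behaviour described above can be sketched roughly as follows. This is not the d2s implementation: `extract_with_fallback` and its two callable parameters are hypothetical, and the sketch assumes that "does not work" means the normal extractor raises an error or returns only whitespace.

```python
def extract_with_fallback(pdf_path, extract_text, extract_ocr):
    """Try plain text extraction first; use OCR only if it yields nothing."""
    try:
        text = extract_text(pdf_path)
    except Exception:
        # Treat an extraction error the same as "no text found".
        text = ""
    if text and text.strip():
        return text
    # Slow and error-prone path, used only as a last resort.
    return extract_ocr(pdf_path)
```

Passing the extractors in as arguments keeps the sketch independent of any particular PDF or OCR library.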
In some scenarios single sentences do not capture enough semantics. d2s can use GPT models to group consecutive sentences together semantically. This is done with the --chunk option, with --maxtokens indicating the size of the chunk.
python doc2sentences/d2s.py --inputfile /tmp/test.pdf --outputfile /tmp/test.csv --chunk --maxtokens 100
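To give a feel for what a token budget does, the sketch below groups consecutive sentences greedily until the budget is reached. It is a deliberately simpler technique than the GPT-based semantic grouping that d2s uses: it counts whitespace-separated words as tokens and does not look at meaning at all. The function name `chunk_sentences` is hypothetical.

```python
def chunk_sentences(sentences, max_tokens=100):
    """Greedily merge consecutive sentences into chunks of about max_tokens.

    A single sentence longer than max_tokens is kept whole as its own chunk.
    """
    chunks = []
    current = []
    current_tokens = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # crude whitespace token count
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A real tokenizer would count tokens differently from `str.split`, so chunk boundaries from this sketch will not match those produced by the --chunk option.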
To install doc2sentences from the GitHub repository, do:
git clone git@github.com:backdem/doc2sentences.git
cd doc2sentences
python -m pip install .
Then, from Python:
import doc2sentences.lib as d2s
# Placeholder path; point this at your own PDF file.
PATH_TO_PDF = "/tmp/test.pdf"
text = d2s.get_pdf_text(PATH_TO_PDF)
sentences = d2s.get_sentences(text)
print(sentences)
If you want to contribute to the development of doc2sentences, have a look at the contribution guidelines.
This package was created with Cookiecutter and the NLeSC/python-template.
Assessing democratic backsliding in Europe and its neighborhood