nedextract is a Python library that helps you extract information on person names and organisations mentioned in Dutch (annual report) PDF files.
nedextract is developed to extract specific information from annual report PDF files that are written in Dutch. Currently it tries to do the following:
Read the PDF file and perform Named Entity Recognition (NER) using Stanza to extract all persons and all organisations named in the document. These are then processed by the steps listed below.
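The NER step can be sketched roughly as follows. This is a minimal illustration, not nedextract's own code: it assumes the PDF text has already been extracted into a string, that the Stanza Dutch models have been downloaded, and that the Dutch NER model labels persons as PER and organisations as ORG. The function names group_entities and extract_entities are hypothetical.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, type) pairs into persons and organisations, dropping duplicates."""
    grouped = defaultdict(list)
    for text, ent_type in entities:
        # Keep only persons and organisations; other types (e.g. LOC) are ignored.
        if ent_type in ("PER", "ORG") and text not in grouped[ent_type]:
            grouped[ent_type].append(text)
    return grouped["PER"], grouped["ORG"]


def extract_entities(text):
    """Run Stanza's Dutch NER on text and return (persons, organisations)."""
    # Imported lazily: requires `pip install stanza` and a one-off stanza.download("nl").
    import stanza

    nlp = stanza.Pipeline(lang="nl", processors="tokenize,ner")
    doc = nlp(text)
    return group_entities((ent.text, ent.type) for ent in doc.ents)
```

Calling extract_entities on the text of a report would return two lists of candidate names, which the later steps then filter and analyse.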
Extract persons: using a rule-based method that searches for specific keywords, this module tries to identify:
Ambassadors
People in important positions in the organisation. The code tries to determine a main job description (e.g. director or board) and a sub-job description (e.g. chairman or treasurer). Note that these positions are identified and output in Dutch.
The main jobs that are considered are:
The sub positions that are considered are:
For each person that is identified, the code determines the main position by searching for keywords in the sentences in which the name appears, or in the sentence directly before or after. Sub-jobs are determined based on words appearing directly before or after the name of a person for whom a main job has been determined. For both the main jobs and the sub-positions, the keywords cover various ways of writing. Before the job identification starts, name deduplication is performed by creating lists of names that (likely) refer to one and the same person (e.g. Jane Doe and J. Doe).
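As a rough illustration of the two ideas described above (name deduplication and the keyword search in neighbouring sentences), consider the sketch below. It is a simplification under stated assumptions: the keyword dictionary is a tiny hypothetical example and not the library's real keyword list, the helper names same_person and find_main_job are invented here, and the first keyword hit is simply returned, whereas the actual rules may weigh matches differently.

```python
import re

# Illustrative keywords only; the library's real keyword lists are not reproduced here.
MAIN_JOB_KEYWORDS = {
    "directeur": ["directeur", "directie"],
    "bestuur": ["bestuur", "bestuurslid"],
}


def same_person(name_a, name_b):
    """Return True if two names plausibly refer to one person, e.g. 'Jane Doe' and 'J. Doe'."""
    a, b = name_a.lower().split(), name_b.lower().split()
    if len(a) < 2 or len(b) < 2 or a[-1] != b[-1]:
        return False  # the last name parts must be identical
    first_a, first_b = a[0].rstrip("."), b[0].rstrip(".")
    # Either identical first names, or one is the initial of the other.
    return first_a == first_b or (
        min(len(first_a), len(first_b)) == 1 and first_a[0] == first_b[0]
    )


def find_main_job(sentences, names):
    """Search sentences mentioning any of the names, plus their direct neighbours, for keywords."""
    for i, sentence in enumerate(sentences):
        if not any(name in sentence for name in names):
            continue
        # Window: the previous sentence, the sentence itself, and the next one.
        window = " ".join(sentences[max(i - 1, 0): i + 2]).lower()
        for job, keywords in MAIN_JOB_KEYWORDS.items():
            if any(re.search(rf"\b{re.escape(k)}\b", window) for k in keywords):
                return job
    return None
```

Passing all spellings of one person's name (for example both "Jane Doe" and "J. Doe") as names lets a single search cover every mention of that person.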
Extract related organisations:
Identified organisations are matched against a list of organisations provided through the anbis argument, to collect their RSIN number for further analysis. An empty file ./Data/Anbis_clean.csv is available that serves as a template for such a file. Matching is attempted both on currentStatutoryName and shortBusinessName. Only full matches (independent of capitals) and full matches with the additional term 'Stichting' at the start of the identified organisation (again independent of capitals) are considered for matching. Fuzzy matching is not used here because, during testing, it was found to lead to a significant number of false positives.
Classify the sector in which the organisation is active. The code uses a pre-trained model to identify one of eight sectors in which the organisation is active. The model is trained on the 2020 annual report PDF files of CBF-certified organisations.
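The organisation matching rule described above (full, case-insensitive matches on currentStatutoryName or shortBusinessName, optionally with 'Stichting' prepended to the identified organisation) could look roughly like the sketch below. The function name match_organisation, the rsin column name, and the sample row are assumptions made for illustration; the actual layout of the template file may differ.

```python
import csv
import io


def match_organisation(org, anbi_rows):
    """Return the rsin of the first row whose name fully matches org, else None."""
    cleaned = org.strip().lower()
    # Accept the organisation as found, or with 'Stichting' added at the start.
    candidates = {cleaned, "stichting " + cleaned}
    for row in anbi_rows:
        for column in ("currentStatutoryName", "shortBusinessName"):
            name = (row.get(column) or "").strip().lower()
            if name and name in candidates:
                return row["rsin"]  # assumed column name
    return None


# Hypothetical sample data in the assumed column layout.
SAMPLE_CSV = """rsin,currentStatutoryName,shortBusinessName
123456789,Stichting Voorbeeld,Voorbeeld NL
"""
anbi_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

Because only exact matches are accepted, a near miss such as "Voorbeelden" is not linked to any row, which reflects the decision to avoid fuzzy matching.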
Transparency in the Netherlands’ Nonprofit Sector