FAReviews: Review Argumentation at Scale

This package contains a scalable pipeline that performs argument reasoning on a dataset of product reviews in order to identify the highest-quality ones. By tuning both lossy and lossless parameters, the pipeline can be geared towards either accuracy or speed.
FAReviews uses Python 3 for argumentation mining and Prolog for argument reasoning. The required Prolog scripts can be found in the "argue" folder.
To install the required Python 3 packages, use:
pip3 install -r requirements.txt
pip3 install --upgrade spacy
pip3 install pytextrank
Download the pre-trained word2vec vectors trained on part of the Google News dataset (about 100 billion words), GoogleNews-vectors-negative300.bin.gz, and save the file in the FAReviews folder.
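The pipeline presumably loads these vectors with gensim; the exact loading code inside FAReviews is an assumption, but the following is a quick sanity check that the downloaded file is usable:

from gensim.models import KeyedVectors

# Load only the first 100k vectors to keep the check fast (limit is optional)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=100000)
print(vectors["review"].shape)  # expected: (300,)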
Download the Amazon reviews data set:
wget -c "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/AMAZON_FASHION_5.json.gz"
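The downloaded file contains one JSON object per line. A quick way to peek at the first records without fully decompressing it (field names such as asin, overall, and reviewText follow the Amazon reviews format):

import gzip
import json

with gzip.open("AMAZON_FASHION_5.json.gz", "rt") as f:
    for i, line in enumerate(f):
        review = json.loads(line)
        print(review.get("asin"), review.get("overall"), str(review.get("reviewText"))[:80])
        if i == 2:  # show the first three records only
            break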
Install the spaCy en_core_web_md pipeline and the NLTK stopwords:
python -m spacy download en_core_web_md
python -m nltk.downloader stopwords
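To verify that the model and resources installed above work, and that pytextrank registers with spaCy ("textrank" is pytextrank's standard component name for spaCy v3; whether FAReviews configures the component exactly this way is an assumption):

import spacy
import pytextrank  # noqa: F401  (registers the "textrank" pipeline component)
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("textrank")
doc = nlp("This jacket fits well and the fabric feels durable.")
for phrase in doc._.phrases[:3]:
    print(phrase.text, round(phrase.rank, 3))  # textrank scores per phrase
print(stopwords.words("english")[:5])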
Perform feature extraction:
python3 ./utils/compute_scores.py
The script will ask you to provide the number of jobs, the chunk size, the batch size, the textrank threshold, and the output folder. It creates the following files in the output folder:
[datafile name]_prods.pkl
[datafile name]_reviews.csv
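The exact contents of these files are determined by compute_scores.py; a minimal way to inspect them (the "Output/mydata_*" paths below are hypothetical stand-ins for your own datafile name and output folder):

import pickle
import pandas as pd

# [datafile name]_reviews.csv: per-review data
reviews = pd.read_csv("Output/mydata_reviews.csv")
print(reviews.columns.tolist())
print(reviews.head())

# [datafile name]_prods.pkl: the product list
with open("Output/mydata_prods.pkl", "rb") as f:
    prods = pickle.load(f)
print(type(prods), len(prods))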
Create the matrix with distance metrics:
python3 ./utils/graph_creation.py
The script will ask you to provide the csv file with review data ([datafile name]_reviews.csv), the pkl file with the product list ([datafile name]_prods.pkl), the number of cores to use, and the output folder. It creates the following file in the output folder:
[datafile name]_prods_mc.pkl
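As an illustration of the kind of distance matrix this step builds (this sketch uses cosine distance over random stand-in embeddings; the actual metrics and representations used by graph_creation.py may differ):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical: one embedding vector per review of the same product,
# e.g. averaged word2vec vectors of the review's top-ranked tokens
embeddings = np.random.rand(5, 300)
dist = cosine_distances(embeddings)  # 5x5 symmetric pairwise distance matrix
print(dist.round(2))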
Download the argue folder, then run the following commands to start the Prolog server (the ?- line is entered at the SWI-Prolog prompt):
cd argue
swipl server.pl
?- server(3333).
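graph_solver.py expects this server to be reachable. Before moving on, you can check that something is listening on port 3333 (the port number follows the server(3333) call above; the check itself is just a generic socket probe, not part of FAReviews):

import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    if s.connect_ex(("localhost", 3333)) == 0:
        print("Prolog server is listening on port 3333")
    else:
        print("Nothing is listening on port 3333; start the server first")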
While the server is running, solve the argumentation graphs:
python3 ./utils/graph_solver.py
The script will ask you to provide the csv file with review data ([datafile name]_reviews.csv), the pkl file with the product list, matrices, and clusters ([datafile name]_prods_mc.pkl), the number of cores to use, the output folder, and whether or not to save figures of the created graphs to png. It creates in the output folder:
[datafile name]_reviews_results.csv
[product asin].png
[product asin_labels].png
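Once graph_solver.py finishes, the results csv can be examined with pandas. The ranking/score columns are not specified in this README, so the snippet below only lists whatever is there (the "Output/mydata_*" path is again a hypothetical stand-in):

import pandas as pd

# [datafile name]_reviews_results.csv: reviews with argumentation results
results = pd.read_csv("Output/mydata_reviews_results.csv")
print(results.columns.tolist())
print(results.head())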
The three scripts described above can be run sequentially using FAReviews.py, which lets the user provide the input data and several input parameters in one place. For the argument reasoning part to start from FAReviews.py, make sure the Prolog server is running.
The following arguments can be provided to FAReviews.py (only -f is required; the others are optional):

-f : Location of the input data file (csv expected). Required argument.
-nc : Number of cores to use for the various processes that are run in parallel. Default = 8.
-cs : Chunk size used in compute_scores. Default = 100.
-bs : Batch size used in compute_scores. Default = 20.
-trt : Minimum textrank score threshold for the tokens to be used. Tokens with a textrank score below the threshold are not used. The threshold is applied in compute_scores, and the resulting output is passed to the scripts that follow. Default = 0.0.
-sn : Name of the output folder (within the current folder) where you want to save the output. If it does not yet exist, it will be created. Default is Output.
-si : True/False. If true, the intermediate output of compute_scores and graph_creation is also saved; if false, only the final output of graph_solver is saved. Default is False.
-sf : True/False. Option to save the constructed graphs to png per product. Default is False.

After you have started the Prolog server, run the script, for example, as follows:
python3 FAReviews.py -f Data/mydata.csv -nc 4 -cs 50 -bs 40 -trt 0.10 -sn MyOutputFolder
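If you want to drive the same run from another Python script, a plain subprocess call works (this assumes FAReviews.py is a CLI-only entry point with no importable API):

import subprocess

# Equivalent to the shell invocation above; check=True raises on failure
subprocess.run(
    ["python3", "FAReviews.py", "-f", "Data/mydata.csv",
     "-nc", "4", "-cs", "50", "-bs", "40", "-trt", "0.10", "-sn", "MyOutputFolder"],
    check=True)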
The script will output the final results of graph_solver (and intermediate results/graphs if the respective arguments are set), and will print which part it is currently working on and how long the finished parts took to complete. Note that if you use the textrank threshold, tokens with a textrank score below the threshold are not used and therefore not saved in any of the output files.
If you want to contribute to the development of FAReviews, have a look at the contribution guidelines.