Internet Archive scraper

The Internet Archive scraper is a workflow for collecting webpages archived in the Internet Archive's Wayback Machine. It is designed to run serverlessly on Amazon's AWS infrastructure.


What Internet Archive scraper can do for you

ia-webscraping (Internet Archive scraper) provides code to set up an AWS workflow for collecting and analyzing webpages from the Internet Archive (IA).

Using HashiCorp's Terraform, the project provides scripts that define and provision the required infrastructure on Amazon's AWS. Specifically, it uses the AWS services SQS (message queuing), Lambda (serverless code execution), S3 (storage), and Kinesis Data Firehose (data delivery) to set up a pipeline that facilitates scraping pages from the IA. An AWS account is required to deploy the pipeline.
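As a rough illustration of how these services fit together, the sketch below uses boto3 to enqueue a scraping task on SQS and to deliver a scraped record to Kinesis Data Firehose, which batches records into S3. This is not the repository's code; the queue URL and delivery stream name are hypothetical, and the real resources are created by the Terraform scripts.

```python
# Minimal sketch of the SQS -> Lambda -> Firehose -> S3 flow.
# Queue URL and stream name are hypothetical placeholders.
import json
import boto3

sqs = boto3.client("sqs")
firehose = boto3.client("firehose")

# Enqueue a URL for the scraping Lambda to pick up.
sqs.send_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/ia-scrape-queue",
    MessageBody=json.dumps(
        {"url": "http://example.com/page.html", "timestamp": "20200101000000"}
    ),
)

# Deliver a scraped record to Firehose, which buffers and writes it to S3.
firehose.put_record(
    DeliveryStreamName="ia-scrape-stream",
    Record={
        "Data": json.dumps(
            {"url": "http://example.com/page.html", "text": "…page text…"}
        ).encode() + b"\n"
    },
)
```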

Once deployed, the pipeline can be fed a list of domains. The software obtains all URLs available in the IA for each of those domains. Because the IA can store multiple snapshots of the same URL, a time period can be configured; all snapshots of a URL captured within that period are kept. The resulting list of URLs is passed to a second function, which retrieves the corresponding pages from the IA. For each page, the human-readable text and, optionally, all links on the page are saved to a database consisting of Apache Parquet files stored in an S3 bucket. Afterwards, the Parquet files can be downloaded and read for analysis in R, Python, or other compatible software.
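The project's own Lambda functions are not reproduced here, but the core steps can be sketched in plain Python. The snippet below is an illustration only: it queries the Wayback Machine's public CDX API for snapshots of a domain within a time window, fetches one archived page, extracts the human-readable text and links, and writes the result to a local Parquet file.

```python
# Illustrative sketch of the scraping steps using the public Wayback
# CDX API; this is not the repository's own code.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# 1. List snapshots of a domain captured between 2019 and 2020.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",        # all URLs under the domain
        "from": "2019",
        "to": "2020",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",    # successful captures only
    },
)
rows = resp.json()
snapshots = [dict(zip(rows[0], r)) for r in rows[1:]]  # first row is the header

# 2. Fetch one archived page and extract its text and links.
ts, original = snapshots[0]["timestamp"], snapshots[0]["original"]
page = requests.get(f"http://web.archive.org/web/{ts}/{original}")
soup = BeautifulSoup(page.text, "html.parser")
record = {
    "url": original,
    "timestamp": ts,
    "text": soup.get_text(separator=" ", strip=True),
    "links": [a["href"] for a in soup.find_all("a", href=True)],
}

# 3. Save the record as a Parquet file (the pipeline writes these to S3).
pd.DataFrame([record]).to_parquet("snapshots.parquet")
```

Reading the downloaded Parquet files back for analysis is then a one-liner, for example pd.read_parquet("snapshots.parquet") in Python or arrow::read_parquet("snapshots.parquet") in R.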

Programming languages
  • Python 64%
  • HCL 24%
  • Shell 12%

Participating organisations

Utrecht University

Contributors

Maarten Schermer
Robert Jan Bood
SURF