
I want to archive an external website in the Wayback Machine. You can do that by uploading a Google Sheet with links to all the pages you want to archive. How can I, either on a Mac or with some web service (I assume), crawl a site and extract the links to a text file?

I have googled and looked on GitHub for tools, etc., and what I have found mostly falls into two categories:

  1. Extract all links from one single web page.
  2. Spider and then download a whole site, including all HTML, images, and so on.

Of course, you could patch something together from these components, but I think there should be ready-made solutions that are much more robust than anything I could hack together within a reasonable time frame.


1 Answer

#!/bin/bash
# Fetch a single page and write the href targets it contains to scraped_urls.txt

if [ -z "$1" ]; then
    echo "Usage: $0 <url>" >&2
    exit 1
fi

mkdir temp_url_scrape && cd temp_url_scrape || exit 1

# Download the page to a local file
wget -O index.html "$1"

# Pull out every href="..." attribute, strip the href=" prefix and the
# closing quote, and drop duplicates while preserving order
grep -o -E 'href="[^"]+"' index.html \
    | sed -e 's/^href="//' -e 's/"$//' \
    | awk '!seen[$0]++' > ../scraped_urls.txt

cd ..
rm -rf temp_url_scrape

exit 0

Save as: link_scrape.sh
Make it executable: chmod +x link_scrape.sh
Usage: ./link_scrape.sh https://www.example.com

The URLs will be in scraped_urls.txt
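
Note that the script above only looks at the single page you give it. To crawl a whole site and collect every URL that wget can reach, a minimal sketch along the following lines should work (it assumes GNU wget; the file names are arbitrary, the exact log format differs between wget versions, so the grep pattern is deliberately loose, and some sites may need extra options such as --no-parent):

#!/bin/bash
# Sketch: crawl a site with wget's spider mode and collect the URLs it visits.
# Pages are fetched only to discover links and are not kept on disk.

site="$1"

# -r / -l inf: recurse with no depth limit, -nv: terse one-line-per-URL log
wget --spider --recursive --level=inf --no-verbose \
     --output-file=wget_log.txt "$site"

# Pull anything that looks like an http(s) URL out of the log and de-duplicate
grep -oE 'https?://[^ ]+' wget_log.txt | sort -u > crawled_urls.txt

rm wget_log.txt

crawled_urls.txt should then contain one URL per line, ready to paste into the Google Sheet you upload to the Wayback Machine.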