
I want to archive an external website in the Wayback Machine. You can do that by uploading a Google Sheet with links to all the pages you want to archive. How can I, either on a Mac or with some web service (I assume), crawl a site and extract the links to a text file?

I have googled and looked on GitHub for tools, etc., and what I have found mostly falls into two categories:

  1. Extract all links from one single web page.
  2. Spider and then download a whole site, including all HTML, images, and so on.

Of course, you could patch something together from these components, but I think there should be ready-made solutions that are much more robust than anything I could hack together within a reasonable time frame.


1 Answer

#!/bin/bash
# Fetch a single page and write the href targets it contains to scraped_urls.txt

if [ -z "$1" ]; then
    echo "Usage: $0 <url>" >&2
    exit 1
fi

mkdir temp_url_scrape && cd temp_url_scrape || exit 1

# Download the page to a local file
wget -O index.html "$1"

# Pull out every href="..." attribute, strip the href=" prefix and the
# closing quote, and drop duplicates while preserving order
grep -o -E 'href="[^"]+"' index.html \
    | sed -e 's/^href="//' -e 's/"$//' \
    | awk '!seen[$0]++' > ../scraped_urls.txt

cd ..
rm -rf temp_url_scrape

exit 0

Save as: link_scrape.sh
Make it executable: chmod +x link_scrape.sh
Usage: ./link_scrape.sh https://www.example.com

The URLs will be in scraped_urls.txt
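
Note that the script above only looks at the single page you give it. To crawl a whole site and collect every URL that wget can reach, a minimal sketch along the following lines should work (it assumes GNU wget; the file names are arbitrary, the exact log format differs between wget versions, so the grep pattern is deliberately loose, and some sites may need extra options such as --no-parent):

#!/bin/bash
# Sketch: crawl a site with wget's spider mode and collect the URLs it visits.
# Pages are fetched only to discover links and are not kept on disk.

site="$1"

# -r / -l inf: recurse with no depth limit, -nv: terse one-line-per-URL log
wget --spider --recursive --level=inf --no-verbose \
     --output-file=wget_log.txt "$site"

# Pull anything that looks like an http(s) URL out of the log and de-duplicate
grep -oE 'https?://[^ ]+' wget_log.txt | sort -u > crawled_urls.txt

rm wget_log.txt

crawled_urls.txt should then contain one URL per line, ready to paste into the Google Sheet you upload to the Wayback Machine.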