Ashley Sheridan

Generating sitemaps for large sites

Posted on


I recently had a requirement at work to produce a sitemap (in the form of a basic list of URLs in a text document). Normally, I'd have a look at one of the various online sitemap generators, but they tend to stop indexing at a few hundred pages and require payment to index the rest; and the site in question contained around 7,000 pages, spanning the mobile and desktop versions.

Faced with this, I set about making a small command line script in Bash to create this list. As well as being able to spider so many URLs, I also had a couple of other requirements:

  • The script would need to be able to generate separate lists for both mobile and desktop versions of the site.
  • The script would need the ability to log in to an HTTP authentication set up on the server.

The following is the result of my efforts:

#!/bin/bash

mobile=()
user=''
password=''

while getopts ":u:p:m" opt; do
    case $opt in
        m ) mobile=(-U 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16');;
        u ) user="--http-user=$OPTARG";;
        p ) password="--http-password=$OPTARG";;
        \? ) echo "usage: sitemap [-u username] [-p password] [-m]"
             exit 1;;
    esac
done
shift $(($OPTIND - 1))

if [ $# -ne 1 ]; then
    echo "you must specify a URL to spider"
    exit 1
else
    # spider the site
    wget --spider --recursive --no-verbose --output-file=wgetlog.txt "$1" "${mobile[@]}" $user $password

    # filter out the actual URLs
    sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt \
        | sed "s@&amp;@\&@" > sedlog.txt

    # sort the list and get only unique lines
    sort sedlog.txt | uniq > urls.txt
fi

The while loop is a standard structure in Bash scripts for processing command line arguments of the form -c option_argument. This script accepts the optional arguments -u and -p for the HTTP auth username and password (each followed by its value, quoted if necessary), and -m to indicate that the script should spider the site as if it were a mobile browser. The shift operation on line 16 just moves the internal argument pointer along by the number of options that were processed in the while loop.

Line 18 checks to see if there are any command line arguments left (the script needs the URL to index!). From there, wget is used to spider the site, using the variables set up in the while loop if they were supplied when calling the script, and writes its results to the file wgetlog.txt. This file is then processed with sed to retrieve just the URLs from the output, which are written to the file sedlog.txt.
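To see what the sed stage is doing, here is a single line in the shape of wget's --no-verbose spider output (the timestamp and byte count are made up for this example) run through the same two expressions; the second sed decodes the &amp; entity that wget records because URLs are taken straight from the HTML source:

```shell
# A made-up line in the shape of wget's --no-verbose log output
line='2014-01-01 12:00:00 URL:http://www.example.com/page?a=1&amp;b=2 [1024] -> "page" [1]'

# Extract the URL field, then turn the HTML entity &amp; back into a bare &
echo "$line" \
    | sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" \
    | sed "s@&amp;@\&@"
```

which prints http://www.example.com/page?a=1&b=2.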

Lastly, line 30 sorts sedlog.txt and uses uniq to keep only unique adjacent lines (hence the need to sort them first).
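The reason for the sort is easy to demonstrate with some throwaway data: uniq only collapses duplicates that sit on adjacent lines, so an unsorted list slips straight through:

```shell
# uniq alone leaves non-adjacent duplicates in place...
printf '%s\n' b a b a | uniq

# ...but sorting first groups them so uniq can drop the repeats
printf '%s\n' b a b a | sort | uniq
```

The second pipeline prints just a and b. As a shortcut, sort -u would do both steps in a single command.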

The script can take quite a while to process a large site, but you can watch its progress in wgetlog.txt (generating this is what takes the time) by running tail -f wgetlog.txt in a new terminal session. This is just tail with follow mode active, meaning that as soon as the file changes (i.e. it gets updated by wget) the new lines are sent straight to the screen.

The final script is then called like this:

# standard
./sitemap http://www.example.com

# for sites behind HTTP auth
./sitemap -u "username" -p "password" http://www.example.com

# for mobile sites where page URLs might differ
./sitemap -m http://www.example.com