Mirroring Websites with wget

Jim Roberts
2/23/2004

Have you ever wanted to copy a website for offline browsing or backup purposes? There's a powerful unix tool called wget that can do this, and much more. I'll review a simple example of using this tool, and discuss some advanced features that are huge timesavers.

GNU wget is a free utility which runs under unix and Windows. In a nutshell, this program can go out and effectively mirror a website for local browsing or backup purposes. While it has more powerful features, this article will focus on the basics of the tool.

Why wget?

There are countless ways to copy your website. Many website editors actually keep a local copy of the site for you. So, why should you bother with wget? Well, there are several advantages to this utility. It's free, easy to learn, and runs from the command line. The last of these is important, because that makes wget easy to automate through "cron" or "at" jobs. Overall, I find it's a useful utility to have in my toolbox - fast, powerful, and easy to use.

Obtaining and installing wget

First, we'll need to install the software. For Linux users, it may already be there - just type "which wget" at your shell prompt to see if you have it.
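
On a unix shell, the same check can be scripted. This is only a minimal sketch: it uses "command -v", the POSIX way to look up a program, in place of "which", and the messages it prints are just illustrative.

```shell
# Look for wget on the PATH and report where it lives.
if wget_path=$(command -v wget); then
    echo "wget found at $wget_path"
else
    echo "wget not found on PATH; install it before continuing"
fi
```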

Windows users will need to download a binary build. Look for the latest binary, and be sure to download the ssllibs as well (these enable retrieval of https web pages). I downloaded wget 1.9.1 and ssllibs 0.97. I copied all the files to c:\Program Files\wget. If you choose, add this folder to your system path so wget can be run from any directory.

Using wget

Wget is a command line utility, so unix users must run it from a shell, and Windows users need to open a “command prompt” window. The examples below assume you are using Windows, but the commands all apply to other platforms.

Before we start, a word of caution. Keep in mind that when you mirror a website or ftp site, you are consuming significant resources, both in bandwidth and in server processing. Please be considerate when using wget, and please get permission before setting up recurring or intensive mirroring of a large site. Wget does have some features to mitigate the impact on remote servers, namely the -w option discussed in the examples below. Please use your best judgement!

Example 1 – Mirror a website for offline browsing.

This example will mirror a website with all images, etc. to your local machine. Keep in mind that any dynamic content will become “static” on the local copy. In this example, I’ll be forcing any non-html extensions (.cgi, .asp, etc.) to be written as html files. This will facilitate local browsing. Also note, this example does not retrieve the source for any scripts or server side code. The second example will illustrate how to do that.

Ok, let’s try it out. Assuming wget is in your path (if not, you’ll have to cd into the c:\Program Files\wget directory), issue the following commands:

    mkdir c:\wget_files

    cd c:\wget_files

    wget --mirror -w 2 -p --html-extension --convert-links -P c:\wget_files\example1 http://www.yourdomain.com

That’s it! In a few seconds (or minutes, depending on the size of the site and speed of your connection), you’ll have the site downloaded. It will be in a folder called www.yourdomain.com inside the prefix folder. Just open a web browser, choose File -> Open, and browse to the index.html (or appropriate starting page) in the folder just created by wget (c:\wget_files\example1\www.yourdomain.com). This tree is suitable for offline browsing – try it when visiting a client who does not have high-speed network connectivity – or even for burning to CD for posterity.

Now, a brief explanation of the options used:

    --mirror: specifies to mirror the site. Wget will recursively follow all links on the site and download all necessary files. It will also only get files that have changed since the last mirror, which is handy in that it saves download time.

    -w: tells wget to “wait” or pause between requests, in this case for 2 seconds. This is not necessary, but is the considerate thing to do. It reduces the frequency of requests to the server, thus keeping the load down. If you are in a hurry to get the mirror done, you may eliminate this option.

    -p: causes wget to get all required elements for the page to load correctly. Apparently, the mirror option does not always guarantee that all images and peripheral files will be downloaded, so I add this for good measure.

    --html-extension: pages served as HTML whose names do not end in .html get .html appended to the local file name. This gives cgi or asp generated pages an html extension for consistency.

    --convert-links: all links are converted so they will work when you browse locally. Otherwise, relative (or absolute) links would not necessarily load the right pages, and style sheets could break as well.

    -P (prefix folder): the resulting tree will be placed in this folder. This is handy for keeping different copies of the same site, or keeping a “browsable” copy separate from a mirrored copy.
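
For unix users, the options above could be wrapped in a small shell function. This is a sketch under stated assumptions, not the only way to do it: the function name and the folder under $HOME are made up for illustration, and the domain is the same placeholder used in the Windows command.

```shell
# Hypothetical helper: mirror a site for offline browsing.
#   $1 = site URL, $2 = folder to hold the mirror (wget's -P prefix)
mirror_for_browsing() {
    # -E is the short form of --html-extension
    # (newer wget releases name the long form --adjust-extension)
    wget --mirror -w 2 -p -E --convert-links -P "$2" "$1"
}

# Usage (placeholder domain from the article):
#   mirror_for_browsing http://www.yourdomain.com "$HOME/wget_files/example1"
```

Quoting "$1" and "$2" keeps the command intact if the folder name contains spaces.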

Note: These files should not be uploaded back to your server – they have been modified for local viewing, and will most likely break your website if you blindly upload them!

Example 2 – copy your site for backup purposes

This example will create a local copy of your site that is suitable for backup purposes. If you already use a website management tool, such as DreamWeaver, this method may get files that it does not have, for example log files, cgi files, and other data files created on the server. Note: If your site uses a database server, such as mysql or SQL Server, this method will not back up the actual data. Backing up databases is beyond the scope of this article.

Using the same assumptions from the first example, we’ll create a mirror via ftp. You will need to know your ftp username and password, and the ftp host to access your site files. These are likely the same as you use to update your site via DreamWeaver or GoLive, etc. Here’s the full command (all on one line):

    wget --mirror -w 3 -p -P c:\wget_files\example2 ftp://username:password@ftp.yourdomain.com

Once this is done, take a look at the files pulled down. Note, you will likely not be able to browse this copy as you could with Example 1, since none of the links or files were converted. The options for wget work the same as in Example 1.

To keep this mirror in sync, simply run it by hand every so often, or set up a cron (unix) or at (Windows) job to run it on a regular basis.
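
On unix, one way to set up the recurring run is a small wrapper script called from cron. The sketch below is hypothetical throughout: the script name, folders and log file are invented for illustration, the ftp address is the placeholder from the command above, and the -a (append to log file) option is an addition that the original command does not use.

```shell
#!/bin/sh
# mirror_backup.sh: hypothetical wrapper around the Example 2 command.
# The URL, folder and log file below are placeholders; edit them for your site.
FTP_URL="ftp://username:password@ftp.yourdomain.com"
BACKUP_DIR="$HOME/wget_files/example2"
LOG_FILE="$HOME/wget_files/example2.log"

backup_site() {
    # -a appends wget's messages to a log file, since a cron job has no screen
    wget --mirror -w 3 -p -a "$LOG_FILE" -P "$BACKUP_DIR" "$FTP_URL"
}

# Only start the mirror when the script is called with "run",
# so reading or sourcing the file never starts a download by accident.
if [ "${1:-}" = "run" ]; then
    backup_site
fi

# Sample crontab entry (added with "crontab -e"): run every night at 2:30 am.
#   30 2 * * * /path/to/mirror_backup.sh run
```

Because the ftp password sits in the script, it is worth making the file readable only by its owner.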

Example 3 – download a file

While web browsers and ftp programs can handle downloads, wget occasionally comes in handy in this regard. It supports resumption of interrupted downloads, and of course, the command line aspect is useful. Here is a simple example (create the “example3” folder as before):

    wget -P c:\wget_files\example3 http://ftp.gnu.org/gnu/wget/wget-1.9.tar.gz

To continue an interrupted download, add a “-c” option before the -P. Note, this will only work if the remote server supports it. If so, it can save quite a bit of downloading time, especially over a slow connection.
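
As an illustration for unix users, the resumable download could be wrapped like this. It is a minimal sketch: the function name and destination folder are hypothetical, and it assumes wget signals a failed or interrupted transfer with a non-zero exit status.

```shell
# Hypothetical helper: fetch one file into a folder, resuming a partial copy if present.
#   $1 = file URL, $2 = destination folder
resume_download() {
    if wget -c -P "$2" "$1"; then
        echo "complete: $1"
    else
        # wget reports failure through a non-zero exit status
        echo "incomplete, run again to resume: $1" >&2
        return 1
    fi
}

# Usage (URL from Example 3):
#   resume_download http://ftp.gnu.org/gnu/wget/wget-1.9.tar.gz "$HOME/wget_files/example3"
```

Running the same call again after an interruption picks up where the partial file left off, provided the server supports it.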

Other notable features 

Wget has some more advanced features than shown here, which are worth noting in case you ever have a need for them. These include:

  • Support for cookie handling – in the event that a site you are mirroring requires cookies, or uses cookies for certain features to work properly.
  • Support for proxy servers – this can reduce network traffic, and provide greater speed for some downloads.
  • Wgetrc – you can use this file to store often used wget commands and settings. Wget will read this file upon startup.
  • Simple spidering – this feature will check that a page is available – without downloading it. This is useful for monitoring a page, or to check a list of pages to see which still exist (link list or bookmarks, etc.)
  • Quotas – you can specify a maximum to download during a recursive download.
  • http user / password – If you are downloading a password protected site, you can pass along access information to wget.
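
Two of the features listed above, the wgetrc file and simple spidering, might be tried out on a unix shell roughly as follows. Treat this as a sketch under assumptions: the sample file location, the function name and the URLs in the usage line are made up, the settings values are arbitrary, and the check assumes wget exits with a non-zero status when a page cannot be retrieved.

```shell
# Write a sample settings file to a scratch location (not your real ~/.wgetrc)
# and point wget at it through the WGETRC environment variable.
WGETRC="${TMPDIR:-/tmp}/sample_wgetrc"
cat > "$WGETRC" <<'EOF'
# Pause 2 seconds between requests, like -w 2
wait = 2
# Stop a recursive download after roughly 50 MB, like -Q 50m
quota = 50m
EOF
export WGETRC

# Report whether each URL given as an argument still answers, without downloading it.
check_links() {
    for url in "$@"; do
        if wget --spider -q "$url"; then
            echo "ok      $url"
        else
            echo "missing $url"
        fi
    done
}

# Usage (placeholder pages):
#   check_links http://www.yourdomain.com/ http://www.yourdomain.com/old-page.html
```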

Wrap Up

These examples cover some of the basic uses for the program. Spend some time reading through all the options available to get a better handle on its full capabilities. However, even if you only use the options I have demonstrated here, I think you will agree that this is quite a handy tool to keep around.