GNU Wget Tutorial

As a student, you may find yourself wanting to download lots of lecture slides and other materials from a module homepage, which quickly becomes tedious by hand. Thankfully, the GNU project provides Wget, which is already installed on most Linux machines. It is best demonstrated by example:

wget -r -l5 -np -k -nH --cut-dirs=5 --load-cookies cookies.txt http://www2.warwick.ac.uk/fac/sci/physics/current/teach/module_home/px421/

-r
Makes wget act recursively, i.e. it follows links found on the current page (much like a search engine spider).

-l
Specifies the depth, i.e. how many links away from the starting page wget may go. If you imagine the links on the starting page as branches, and the links on those pages as further branches, then -l5 sets the maximum branch distance from the starting page to five (which is also wget's default for recursive downloads).

-np
No parent. wget will only progress down the directory tree, i.e. it will not work its way back up into http://www2.warwick.ac.uk/fac/sci/physics/current/teach/module_home/
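To get a feel for the rule, here is a rough shell sketch of the test -np applies. It is a simplification, not wget's actual code: is_below is a made-up helper, the lectures/week1.pdf path in the usage line is invented for illustration, and the starting URL is assumed to end in a slash.

```shell
# is_below START_URL CANDIDATE_URL
# Hypothetical helper: succeeds only if CANDIDATE_URL lives at or below
# START_URL. This is a plain prefix comparison, which approximates
# (but does not reproduce) how wget decides what counts as a parent.
is_below() {
    case $2 in
        "$1"*) return 0 ;;   # candidate starts with the start URL: allowed
        *)     return 1 ;;   # anything else (parents, siblings): skipped
    esac
}

# Usage (the second URL is an invented example):
#   start=http://www2.warwick.ac.uk/fac/sci/physics/current/teach/module_home/px421/
#   is_below "$start" "${start}lectures/week1.pdf" && echo "would be followed"
```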

-k
Convert links. When wget downloads a page, say index.html, the links on it still point at the website, just as they do in your browser. -k rewrites them as local links once the download has finished, so that you can navigate your way through the pages on your local machine.

-nH
No host directories. Otherwise wget would create a folder named "www2.warwick.ac.uk" (the host name) and store everything it downloads inside it, which is normally undesirable.

--cut-dirs=5
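Skips the first 5 directory components of the remote path when saving files. Otherwise wget would recreate the whole path

fac/
sci/
physics/
current/
teach/
module_home/
px421/

as a directory tree which you don't want to have to click through. With -nH and --cut-dirs=5, everything from fac/ down to teach/ is dropped, so the files end up in module_home/px421/ inside the current directory.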
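If the interaction between -nH and --cut-dirs is hard to picture, the small shell sketch below mimics how the local path is derived from a URL. It only approximates wget's behaviour for simple URLs like the one above; local_path is a made-up helper rather than anything wget provides, and the index.html file name in the usage lines is an assumption.

```shell
# local_path URL CUT_DIRS [KEEP_HOST]
# Hypothetical helper: prints roughly where wget would save URL.
#   CUT_DIRS  - number of leading directories to drop (like --cut-dirs)
#   KEEP_HOST - "yes" keeps the host directory (i.e. no -nH); default "no"
# Assumes URL has the form scheme://host/some/path.
local_path() {
    url=$1
    cut=$2
    keep_host=${3:-no}

    rest=${url#*://}    # strip the scheme, e.g. "http://"
    host=${rest%%/*}    # everything up to the first slash
    path=${rest#*/}     # everything after the first slash

    # Drop one leading directory per iteration, never the file name itself.
    i=0
    while [ "$i" -lt "$cut" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   break ;;
        esac
        i=$((i + 1))
    done

    if [ "$keep_host" = yes ]; then
        printf '%s/%s\n' "$host" "$path"
    else
        printf '%s\n' "$path"
    fi
}

# Usage:
#   u=http://www2.warwick.ac.uk/fac/sci/physics/current/teach/module_home/px421/index.html
#   local_path "$u" 5        # like -nH --cut-dirs=5
#   local_path "$u" 0 yes    # like neither option
```

Trying a few values of CUT_DIRS this way is a cheap way to pick the number you want before starting a long download.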

--load-cookies
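Normally content is restricted and you need to log in, so you need to supply wget with your browser's cookies. If you are a Firefox user then there is an extension called 'cookie exporter', which you can use to output your cookies to a file called cookies.txt. wget expects this file to be in the Netscape cookies.txt format, so log in with your browser first and then export.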
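If wget still gets bounced to the login page, it is worth checking that the exported file really is in the Netscape format: one cookie per line, with seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). The sketch below is one way to do that check; check_cookies is a made-up helper and not part of wget.

```shell
# check_cookies FILE
# Hypothetical helper: succeeds if every line of FILE that is not a
# comment or blank has exactly 7 tab-separated fields; otherwise it
# reports the offending lines and returns a non-zero status.
check_cookies() {
    awk -F '\t' '
        /^#/ || /^[[:space:]]*$/ { next }   # skip comments and blank lines
        NF != 7 {
            bad = 1
            printf "line %d: expected 7 fields, got %d\n", NR, NF
        }
        END { exit bad }
    ' "$1"
}

# Usage:
#   check_cookies cookies.txt && echo "cookies.txt looks well-formed"
```

This only checks the layout of the file. It cannot tell you whether the session cookie inside has already expired.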

That’s it!
