PowerShell: Archiving All The Web-Pages on A Site (Example: bbc.co.uk/food)

With the outcry caused by the BBC removing the food section from their website, and the rush of people trying to mirror it or download the data, I thought: just how would you do that with PowerShell?

(Of course, if you’re not interested in re-inventing the wheel Wget does this much better)

After a bit of thought I came up with the following requirements:

  • It would need to recursively call itself, going through all the links on each page.
  • These should be filtered so I only get the pages matching a particular sub-page (in the BBC example we only want the /food ones).
  • It should download the pages and try and keep a representation of the hierarchy (so /food/recipes/cucumber.html is saved to \food\recipes\cucumber.html on the disk; see the sketch after this list).
  • I’m not interested in fixing the links yet; as long as we get a copy it should be fine.
  • We need some way to terminate the recursion so it doesn't keep processing the same pages; each page should only be processed once.
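
The hierarchy part is mostly string manipulation on the URL path. A minimal sketch, assuming an archive root of C:\Archive (the folder name and the example URL are just placeholders):

  # Map /food/recipes/cucumber.html to \food\recipes\cucumber.html under a local root
  $outputRoot = 'C:\Archive'
  $url        = [System.Uri]'http://www.bbc.co.uk/food/recipes/cucumber.html'

  # AbsolutePath gives '/food/recipes/cucumber.html'; swap the separators for Windows
  $relative  = $url.AbsolutePath.TrimStart('/').Replace('/', '\')
  $localPath = Join-Path $outputRoot $relative

  # Create the folder structure before the file gets written
  New-Item -ItemType Directory -Path (Split-Path $localPath) -Force | Out-Null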

So the rough loop is: fetch a web page, save it to disk, record its name in a list so we don't visit it again, then follow every link matching the sub-page we want and repeat. You'd end up with a nice folder full of file versions of the website.
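
Pulling that together, the skeleton of the recursion might look something like the sketch below, fetching each page with Invoke-WebRequest (more on that in a moment). The function name Save-Page, the $script:Visited set and the parameter names are all my own inventions, and it doesn't yet cope with absolute links, query strings, or an index page like /food wanting the same name as its folder.

  # Keep track of every page we've already processed so the recursion terminates
  $script:Visited = New-Object 'System.Collections.Generic.HashSet[string]'

  function Save-Page {
      param(
          [string]$Url,        # page to fetch, e.g. http://www.bbc.co.uk/food
          [string]$Filter,     # sub-page to stay inside, e.g. /food
          [string]$OutputRoot  # local archive root, e.g. C:\Archive
      )

      # HashSet.Add returns $false if the URL was already in the set - bail out if so
      if (-not $script:Visited.Add($Url)) { return }

      $response = Invoke-WebRequest -Uri $Url

      # Mirror the URL path as a folder hierarchy under the output root
      $relative = ([System.Uri]$Url).AbsolutePath.TrimStart('/').Replace('/', '\')
      $target   = Join-Path $OutputRoot $relative
      New-Item -ItemType Directory -Path (Split-Path $target) -Force | Out-Null
      $response.Content | Out-File -FilePath $target

      # Follow every link that matches the sub-page filter and recurse
      $base = ([System.Uri]$Url).GetLeftPart([System.UriPartial]::Authority)
      foreach ($link in $response.Links | Where-Object { $_.href -like "$Filter*" }) {
          Save-Page -Url ($base + $link.href) -Filter $Filter -OutputRoot $OutputRoot
      }
  }

  Save-Page -Url 'http://www.bbc.co.uk/food' -Filter '/food' -OutputRoot 'C:\Archive'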

Invoke-WebRequest is good for this; it gets the web page content but also puts all the links from the page into a handy Links property on the returned object. Easy to enumerate through!
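
For example (the URL and the '/food/*' pattern are just illustrations; on Windows PowerShell the link objects also carry things like innerText from the HTML parser):

  $page = Invoke-WebRequest -Uri 'http://www.bbc.co.uk/food'

  # Each anchor on the page is exposed as an object with an href property
  $page.Links | Select-Object -Property href -First 5

  # Keep only the links pointing at the sub-page we care about
  $foodLinks = $page.Links | Where-Object { $_.href -like '/food/*' }
  $foodLinks.Count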

The script and detail are after the break.

Edit:  I had another pass at this script and optimised it a bit here.