With the outcry caused by the BBC removing the Food section from their website, and the rush of people trying to mirror it or download the data, I thought: just how would you do that with PowerShell?
(Of course, if you’re not interested in re-inventing the wheel, Wget does this much better.)
After a bit of thought I came up with the following requirements:
- It would need to call itself recursively, working through all the links on each page.
- Those links should be filtered so I only get the pages under a particular sub-page (in the BBC example we only want the /food ones).
- It should download the pages and keep a representation of the hierarchy, so /food/recipes/cucumber.html is saved to \food\recipes\cucumber.html on disk (there's a rough sketch of that mapping after this list).
- I’m not interested in fixing the links yet; as long as we get a copy it should be fine.
- We need some way to terminate the recursion so it doesn’t keep processing the same pages; each page should only be processed once.
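As a rough illustration of that path mapping, here's a minimal sketch; the Get-LocalPath name, the C:\SiteArchive root and the .html fallback are my own placeholders, not part of the script that follows the break:

```powershell
# Map a site-relative URL path to a file path under a local archive root.
function Get-LocalPath {
    param(
        [string]$UrlPath,                     # e.g. '/food/recipes/cucumber.html'
        [string]$LocalRoot = 'C:\SiteArchive' # assumed archive location
    )
    # Turn the URL path into a relative Windows path
    $relative = $UrlPath.TrimStart('/') -replace '/', '\'
    # Pages without an extension get .html so they open nicely later
    if (-not [IO.Path]::GetExtension($relative)) { $relative += '.html' }
    Join-Path -Path $LocalRoot -ChildPath $relative
}

Get-LocalPath '/food/recipes/cucumber.html'  # -> C:\SiteArchive\food\recipes\cucumber.html
```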
So a rough loop would be: visit a web page, follow all the links matching the sub-page we want, write them out to disk, and record each page name in a list to make sure we don’t visit it again. You’d end up with a nice folder full of file versions of the website.
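One convenient way to get that "don't visit it again" behaviour (my suggestion rather than anything from the final script) is a HashSet instead of a plain list, because Add() reports whether the entry was already recorded:

```powershell
$visited = [System.Collections.Generic.HashSet[string]]::new()
$visited.Add('/food/recipes/cucumber.html')  # $true  - first visit, carry on
$visited.Add('/food/recipes/cucumber.html')  # $false - already archived, skip it
```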
Invoke-WebRequest is good for this; it gets the web page content but also puts all the links from the page in a handy property of the object. Easy to enumerate through!
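To make that concrete, here's a minimal sketch of the whole loop. The Save-Site name is hypothetical, it assumes the Get-LocalPath helper sketched above, hard-codes the BBC base URL, and skips error handling and absolute or otherwise odd links, so treat it as an outline rather than the finished script:

```powershell
function Save-Site {
    param(
        [string]$Url,
        [string]$BaseUrl = 'https://www.bbc.co.uk',
        [string]$SubPage = '/food',
        [System.Collections.Generic.HashSet[string]]$Visited = [System.Collections.Generic.HashSet[string]]::new()
    )
    # Terminate the recursion if we've already archived this page
    if (-not $Visited.Add($Url)) { return }

    $response  = Invoke-WebRequest -Uri $Url
    $localPath = Get-LocalPath ([uri]$Url).AbsolutePath

    # Write the page out, keeping the /food/... hierarchy on disk
    New-Item -ItemType Directory -Path (Split-Path $localPath) -Force | Out-Null
    Set-Content -Path $localPath -Value $response.Content

    # Recurse into every link that stays under the sub-page we care about
    $response.Links.href |
        Where-Object { $_ -like "$SubPage*" } |
        Sort-Object -Unique |
        ForEach-Object { Save-Site -Url ($BaseUrl + $_) -BaseUrl $BaseUrl -SubPage $SubPage -Visited $Visited }
}

Save-Site -Url 'https://www.bbc.co.uk/food'
```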
The script and detail are after the break.
Edit: I had another pass at this script and optimised it a bit here.