PowerShell: Archiving All The Web-Pages on A Site (Example: bbc.co.uk/food)

With the outcry caused by the BBC removing the food section from its website, and the rush of people trying to mirror it or download the data, I wondered: just how would you do that with PowerShell?

(Of course, if you’re not interested in reinventing the wheel, Wget does this much better.)

After a bit of thought I came up with the following requirements:

  • It would need to call itself recursively, following all the links on each page.
  • The links should be filtered so I only get pages under a particular sub-path (in the BBC example we only want the /food ones).
  • It should download the pages and try to keep a representation of the hierarchy (so /food/recipes/cucumber.html is saved to \food\recipes\cucumber.html on the disk).
  • I’m not interested in fixing the links yet; as long as we get a copy it should be fine.
  • We need some way to terminate the recursion so it doesn’t keep processing the same pages. Each page should be processed only once.
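As a rough illustration of the filtering and hierarchy requirements, here is a minimal sketch. The function names Test-InScope and Convert-UrlToLocalPath, the 'archive' root folder, and the choice of index.html for directory-style URLs are my own assumptions for illustration, not part of the original script.

```powershell
# Hypothetical helper: is this URL under the sub-path we want (e.g. /food)?
function Test-InScope {
    param([uri]$Uri, [string]$PathPrefix)
    $prefix = $PathPrefix.TrimEnd('/')
    $path   = $Uri.AbsolutePath
    # Match the prefix itself or anything below it, so /foodies is not mistaken for /food
    return (($path -eq $prefix) -or
            $path.StartsWith($prefix + '/', [System.StringComparison]::OrdinalIgnoreCase))
}

# Hypothetical helper: map a URL path onto a file path below a root folder
function Convert-UrlToLocalPath {
    param([uri]$Uri, [string]$Root)
    $relative = $Uri.AbsolutePath.TrimStart('/')
    # Assumption: directory-style URLs are saved as index.html
    if ($relative -eq '' -or $relative.EndsWith('/')) { $relative += 'index.html' }
    $separator = [string][System.IO.Path]::DirectorySeparatorChar
    return [System.IO.Path]::Combine($Root, $relative.Replace('/', $separator))
}

# Usage with made-up URLs
Test-InScope -Uri 'http://www.bbc.co.uk/food/recipes/cucumber.html' -PathPrefix '/food'
Convert-UrlToLocalPath -Uri 'http://www.bbc.co.uk/food/recipes/cucumber.html' -Root 'archive'
```

The parent folder still has to be created (for example with New-Item -ItemType Directory -Force) before the page content is written to that path.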

So a rough loop would be: fetch a web page, save it to disk, record its URL in a list so we don’t visit it again, then follow every link on it that matches the sub-path we want. You’d end up with a folder holding a file copy of each page on that part of the site.
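The loop described above could be sketched like this. To keep the example runnable without a network connection, the link lookup is passed in as a script block and fed from a small made-up site; Invoke-SiteCrawl, the example.test URLs and the -GetLinks parameter are all hypothetical names, and saving to disk is left out.

```powershell
function Invoke-SiteCrawl {
    param(
        [string]$Url,
        [string]$PathPrefix,
        [scriptblock]$GetLinks,   # assumed to return the absolute URLs linked from a page
        [System.Collections.Generic.HashSet[string]]$Visited
    )
    # HashSet.Add returns $false if the URL is already recorded,
    # which is what terminates the recursion
    if (-not $Visited.Add($Url)) { return }

    foreach ($link in (& $GetLinks $Url)) {
        # Simple prefix test on the URL path; only follow links in the wanted section
        if (([uri]$link).AbsolutePath.StartsWith($PathPrefix)) {
            Invoke-SiteCrawl -Url $link -PathPrefix $PathPrefix -GetLinks $GetLinks -Visited $Visited
        }
    }
}

# A made-up three-page site with a loop back to the start and one out-of-scope link
$site = @{
    'http://example.test/food/'       = @('http://example.test/food/a.html', 'http://example.test/news/x.html')
    'http://example.test/food/a.html' = @('http://example.test/food/', 'http://example.test/food/b.html')
    'http://example.test/food/b.html' = @()
}
$visited = New-Object 'System.Collections.Generic.HashSet[string]'
Invoke-SiteCrawl -Url 'http://example.test/food/' -PathPrefix '/food' `
    -GetLinks { param($u) $site[$u] } -Visited $visited
$visited   # the three /food pages, each listed once
```

On a large site a deeply nested chain of links could hit PowerShell's call-depth limit, so a queue-based loop over the same visited set may be a safer variant.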

Invoke-WebRequest is good for this; it gets the web page content and also exposes all the links from the page in the Links property of the returned object.  Easy to enumerate through!
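Here is a minimal sketch of pulling the links out of a page this way. The href values can be relative, fragment-only or non-HTTP (mailto: and so on), so the sketch resolves them against the page URL first. Resolve-PageLinks and Get-PageLinks are hypothetical names of my own, and the example.test URLs are made up.

```powershell
# Hypothetical helper: turn raw href values into absolute http(s) URLs
function Resolve-PageLinks {
    param([uri]$BaseUri, [string[]]$Hrefs)
    foreach ($href in $Hrefs) {
        if ([string]::IsNullOrWhiteSpace($href)) { continue }
        $resolved = $null
        if ([uri]::TryCreate($BaseUri, $href, [ref]$resolved)) {
            if ($resolved.Scheme -in 'http', 'https') {
                # Drop any #fragment so the same page is not recorded twice
                $resolved.GetLeftPart([System.UriPartial]::Query)
            }
        }
    }
}

# Hypothetical wrapper: fetch a page and return its links as absolute URLs
function Get-PageLinks {
    param([string]$Url)
    $response = Invoke-WebRequest -Uri $Url -UseBasicParsing
    Resolve-PageLinks -BaseUri $Url -Hrefs ($response.Links | ForEach-Object { $_.href })
}

# Offline usage of the resolver with made-up hrefs
Resolve-PageLinks -BaseUri 'http://example.test/food/' `
    -Hrefs @('a.html', '/news/x.html', '#top', 'mailto:someone@example.test')
```

Get-PageLinks needs network access, so only the resolver is exercised in the usage line.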

The script and detail are after the break.

Edit:  I had another pass at this script and optimised it a bit here.

Powershell Archiving Script, Part 7

We’re on the home straight now.  In the last part we dealt with the last major workhorse of the script (actually moving objects to and from the archive with Move-ArchiveObject) and in this part we deal with some of the formatting / presentation functions.  The full script can be found here.


Powershell Archiving Script, Part 6

In the last part we looked at the Get-FolderInformation function, which returns an object describing a folder so the user can tell whether they want to process it.  This part focuses on the Move-ArchiveObject function, which actually performs the archive (or return-from-archive) process on a chosen folder. The full script can be found here.


Powershell Archiving Script, Part 5

Continuing on from the last part, where I defined the Get-FolderSize function, the next function to be defined is Get-FolderInformation. This gets the relevant information about all the folders that could be processed and outputs that information as objects.  The full script can be found here.


Powershell Archiving Script, Part 4

In the last part we wrote the PowerShell to handle the parameters (including some crude validation) and the skeleton of the rest of the script. In the next few parts we’ll define the functions the script will use.  The full script can be found here.
