PowerShell: Archiving All The Web-Pages on A Site (Example: bbc.co.uk/food)

With the outcry caused by the BBC removing the food section from its website, and the rush of people trying to mirror it or download the data, I thought: just how would you do that with PowerShell?

(Of course, if you’re not interested in re-inventing the wheel, Wget does this much better.)

After a bit of thought I came up with the following requirements:

  • It would need to call itself recursively, following all the links on each page.
  • The links should be filtered so I only get the pages under a particular sub-path (in the BBC example we only want the /food ones).
  • It should download the pages and try to keep a representation of the hierarchy (so /food/recipes/cucumber.html is saved to \food\recipes\cucumber.html on the disk).
  • I’m not interested in fixing the links yet; as long as we get a copy it should be fine.
  • We need some way to terminate the recursion so it doesn’t keep processing the same pages: each page should be processed only once.
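Two of these requirements (filtering by sub-path and mirroring the hierarchy on disk) can be sketched as small helper functions. This is only a rough sketch: the function names `Test-InScope` and `ConvertTo-LocalPath`, the `www.bbc.co.uk` host name and the `index.html` default are my own assumptions for illustration, not taken from the actual script.

```powershell
# Hypothetical helpers; the names are illustrative only.

function Test-InScope {
    param(
        [Parameter(Mandatory)] [uri]    $Uri,
        [Parameter(Mandatory)] [string] $HostName,
        [Parameter(Mandatory)] [string] $PathPrefix
    )
    # Same host, and the path starts with the wanted prefix (e.g. /food).
    ($Uri.Host -eq $HostName) -and
        $Uri.AbsolutePath.StartsWith($PathPrefix, [StringComparison]::OrdinalIgnoreCase)
}

function ConvertTo-LocalPath {
    param(
        [Parameter(Mandatory)] [uri]    $Uri,
        [Parameter(Mandatory)] [string] $Root
    )
    # Drop the leading slash so the URL path becomes relative to $Root.
    $relative = [uri]::UnescapeDataString($Uri.AbsolutePath).TrimStart('/')
    # Folder-style URLs have no file name, so assume index.html.
    if (-not $relative -or $relative.EndsWith('/')) { $relative += 'index.html' }
    # Swap URL separators for the platform's directory separator.
    $relative = $relative.Replace('/', [string][IO.Path]::DirectorySeparatorChar)
    Join-Path -Path $Root -ChildPath $relative
}

# Example calls (assumed URLs):
Test-InScope -Uri 'http://www.bbc.co.uk/food/recipes' -HostName 'www.bbc.co.uk' -PathPrefix '/food'
ConvertTo-LocalPath -Uri 'http://www.bbc.co.uk/food/recipes/cucumber.html' -Root 'C:\Archive'
```

Note that a plain prefix test like this would also accept a path such as /foodie, and a URL such as /food with no trailing slash would be saved as a file called food, which would clash with a folder of the same name. A real script would need to decide how to handle both.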

So a rough loop would be: go to a web page, follow all the links matching the sub-path we want, save each page to disk and record its name in a list to make sure we don’t visit it again.  You’d end up with a nice folder full of saved copies of the site’s pages.
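As a rough sketch of that loop, the version below keeps a set of visited URLs and a queue of pages still to process. It uses a queue rather than true recursion, which is my substitution and not necessarily how the full script does it. The function name `Get-CrawlList` and the `-GetLinks` script block (which stands in for the real download step and must return absolute URLs) are hypothetical, and the `::new()` syntax assumes PowerShell 5 or later.

```powershell
function Get-CrawlList {
    param(
        [Parameter(Mandatory)] [string]      $StartUrl,
        [Parameter(Mandatory)] [string]      $PathPrefix,
        # Hypothetical stand-in for the download step:
        # takes a URL, returns the absolute URLs linked from that page.
        [Parameter(Mandatory)] [scriptblock] $GetLinks
    )
    $visited = [System.Collections.Generic.HashSet[string]]::new()
    $queue   = [System.Collections.Generic.Queue[string]]::new()
    $queue.Enqueue($StartUrl)

    while ($queue.Count -gt 0) {
        $url = $queue.Dequeue()
        # Add() returns $false if the URL was already recorded, so each page runs once.
        if (-not $visited.Add($url)) { continue }

        # A real script would save the page to disk at this point.
        foreach ($link in (& $GetLinks $url)) {
            $inScope = ([uri]$link).AbsolutePath.StartsWith($PathPrefix)
            if ($inScope -and -not $visited.Contains($link)) {
                $queue.Enqueue($link)
            }
        }
    }
    # Emit every URL that was processed.
    $visited
}

# Offline usage with a made-up three-page site:
$fakeSite = @{
    'http://example.test/food'   = @('http://example.test/food/a', 'http://example.test/news')
    'http://example.test/food/a' = @('http://example.test/food', 'http://example.test/food/b')
    'http://example.test/food/b' = @()
}
Get-CrawlList -StartUrl 'http://example.test/food' -PathPrefix '/food' -GetLinks { param($u) $fakeSite[$u] }
```

The loop stops on its own once every in-scope link has been seen, because nothing new gets queued after that.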

Invoke-WebRequest is good for this; it gets the web page content but also puts all the links from the page in a handy property of the returned object (Links).  Easy to enumerate through!
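A minimal sketch of enumerating those links might look like this. The helper name `Get-PageLinks` is hypothetical; it only relies on the response having a `Links` collection whose items carry an `href`, and it resolves relative links against an assumed base URL. It does no error handling for malformed hrefs.

```powershell
function Get-PageLinks {
    param(
        [Parameter(Mandatory)] $Response,   # object returned by Invoke-WebRequest
        [Parameter(Mandatory)] [uri] $BaseUri
    )
    $Response.Links |
        Where-Object { $_.href } |                                            # skip anchors with no href
        ForEach-Object { [uri]::new($BaseUri, [string]$_.href).AbsoluteUri } | # make relative links absolute
        Select-Object -Unique                                                  # drop repeated links
}

# Usage (needs network access; the URL is only an example):
# $page = Invoke-WebRequest -Uri 'http://www.bbc.co.uk/food' -UseBasicParsing
# Get-PageLinks -Response $page -BaseUri 'http://www.bbc.co.uk/food'
```

The -UseBasicParsing switch avoids the Internet Explorer based parser on Windows PowerShell; the Links property is still populated with it.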

The script and detail are after the break.

Edit:  I had another pass at this script and optimised it a bit here.

PowerShell: Copy Directory Structure and a Random Sample of Files from Each Directory

I got my wife a new digital picture frame as a present.  It looks cool but the attached storage options (USB or card) aren’t big enough to take all our digital photos.

The decision about which pictures to include relies on either organisational skills OR an artistic eye, neither of which I have.

So what about making it strictly random?  Copying the entire directory structure but only a random sample of the files in each folder?
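A rough sketch of the idea, assuming a hypothetical function name `Copy-RandomSample` and a default of five files per folder (both my own choices for illustration, not the script from the full post):

```powershell
function Copy-RandomSample {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,
        [int] $Count = 5   # assumed sample size per folder
    )
    $sourceRoot = (Get-Item -LiteralPath $Source).FullName.TrimEnd('\', '/')

    # The root folder itself plus every sub-folder beneath it.
    $folders = @(Get-Item -LiteralPath $sourceRoot) +
               @(Get-ChildItem -LiteralPath $sourceRoot -Directory -Recurse)

    foreach ($folder in $folders) {
        # Recreate the folder under the destination, even if it ends up empty.
        $relative = $folder.FullName.Substring($sourceRoot.Length).TrimStart('\', '/')
        $target   = if ($relative) { Join-Path $Destination $relative } else { $Destination }
        New-Item -ItemType Directory -Path $target -Force | Out-Null

        # Pick up to $Count files at random from this folder only.
        $files = @(Get-ChildItem -LiteralPath $folder.FullName -File)
        if ($files.Count -gt 0) {
            $files | Get-Random -Count ([Math]::Min($Count, $files.Count)) |
                Copy-Item -Destination $target
        }
    }
}

# Usage (paths are examples only):
# Copy-RandomSample -Source 'C:\Photos' -Destination 'E:\Frame' -Count 10
```

Sampling per folder rather than across the whole tree means every folder is represented, however small it is.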

That I CAN do 🙂

PowerShell: Synchronizing a Folder (and Sub-Folders)

This is the latest version of my PowerShell folder synchronisation script.

I’ve revisited this script a few times with new additions and modifications.  If you want explanations of how the script was developed over time or how parts of it work, have a look at the following posts:

Part 1 covers the basic script to sync two folders.

Part 2 adds the ability to specify exceptions that the sync process will skip.

In part 3 I added an XML configuration file so more complicated processes can be run.

I customised and validated parameter input (parameter sets) and added some proper error checking in part 4.

The script was updated to deal with non-standard paths and to output objects reporting on the changes it made in part 5.

In part 6 I added more use of literal paths (to prevent path errors) and generated statistics.

I added a filter option, turned strict-mode on and optimised the statistics generation in part 7 (plus fixed a bug!).

In part 8 I’ve added some validity checks on the XML configuration file.

WhatIf functionality was added in Part 9 plus some corrections to make sure Filter and Exceptions worked correctly.

In Part 10 I added text file logging.

In Part 11 I fixed an odd bug where $Nulls were being added to arrays and added a -PassThru switch.

Part 12 covers some behaviour fixes and adds support for multiple target folders (so the source folders can be synced to multiple locations).

Part 13 added more control over what gets logged and how, added a timestamp to the log filename (creating a new log on each run), and included some code cleanup and bug fixes.

A zipped version of the script is here;
