PowerShell: Archiving All The Web-Pages on A Site (Example: bbc.co.uk/food)

With the outcry caused by the BBC removing the food section from their website, and the rush of people trying to mirror it or download the data, I wondered: just how would you do that with PowerShell?

(Of course, if you’re not interested in re-inventing the wheel, Wget does this much better.)

After a bit of thought I came up with the following requirements:

  • It would need to call itself recursively, following all the links on each page.
  • The links should be filtered so I only get the pages under a particular sub page (in the BBC example we only want the /food ones).
  • It should download the pages and try to keep a representation of the hierarchy (so /food/recipes/cucumber.html is saved to \food\recipes\cucumber.html on the disk).
  • I’m not interested in fixing the links yet; as long as we get a copy it should be fine.
  • We need some way to terminate the recursion so it doesn’t keep processing the same pages.  It also needs to process each page only once.

So a vague loop would be: go to a web page, follow all the links matching the sub page we want, write each page to disk and record the page name in a list to make sure we don’t visit it again.  You’d end up with a nice folder full of file versions of the website.

Invoke-WebRequest is good for this; it gets the web page content but also puts all the links from the page in a handy property of the object.  Easy to enumerate through!
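
As a rough illustration of the filtering step, the sketch below applies a -like wildcard to a hand-written list of href values standing in for the links property of the Invoke-WebRequest result. The sample links and the $Hrefs and $Matching names are made up for the example, and no web request is made.

```powershell
# Made-up href values standing in for (Invoke-WebRequest $Page).Links.href
$Hrefs = @('/food/recipes', '/news', '/food/chefs', 'http://example.com/food')

# Same style of wildcard filter as the script: keep links that start with the sub link
$SubLink = '/food'
$Filter = "$SubLink*"

# @() keeps the result an array even when only one link matches
$Matching = @($Hrefs | Where-Object { $_ -like $Filter })
$Matching
```

Only the two relative /food links survive; the absolute link is dropped because it does not start with /food, which is worth bearing in mind for sites that use full URLs in their links.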

The script and detail are after the break.

Edit:  I had another pass at this script and optimised it a bit here.

The full script comes first; a description of each part follows.

[cmdletbinding()]
param() # [cmdletbinding()] is only valid when a param block follows it
$OutputFolder="D:\Temp\Web Scrape\"
$MaxDepth=5
function Get-WebSubPage
{
    [cmdletbinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        $BaseURL,
        [Parameter(Mandatory=$true)]
        $SubLink,
        [Parameter(Mandatory=$true)]
        $CurrentResult,
        [Parameter(Mandatory=$true)]
        $CurrentDepth
    )

    $Result=@()
    $Filter="$SubLink*"
    $Page=$BaseURL+$SubLink
    Write-Verbose "Current Page: $Page"
    $Web=Invoke-WebRequest $Page
    $OutFile=$OutputFolder + ($SubLink -replace ("[\.\:]","") -replace "//","" -Replace "/","\") + ".html"
    $OutFile=$OutFile -replace "\\.html",".html" -replace "\\\\","\"
    Write-Verbose "Output to: $OutFile"
    if (Test-Path $OutFile -IsValid)
    {
        New-Item $OutFile -Force -ItemType File > $null
        $Web.Content | Out-File $OutFile -Force
    }
    if ($CurrentDepth -lt $MaxDepth)
    {

        Foreach ($Link in $Web.links.href)
        {
            If ($Link -like $Filter)
            {
                $FullLink=$BaseURL+$Link
                If (!($CurrentResult -contains $FullLink))
                {
                    $Result+=Get-WebSubPage -BaseURL:$BaseURL -SubLink:$Link  -CurrentResult:$Result -CurrentDepth:($CurrentDepth+1)
                }
            }
        }
    }else
    {
        Write-Verbose "Depth Reached Maximum:  Not processing links."
        Return
    }
    # Parentheses make this an array concatenation rather than a single expanded string
    Write-Output ($CurrentResult+$Result)
}

$OutputFolder="c:\temp\Web Scrape\"
$MaxDepth=3

These define where we’re going to write the exported web pages and the maximum depth of recursion.  Some of the links on a page may lead to pages that link back to the first page, for example.  Limiting the depth of the recursion should stop the script looping through them indefinitely.

function Get-WebSubPage
{
    [cmdletbinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        $BaseURL,
        [Parameter(Mandatory=$true)]
        $SubLink,
        [Parameter(Mandatory=$true)]
        $CurrentResult,
        [Parameter(Mandatory=$true)]
        $CurrentDepth
    )

These are the parameters for the recursive function.  It takes the base URL and the current sub link we’re looking at (like http://bbc.co.uk and /food), $CurrentResult (a list of all the links we’ve already visited) and the current recursion depth.

    $Result=@()
    $Filter="$SubLink*"
    $Page=$BaseURL+$SubLink
    Write-Verbose "Current Page: $Page"
    $Web=Invoke-WebRequest $Page
    $OutFile=$OutputFolder + ($SubLink -replace ("[\.\:]","") -replace "//","" -Replace "/","\") + ".html"
    $OutFile=$OutFile -replace "\\.html",".html" -replace "\\\\","\"

$Result holds the links we’ve found so far and $Filter is used to remove any links that don’t start with the current sub link (we don’t want ALL the links from all the pages on a website).

The current $Page to download is built and the $Web object is grabbed via Invoke-WebRequest.

Last, we build the path to write the web page to.  This is a pair of commands with multiple replace statements: the first removes “.” and “:” from the sub link, removes “//” and replaces “/” with “\”; the second replaces “\.html” with “.html” and “\\” with “\”.  This should make it a valid full path (using the link as a subdirectory).
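
To see what the replace chain produces without downloading anything, here is a small sketch that wraps the same two statements in a helper. ConvertTo-LocalPath is a hypothetical name used only for this illustration, and the sample sub links are assumptions.

```powershell
# Hypothetical helper wrapping the script's replace chain so it can be tried offline
function ConvertTo-LocalPath
{
    param
    (
        [string]$OutputFolder,
        [string]$SubLink
    )
    # Strip "." and ":", drop "//", then turn "/" into "\" and append the extension
    $OutFile = $OutputFolder + ($SubLink -replace "[\.\:]","" -replace "//","" -replace "/","\") + ".html"
    # Tidy a trailing "\.html" and collapse doubled backslashes
    $OutFile -replace "\\.html",".html" -replace "\\\\","\"
}

ConvertTo-LocalPath -OutputFolder "D:\Temp\Web Scrape\" -SubLink "/food"
ConvertTo-LocalPath -OutputFolder "D:\Temp\Web Scrape\" -SubLink "/food/recipes/cucumber.html"
```

The first call gives D:\Temp\Web Scrape\food.html.  The second gives D:\Temp\Web Scrape\food\recipes\cucumberhtml.html, because the dot in the link is stripped before the extension is appended.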

    if (Test-Path $OutFile -IsValid)
    {
        New-Item $OutFile -Force -ItemType File > $null
        $Web.Content | Out-File $OutFile -Force
    }

Before writing out the content we check the path is syntactically valid and then force-create a New-Item.  This creates all the folders in the path if they don’t already exist.  The Out-File of the web content then overwrites the empty file with -Force.
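
The folder-creating behaviour can be tried on its own with a throwaway path.  This is only a sketch: the $Root folder under the temp directory and the sample HTML string are assumptions standing in for $OutputFolder and $Web.Content, and -Encoding utf8 is an optional extra that is not in the original script.

```powershell
# Hypothetical scratch location under the temp folder, so nothing real is overwritten
$Root = Join-Path ([System.IO.Path]::GetTempPath()) ("WebScrapeDemo_" + [guid]::NewGuid().ToString("N"))
$OutFile = Join-Path (Join-Path (Join-Path $Root "food") "recipes") "cucumber.html"

if (Test-Path $OutFile -IsValid)
{
    # -Force creates the missing parent folders along with an empty file
    New-Item $OutFile -Force -ItemType File > $null
    # Stand-in for $Web.Content
    "<html><body>demo</body></html>" | Out-File $OutFile -Force -Encoding utf8
}

# Check the folders and the file content now exist
Test-Path (Split-Path $OutFile -Parent)
Get-Content $OutFile
```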

    if ($CurrentDepth -lt $MaxDepth)
    {

        Foreach ($Link in $Web.links.href)
        {
            If ($Link -like $Filter)
            {
                $FullLink=$BaseURL+$Link
                If (!($CurrentResult -contains $FullLink))
                {
                    $Result+=Get-WebSubPage -BaseURL:$BaseURL -SubLink:$Link  -CurrentResult:$Result -CurrentDepth:($CurrentDepth+1)
                }
            }
        }
    }else
    {
        Write-Verbose "Depth Reached Maximum:  Not processing links."
        Return
    }
    # Parentheses make this an array concatenation rather than a single expanded string
    Write-Output ($CurrentResult+$Result)

This is the meat of the script.  It checks we’ve not reached the maximum recursion depth first and then checks all the links on the current page.  If they match the filter AND we’ve not visited this web page before (discovered by checking the contents of $CurrentResult) then we call the function again, this time pointed at the page referred to by the link.  The result of that function is added to $Result.

The last thing that happens is that the function returns the $CurrentResult + all the results from running the function against the sub links.
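
For comparison, here is a minimal sketch of one other way to make sure each page is processed once: a shared HashSet of visited pages.  It is not the approach the script above takes.  $FakeSite is a made-up in-memory stand-in for the website so the recursion can be followed without any downloads, Get-FakeSubPage is a hypothetical name, and the filter is fixed for the whole walk rather than rebuilt from each sub link.

```powershell
# Made-up "site": each key is a page, each value is the list of links found on it
$FakeSite = @{
    '/food'                  = @('/food/recipes', '/news', '/food')
    '/food/recipes'          = @('/food', '/food/recipes/cucumber')
    '/food/recipes/cucumber' = @('/food/recipes')
}

function Get-FakeSubPage
{
    param
    (
        [string]$SubLink,
        [string]$Filter,
        [System.Collections.Generic.HashSet[string]]$Visited,
        [int]$CurrentDepth = 1,
        [int]$MaxDepth = 5
    )
    # Add() returns $false when the page is already in the set, so each page is handled once
    if (-not $Visited.Add($SubLink)) { return }
    # Stop following links once the depth limit is reached
    if ($CurrentDepth -ge $MaxDepth) { return }
    foreach ($Link in $FakeSite[$SubLink])
    {
        if ($Link -like $Filter)
        {
            Get-FakeSubPage -SubLink $Link -Filter $Filter -Visited $Visited -CurrentDepth ($CurrentDepth + 1) -MaxDepth $MaxDepth
        }
    }
}

$Visited = New-Object 'System.Collections.Generic.HashSet[string]'
Get-FakeSubPage -SubLink '/food' -Filter '/food*' -Visited $Visited
$Visited | Sort-Object
```

Because the set is passed by reference, every level of the recursion sees pages recorded by the others, so the links that point back to /food do not trigger another visit.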

You could import the function with . [Path To File].ps1 (. c:\temp\webscript.ps1 for example).
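
If dot-sourcing is unfamiliar, the sketch below shows the idea with a throwaway file.  The file name and the Get-DemoGreeting function are made up for the example, and it assumes the execution policy on your machine allows local scripts to run.

```powershell
# Hypothetical demo file written to the temp folder; it stands in for the saved script
$ScriptPath = Join-Path ([System.IO.Path]::GetTempPath()) "dot-source-demo.ps1"
Set-Content -Path $ScriptPath -Value 'function Get-DemoGreeting { "hello from the dot-sourced file" }'

# The leading dot runs the file in the current scope, so its functions stay available afterwards
. $ScriptPath
Get-DemoGreeting
```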

To call the function you’d use something like the following:

Get-WebSubPage -BaseURL:"http://bbc.co.uk" -SubLink:"/food" -CurrentResult:@() -CurrentDepth:1 -Verbose
