PowerShell: Optimising a Script’s Performance (Archiving Web-Pages Script)

Hi.  I wrote and ran a quick script to crawl a web-site and archive all its web-pages.  It worked, but I noticed some serious performance issues:  its memory usage would occasionally balloon up to 19 GB (!), it would slow down from time to time, and it just generally seemed inefficient.

So I had a think and came up with a better script with a few simple changes;  it stays at about 500 MB of memory usage now and runs much better 🙂  Here’s the updated script;  I’ll go over the optimisations afterwards.

[cmdletbinding()]
param()
$OutputFolder="D:\Temp\Web Scrape\"
$MaxDepth=4
$Global:CheckedPages=@()
function Get-WebSubPage
{
    [cmdletbinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        $BaseURL,
        [Parameter(Mandatory=$true)]
        $SubLink,
        [Parameter(Mandatory=$true)]
        $CurrentDepth
    )
    $Result=@()
    $Filter="$SubLink*"
    $Page=$BaseURL+$SubLink
    Write-Verbose "Current Page: $Page Result Number: $($CheckedPages.Count) Depth: $CurrentDepth"
    $Web=Invoke-WebRequest $Page
    $OutFile=$OutputFolder + ($SubLink -replace ("[\.\:]","") -replace "//","" -Replace "/","\") + ".html"
    $OutFile=$OutFile -replace "\\.html",".html" -replace "\\\\","\"
    Write-Verbose "Output to: $OutFile"
    if (Test-Path $OutFile -IsValid)
    {
        New-Item $OutFile -Force -ItemType File > $null
        $Web.Content | Out-File $OutFile -Force
    }
    $AllLinks=$Web.links.href
    $Web=$OutFile=$Null
    if ($CurrentDepth -lt $MaxDepth)
    {

        Foreach ($Link in $AllLinks)
        {
            If ($Link -like $Filter)
            {
                $FullLink=$BaseURL+$Link
                If (!($CheckedPages -contains $FullLink))
                {
                    Get-WebSubPage -BaseURL:$BaseURL -SubLink:$Link -CurrentDepth:($CurrentDepth+1)           
                }
            }
        }
    }else
    {
        Write-Verbose "Depth Reached Maximum:  Not processing links."
    }
    $Global:CheckedPages+=$Page
}

Get-WebSubPage -BaseURL:"http://bbc.co.uk" -SubLink:"/food" -CurrentDepth:1 -Verbose

Here’s the breakdown:

$Global:CheckedPages=@()
function Get-WebSubPage
{
    [cmdletbinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        $BaseURL,
        [Parameter(Mandatory=$true)]
        $SubLink,
        [Parameter(Mandatory=$true)]
        $CurrentDepth

First I used a global variable to hold the list of all web-links we’ve already checked. A global variable can be updated from anywhere in the script (rather than just in its own scope).

This means all those temporary variables holding the results in each function call are gone;  as the function calls itself recursively, that saves a lot of memory!

Because we’re using a global I don’t need to pass $CurrentResult as a parameter any more so it’s gone.
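One further tweak worth considering, as a rough sketch rather than anything from the script above: `+=` on a PowerShell array builds a new array every time, and `-contains` scans the whole list, so both get slower as the list of checked pages grows. A .NET HashSet avoids both. The names `$Visited` and `Test-FirstVisit` are made up for this illustration.

```powershell
# Sketch only: $Visited and Test-FirstVisit are hypothetical names,
# not part of the original script.
$Visited = New-Object 'System.Collections.Generic.HashSet[string]'

function Test-FirstVisit
{
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$Url
    )
    # HashSet.Add returns $true only when the value was not already present,
    # so a single call both checks for and records the page.
    return $Visited.Add($Url)
}

Test-FirstVisit -Url "http://bbc.co.uk/food"    # True  (first time seen)
Test-FirstVisit -Url "http://bbc.co.uk/food"    # False (already recorded)
```

In the crawler this check would sit where the `-contains` test is now. Recording a page before following its links, rather than after, would also stop a sub-page that links back to its parent from triggering a second crawl of the parent.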

    $AllLinks=$Web.links.href
    $Web=$OutFile=$Null

The variable holding all the detail on the web page is pretty big.  Once we’ve exported the actual content we only need the page links.  So I put those in a new variable and blank the rest.
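The same idea can be packaged as a small helper, sketched here under some assumptions: `Save-PageAndGetLinks` is a hypothetical name, and `$FakeResponse` is a stand-in for what `Invoke-WebRequest` returns, so the example runs without touching the network.

```powershell
# Sketch only: Save-PageAndGetLinks is a hypothetical helper and
# $FakeResponse imitates the object returned by Invoke-WebRequest.
function Save-PageAndGetLinks
{
    param
    (
        [Parameter(Mandatory=$true)]
        $Response,
        [Parameter(Mandatory=$true)]
        [string]$OutFile
    )
    # Write the page content to disk.
    $Response.Content | Out-File $OutFile -Force
    # Hand back only the href strings; the caller can then drop the response.
    return @($Response.Links.href)
}

$FakeResponse = [pscustomobject]@{
    Content = "<html>...</html>"
    Links   = @(
        [pscustomobject]@{ href = "/food/recipes" },
        [pscustomobject]@{ href = "/food/seasons" }
    )
}
$TempFile = [System.IO.Path]::GetTempFileName()
$AllLinks = Save-PageAndGetLinks -Response $FakeResponse -OutFile $TempFile
$FakeResponse = $null    # release the large object; only the links remain
$AllLinks
Remove-Item $TempFile
```

Only the small list of link strings survives into the recursive calls, which is what keeps each level of recursion cheap.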

Get-WebSubPage -BaseURL:$BaseURL -SubLink:$Link -CurrentDepth:($CurrentDepth+1)

Removed the CurrentResult parameter as above.

    $Global:CheckedPages+=$Page
    $FullLink=$Page=$AllLinks=$Null

Updated the global variable with the current page we’re processing and then blank everything.

Watching the script run, it was interesting that the memory usage varied so much, and also that the memory took a while to be returned to the system (even after I ended the script).  This is down to garbage collection in .NET, which PowerShell runs on:  an automatic process that frees memory held by objects that are no longer referenced, at times the runtime chooses.

You can force this to occur with:

[System.GC]::Collect()

But it seems to be something you don’t want to mess with unless you know what you’re doing, so after a bit of tinkering to see how it worked I removed it.  🙂
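If you just want to watch what the garbage collector is doing without leaving a forced collection in the script, something like the following sketch may help. It uses `[System.GC]::GetTotalMemory`, which reports the managed memory currently allocated; the 50 MB buffer is an arbitrary stand-in for a large page object, and the figures will differ from machine to machine.

```powershell
# Sketch only: the 50 MB buffer is an arbitrary stand-in for a large object.
$Before = [System.GC]::GetTotalMemory($true)     # $true: collect first, then measure
$Big    = New-Object byte[] (50MB)               # allocate something sizeable
$During = [System.GC]::GetTotalMemory($false)    # $false: measure without collecting
$Big    = $null                                  # drop the only reference
$After  = [System.GC]::GetTotalMemory($true)     # collect again, then measure

"Before: {0:N0} MB  During: {1:N0} MB  After: {2:N0} MB" -f ($Before/1MB), ($During/1MB), ($After/1MB)
```

This is for experimenting at the console; in the crawler itself, blanking the big variables and letting the runtime collect on its own schedule was enough.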
