PowerShell : Finding Duplicate Files, Part 3 : A Resumable Script Using Workflow

In the previous part we looked at making the original script survive a restart without losing progress. There is actually a built-in PowerShell feature that provides this functionality: the workflow. If you run a workflow as a job you can pause, resume and restart it, and its progress is saved.

The syntax is pretty straightforward, but there are some strange rules about how workflows can be written which make things a little trickier.

Here is the full script (as usual, notes will follow);

workflow Find-Duplicates
{
    [CmdletBinding()]
    param
    (
        [parameter(Mandatory=$True)]
        [string]$Path
    )
    if (!(Test-Path -PathType Container $Path))
    {
        Write-Error "Invalid path specified."
        Exit
    }
    $LogFile="c:\temp\results.csv"
    $CheckpointFile="c:\temp\check.txt"
    Write-Verbose "Scanning Path : $Path"
    $Files=Get-ChildItem -File -Recurse -Path $Path | Select-Object -Property FullName,Length
    $MatchedSourceFiles=@()
    ForEach ($SourceFile in $Files)
    {
        Write-Verbose "Source : $($SourceFile.FullName)"
        [array]$MatchedSourceFiles=InlineScript
        {
            $MatchedFiles=@()
            $CurrentMatches=$Using:MatchedSourceFiles
            $MatchingFiles=$Using:Files |Where-Object {$_.Length -eq $Using:SourceFile.Length}
            Foreach ($TargetFile in $MatchingFiles)
            {
                Write-Verbose "Checking File $($TargetFile.FullName)"
                ($CurrentMatches | Select-Object -ExpandProperty File) | % {Write-Verbose "List1 : $_"}

                if (($Using:SourceFile.FullName -ne $TargetFile.FullName) -and !(($CurrentMatches |
                     Select-Object -ExpandProperty File) -contains $TargetFile.FullName))
                {
                    Write-Verbose "Matched $($Using:SourceFile.FullName) and $($TargetFile.FullName)"
                    if ((fc.exe /A $Using:SourceFile.FullName $TargetFile.FullName)  -contains "FC: no differences encountered")
                    {
                        Write-Verbose "Match found."
                        $MatchedFiles+=$TargetFile.FullName
                    }
                }
            }
            if ($MatchedFiles.Count -gt 0)
            {
                Write-Verbose "Found Matching Files.  Adding Object."
                $NewObject=[pscustomobject][ordered]@{
                    File=$Using:SourceFile.FullName
                    MatchingFiles=$MatchedFiles
                }

                $CurrentMatches+=$NewObject
            }
            $CurrentMatches
        }
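        # Save the workflow's state after each source file so a suspended or interrupted run can resume from here.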
        Checkpoint-Workflow
    }
    $MatchedSourceFiles | Export-CSV -Path $LogFile -NoTypeInformation
}

As you can see, workflows look very much like functions; they’re structured and called in much the same way.  Their ability to automatically save their state is the extra functionality we’re interested in here, but workflows also let you run multiple tasks in parallel and target multiple machines (amongst other things).  This flexibility means there are quite a few restrictions on how they’re written.  The ones that affect our script are listed below;

  • You can’t use subexpressions ("$test=$($value.name)").
  • You can’t call methods on objects ("$test.update($true)").
  • You can’t update properties of objects ("$test.name='First'").
  • You can do all of the above inside an InlineScript; a block of code that runs as one ‘unit’ (see the sketch after this list).
  • Unfortunately, by default an InlineScript cannot see variables defined outside its own scope; to refer to a variable in the parent (workflow) scope you need the $Using: scope modifier ("File=$Using:SourceFile.FullName").
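
To make that concrete, here’s a minimal sketch (separate from the script above, with made-up names) showing the split; the method call and the $Using: reference are the parts that wouldn’t be allowed directly in workflow scope:

workflow Show-InlineScriptScope
{
    param
    (
        [string]$Path
    )
    # Plain cmdlet calls with named parameters are fine in workflow scope.
    Write-Verbose -Message "Workflow received path: $Path"

    # Method calls, property updates and subexpressions are not, so wrap them in an InlineScript
    # and pull workflow variables in with $Using:.
    $UpperPath=InlineScript
    {
        $Local=$Using:Path
        $Local.ToUpper()
    }
    $UpperPath
}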

So while there are lots of new restrictions, we can avoid most of them by running our existing comparison logic within an InlineScript.  On each pass through the loop the InlineScript returns our working variable ($MatchedSourceFiles, which holds every file that has duplicates along with the path to each of its clones), updated for the file we’re currently examining.  The working variable is written to a CSV at the end ("$MatchedSourceFiles | Export-CSV -Path $LogFile -NoTypeInformation").

The only other major change was adding the $Using: prefix to variables within the InlineScript that refer to variables defined outside of it.

Lastly, I’ve added a Checkpoint-Workflow command at the end of the main loop (the one examining each file in the folder).  This manually saves the state of the workflow, so if it’s interrupted and resumed we don’t have to start again!
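
To actually make use of those checkpoints, run the workflow as a job (as mentioned at the start); every workflow gets the -AsJob and -JobName parameters automatically.  Roughly, and with an illustrative job name:

# Start the workflow as a job; its state, including each checkpoint, is persisted.
Find-Duplicates -Path 'D:\temp\New folder' -AsJob -JobName FindDupes

# Pause it part-way through; the most recent Checkpoint-Workflow state is kept.
Suspend-Job -Name FindDupes

# ...later, resume from the last checkpoint rather than the beginning...
Resume-Job -Name FindDupes

# Wait for it to finish; the results end up in c:\temp\results.csv as before.
Wait-Job -Name FindDupes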

2 thoughts on “PowerShell : Finding Duplicate Files, Part 3 : A Resumable Script Using Workflow”

  1. I’ve tried your script. The first one works great, writing the info to the screen, but with this one the files found show in the first column while the Matching Files column only shows “System.Object[]”.

    1. Hi! I’ve just tried re-running the script in the 3rd part and it seems to run ok. Here’s what I did;

      1) Loaded the script into PowerShell.
      2) Ran the workflow; Find-Duplicates ‘D:\temp\New folder’
      3) And this is the output in c:\temp\results.csv (there were some extra columns too);

      D:\temp\New folder\IMG_0161.JPG D:\temp\New folder\IMG_0162.JPG

      Things I can suggest;

      Start with no results file; maybe it’s appending incorrectly?
      I’m using PS 5 though I wrote this on PS 3.

      What you’re seeing normally happens when the export tries to write a complex object into a single plain-text field, rather than writing each field of the object into its own column.
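
      One way around that (a sketch, not something the script above does) is to flatten the array into one delimited string before exporting, e.g. by moving the export into an InlineScript and using a calculated property:

      InlineScript
      {
          $Using:MatchedSourceFiles |
              Select-Object -Property File,
                  @{Name='MatchingFiles';Expression={$_.MatchingFiles -join '; '}} |
              Export-CSV -Path $Using:LogFile -NoTypeInformation
      }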

      Thanks!
