PowerShell : Finding Duplicate Files, Part 1 : The Basic Script

We have a lot of photos, music and files.  Normally we copy the files up to some network storage (so they’re backed up) and later on we come back and name and sort everything.  But then some chaos happens;  maybe we get distracted and end up copying the files twice.  Or maybe we’ve been away, backed the files up to another device and then copied both sets over later.  Or a set of files was copied to another computer and some of them (but not all) were modified before being copied back.

The upshot is that there’s precious NAS storage being wasted.  But how to find where the duplicate files are?  Sounds like an excuse to put on my scripting hat!

After writing this I did some more work on the script, so there are now two more parts.  This post covers the basic script, part 2 adds code to allow resuming the script after a restart (or crash) and part 3 does the same thing using PowerShell workflows.

There’s no elegant way to go about it other than comparing each file to each other file and seeing if they’re identical.

But first we need to define ‘identical’.  We could just use ‘the same size’ as our criterion, which might be good enough, but as we’re talking about tens of thousands of files there could well be different files that are the same size.

There are also some existing PowerShell libraries that allow you to get file hashes for comparison, and PowerShell 4.0 and later have a Get-FileHash cmdlet built in, but I want to stick with what’s already in the OS so I’ll use the built-in fc.exe file-compare application.
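For comparison, the hashing approach mentioned above might look something like the sketch below.  It is not what this post’s script does;  it assumes PowerShell 4.0 or later (so that Get-FileHash is available), and the function name Find-DuplicateByHash is made up for illustration.

```powershell
# Hypothetical helper, not part of the script in this post.
# Assumes PowerShell 4.0+ so that Get-FileHash is available.
function Find-DuplicateByHash
{
    param
    (
        [parameter(Mandatory=$True)]
        [string]$Path
    )
    Get-ChildItem -File -Recurse -Path $Path |
        Get-FileHash -Algorithm SHA256 |      # one hash per file
        Group-Object -Property Hash |         # files with equal hashes land in one group
        Where-Object {$_.Count -gt 1} |       # keep only hashes shared by 2+ files
        ForEach-Object {
            [pscustomobject]@{
                Hash=$_.Name
                Files=@($_.Group | Select-Object -ExpandProperty Path)
            }
        }
}
```

Hashing reads every file once, whereas pairwise comparison may read the same file many times, so this is a different trade-off rather than a drop-in replacement.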

Here’s the code.  It takes a valid folder path, grabs all the files then steps through each one comparing it to all the others.  Any matches are output in custom objects.

A breakdown and explanation of the individual components follows the full script below:

[CmdletBinding()]
param
(
    [parameter(Mandatory=$True)]
    [string]$Path
)
if (!(Test-Path -PathType Container $Path))
{
    Write-Error "Invalid path specified."
    Exit
}
Write-Verbose "Scanning Path : $Path"
$Files=gci -File -Recurse -path $Path | Select-Object -property FullName,Length
$Count=1
$TotalFiles=$Files.Count
$MatchedSourceFiles=@()
ForEach ($SourceFile in $Files)
{
    Write-Progress -Activity "Processing Files" -status "Processing File $Count / $TotalFiles" -PercentComplete ($Count / $TotalFiles * 100)
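    # Confirmed duplicates of the current source file (full paths).
    $MatchingFiles=@()
    # Candidates: files the same size as the source file.  Kept in a separate
    # variable so the results list above starts empty for every source file.
    $Candidates=$Files |Where-Object {$_.Length -eq $SourceFile.Length}
    Foreach ($TargetFile in $Candidates)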
    {
        if (($SourceFile.FullName -ne $TargetFile.FullName) -and !(($MatchedSourceFiles |
             Select-Object -ExpandProperty File) -contains $TargetFile.FullName))
        {
            Write-Verbose "Matching $($SourceFile.FullName) and $($TargetFile.FullName)"
            Write-Verbose "File sizes match."
            if ((fc.exe /A $SourceFile.FullName $TargetFile.FullName)  -contains "FC: no differences encountered")
            {
                Write-Verbose "Match found."
                $MatchingFiles+=$TargetFile.FullName
            }
        }
    }
    if ($MatchingFiles.Count -gt 0)
    {
        $NewObject=[pscustomobject][ordered]@{
            File=$SourceFile.FullName
            MatchingFiles=$MatchingFiles
        }
        $MatchedSourceFiles+=$NewObject
    }
    $Count+=1
}
$MatchedSourceFiles

Here are the main components and some explanation of each;

[CmdletBinding()]
param
(
    [parameter(Mandatory=$True)]
    [string]$Path
)
if (!(Test-Path -PathType Container $Path))
{
    Write-Error "Invalid path specified."
    Exit
}
Write-Verbose "Scanning Path : $Path"
$Files=gci -File -Recurse -path $Path | Select-Object -property FullName,Length
$Count=1
$TotalFiles=$Files.Count
$MatchedSourceFiles=@()

We only need a valid folder path to be passed to the script (which is mandatory).  From that we use Get-ChildItem (gci) to recursively get all the files (skipping the folders with the -File option).  These are stored in $Files, but as we don’t need all the details about each file, only the size and full file path are kept, to save space.

$MatchedSourceFiles will hold a collection of all the files we find which have any duplicates;  the collection will be made up of custom objects which store both the file and a list of its matching partners.

$Count is used to count how many files we’ve checked.  We can use it with $TotalFiles to calculate a percentage progress for a progress bar.


ForEach ($SourceFile in $Files)
{
    Write-Progress -Activity "Processing Files" -status "Processing File $Count / $TotalFiles" -PercentComplete ($Count / $TotalFiles * 100)
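    # Confirmed duplicates of the current source file (full paths).
    $MatchingFiles=@()
    # Candidates: files the same size as the source file.
    $Candidates=$Files |Where-Object {$_.Length -eq $SourceFile.Length}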

Here we start enumerating through all the files in $Files and comparing each to every other one.  We don’t compare each pair of files with fc.exe straight away though.  Instead a filter is used to get a collection of files that match the size of the current file we’re examining ($SourceFile).

This is because I guessed fc.exe would be a lot slower at comparing files than PowerShell’s built-in filters.  This way we can use a fast filter to reduce the number of slow fc.exe comparisons we do.
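To illustrate why the size filter helps:  a file whose size is unique cannot have a duplicate, so it never needs a full comparison.  The sketch below is a variation on the idea rather than the code used in the script;  it groups by Length once instead of filtering inside the loop, and $SampleFiles is made-up data standing in for $Files.

```powershell
# Made-up stand-in for $Files (FullName and Length only).
$SampleFiles=@(
    [pscustomobject]@{FullName='C:\temp\a.jpg'; Length=1024}
    [pscustomobject]@{FullName='C:\temp\b.jpg'; Length=1024}
    [pscustomobject]@{FullName='C:\temp\c.mp3'; Length=2048}
)
# Group by size and keep only the sizes shared by two or more files.
$SizeGroups=@($SampleFiles | Group-Object -Property Length | Where-Object {$_.Count -gt 1})
# Only the files in these groups are worth passing to fc.exe.
$Candidates=@($SizeGroups | ForEach-Object {$_.Group})
$Candidates | Select-Object -ExpandProperty FullName
```

Here c.mp3 is the only file of its size, so it drops out before any slow comparison would be attempted.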

But how much faster is it?  While filtering may be quicker, we do add a few extra commands to the code;   is it worth it?

Enter the handy performance-estimating command;

1..10 | %{measure-command {C:\Scripts\Find-DuplicateFiles.ps1 c:\temp}}| Select-Object -ExpandProperty TotalSeconds | Measure-Object -Average

This runs the script 10 times and averages how long each run takes.

And the results?  Just running fc.exe to compare all the files took an average of 1.7 seconds.  If I add the code to filter on files with the same size first (so fc.exe is only run on files that have matching size) it took just under 1 second to run on average.

That’s a big improvement!

Write-Progress is used to put a progress bar up on the console so we can see how the script is running.
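As a stand-alone illustration of the progress bar, here is a minimal sketch that uses a dummy list and a short sleep in place of real file comparisons.  The -Completed call at the end is an addition of mine to clear the bar and is not in the script above.

```powershell
# Dummy work items standing in for $Files.
$Items=1..20
$Count=1
foreach ($Item in $Items)
{
    # Same pattern as the script: current position over total, as a percentage.
    Write-Progress -Activity "Processing Files" -Status "Processing File $Count / $($Items.Count)" -PercentComplete ($Count / $Items.Count * 100)
    Start-Sleep -Milliseconds 50    # pretend to do some work
    $Count+=1
}
# Remove the progress bar once the loop has finished.
Write-Progress -Activity "Processing Files" -Completed
```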

    # $Candidates is the size-filtered list built just before this loop.
    Foreach ($TargetFile in $Candidates)
    {
        if (($SourceFile.FullName -ne $TargetFile.FullName) -and !(($MatchedSourceFiles |
             Select-Object -ExpandProperty File) -contains $TargetFile.FullName))
        {
            Write-Verbose "Matching $($SourceFile.FullName) and $($TargetFile.FullName)"
            Write-Verbose "File sizes match."
            if ((fc.exe /A $SourceFile.FullName $TargetFile.FullName)  -contains "FC: no differences encountered")
            {
                Write-Verbose "Match found."
                $MatchingFiles+=$TargetFile.FullName
            }
        }
    }

Here we loop through all the files that match $SourceFile’s size and compare them using fc.exe.

        if (($SourceFile.FullName -ne $TargetFile.FullName) -and !(($MatchedSourceFiles |
             Select-Object -ExpandProperty File) -contains $TargetFile.FullName))

This line skips a target file if it is $SourceFile itself (so we don’t compare a file to itself) or if it already appears as a source file in $MatchedSourceFiles (so we don’t re-check files that have already been matched).
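To see the condition in isolation, here is a small sketch with made-up paths.  It assumes a.jpg was processed earlier and recorded with b.jpg as its duplicate;  when b.jpg later becomes the source file, a.jpg should be skipped as a target.

```powershell
# Made-up state: a.jpg has already been recorded as a source file with a duplicate.
$MatchedSourceFiles=@([pscustomobject]@{File='C:\temp\a.jpg'; MatchingFiles=@('C:\temp\b.jpg')})
$SourceFile=[pscustomobject]@{FullName='C:\temp\b.jpg'}
$TargetFile=[pscustomobject]@{FullName='C:\temp\a.jpg'}

# Second half of the test: is the target already listed as a matched source file?
$AlreadyMatched=($MatchedSourceFiles | Select-Object -ExpandProperty File) -contains $TargetFile.FullName
# Full test: compare only if the files differ by path and the target is not already matched.
$ShouldCompare=($SourceFile.FullName -ne $TargetFile.FullName) -and !$AlreadyMatched
$ShouldCompare    # False
```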

Because fc.exe is not a native PowerShell command it doesn’t return nice, formatted objects but outputs text.  To get a match therefore we need to check for a particular string in the output (“FC: no differences encountered”).
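The text output of an external command arrives in PowerShell as an array of strings, one per line, which is why -contains works here.  The sketch below uses a hand-written stand-in for the output (the paths and the first line are illustrative, not captured from a real run) to show that -contains needs a whole line to be equal, not just part of one.

```powershell
# Illustrative stand-in for fc.exe output on two identical files (made-up paths).
$Output=@(
    'Comparing files C:\temp\a.txt and C:\TEMP\B.TXT'
    'FC: no differences encountered'
    ''
)
# -contains looks for an element equal to the whole string, so the second line matches.
$Output -contains 'FC: no differences encountered'      # True
# A partial string does not count as a match with -contains.
$Output -contains 'no differences'                      # False
```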

All the matching files have their full path added to $MatchingFiles.

    if ($MatchingFiles.Count -gt 0)
    {
        $NewObject=[pscustomobject][ordered]@{
            File=$SourceFile.FullName
            MatchingFiles=$MatchingFiles
        }
        $MatchedSourceFiles+=$NewObject
    }
    $Count+=1
}
$MatchedSourceFiles

Here is where we format the output.  Each file with a list of matching partners is used to create a custom PSObject.  That object has two fields;   the path to the file we are checking and a list of all the matching files’ paths.  These are added to our collection ($MatchedSourceFiles) and returned at the end.
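Because MatchingFiles is a list inside each object, the results can be awkward to read or export as they stand.  One possible way to flatten them to one row per duplicate pair is sketched below;  $Results is made-up data shaped like the script’s output, and the Duplicate property name is my own choice.

```powershell
# Made-up sample shaped like the script's output.
$Results=@(
    [pscustomobject][ordered]@{
        File='C:\temp\a.jpg'
        MatchingFiles=@('C:\temp\copy\a.jpg','C:\temp\old\a.jpg')
    }
)
# Flatten to one row per (file, duplicate) pair.
$Pairs=foreach ($Result in $Results)
{
    foreach ($Match in $Result.MatchingFiles)
    {
        [pscustomobject]@{File=$Result.File; Duplicate=$Match}
    }
}
$Pairs | Format-Table -AutoSize
```

The flattened rows could equally be piped to Export-Csv instead of Format-Table.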
