I wrote some very quick and dirty code to import American Express PDF statements to CSV here. I could export the PDF to TXT and then process the text file with PowerShell.I had to revisit it the other day as I had a raft of PDF statements to convert and import into YNAB (and not just from AMEX).
Of course, all the bank and credit card vendors use a standard PDF format for statements so it was easy.
Right?
No, actually pretty much everyone just does their own thing. And by ‘own thing’ I mean PDFs that are not even consistent within themselves.
So: the joy of regular expressions.Here’s the updated script;
#Export the American Express statement to .TXT then throw the script at it [cmdletbinding()] param ( [Parameter( Mandatory=$true, ValueFromPipeline=$True, ValueFromPipelineByPropertyName=$True ) ] [Alias("FullName")] [ValidateScript({Test-Path $_ -IsValid})] [string]$Path ) BEGIN { $Year="15" $ExportExtension=".csv" $SourceExtension=".txt" } PROCESS { #Source file is the txt export of the pdf if ($Path -notmatch ("\"+$SourceExtension+"$")) { Write-Error "Source file does not have extension $SourceExtension" Break } #Where to put the output csv $ExportFile=$Path -replace ($SourceExtension+"$"), $ExportExtension Write-Verbose "Source File : $Path" Write-Verbose "Export File : $ExportFile" $File=Get-Content $Path #Import the text file. We're only interested in lines that match any of the criteria we check for later [array]$FilteredFile=$File | ?{($_ -match "^([A-Za-z]{3} \d{1,2})\s([A-Za-z]{3} \d{1,2})") -or ($_ -match "(^\d+\,?\d*\.\d+)(.*)([A-Za-z]{3}\s\d{1,2})\s([A-Za-z]{3}\s\d{1,2}.*)") -or ($_ -match "^(\d{2}\s[A-Za-z]+)\s(\d{2}\s[A-Za-z]+)\s(.*)\s(\d+.*)") -or ($_ -match "^\s*CR")} $List=@() for($i=0;$i -lt $FilteredFile.Count;$i++) { $FilteredString="" #If the entry starts with a date Write-Verbose " Processing $($FilteredFile[$i])" if (($FilteredFile[$i] -match "^\w{3}\s\d") -or ($FilteredFile[$i] -match "^\d{2}\s[A-Za-z]+\s\d{2}") -or ($FilteredFile[$i] -match "^\d+\,?\d*\.\d+")) { #Sometimes the entry spills onto the following line; #the following checks for that (a number at the start of the line) or CR for Credit. If found the two lines are merged. if(($FilteredFile[$i+1] -match "^\d+\,?\d?\.\d+\s?$") -or ($FilteredFile[$i+1] -match "^\s*CR")) { $FilteredString= $FilteredFile[$i]+" "+ ($FilteredFile[$i+1]).Trim() }else { $FilteredString=$FilteredFile[$i] } #Otherwise check it's the format Number then date ; }elseif ($FilteredFile[$i] -match "^\d+\,?\d*\.\d+[A-Za-z]+") { $FilteredString=$FilteredFile[$i] } if ($FilteredString -ne "") { Write-Verbose " $FilteredString" $Output=New-Object -typename PSObject $Output | add-member -membertype noteproperty -name "Category" -value "" $Output | add-member -membertype noteproperty -name "Memo" -value "" #Numbers sometimes use the format where 1000's are seperated by commas if($FilteredString -match "(^\d+\,?\d*\.\d+)(.*)([A-Za-z]{3}\s\d{1,2})\s([A-Za-z]{3}\s\d{1,2}.*)") { #Lines start with a numeral ("^\d+\,?\d*\.\d+") followed by 3 digit #Every regular expression within brackets is output as a seperate entry in the matches array ([0] is always the entire string) #So we're interested in (Amount) (Payee) (Date) if ($Matches[4] -like "*CR*") { $Output | add-member -membertype noteproperty -name "Outflow" -value "" $Output | add-member -membertype noteproperty -name "Inflow" -value ([float]$matches[1]) $Output | add-member -membertype noteproperty -name "Payee" -value $matches[2] #Formats the date using $Year as the year $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/$Year}" -f [datetime]$Matches[3]) }else { $Output | add-member -membertype noteproperty -name "Outflow" -value ([float]$matches[1]) $Output | add-member -membertype noteproperty -name "Inflow" -value "" $Output | add-member -membertype noteproperty -name "Payee" -value $matches[2] #Formats the date using $Year as the year $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/$Year}" -f [datetime]$Matches[3]) } }elseif ($FilteredString -match "^(\d{2}\s[A-Za-z]+)\s(\d{2}\s[A-Za-z]+)\s(.*)\s(\d+.*)") { #Here the format is Day Month Space Day Month Space String Amount [Optional String] #Equating to Date, Processed Date, Payee, Amount (Credit or not) if ($Matches[4] -like "*CR*") { $Output | add-member -membertype noteproperty -name "Inflow" -value ([float]$Matches[4].Split("CR")[0]) $Output | add-member -membertype noteproperty -name "Payee" -value $matches[3] $ProcessedDateString="{0:dd/MM/$Year}" -f [DateTime]::ParseExact($matches[2],'dd MMMM',$null) $Output.Memo="Processed Date : $ProcessedDateString" $Output | add-member -membertype noteproperty -name "Date" -Value "" $Output.Date=("{0:dd/MM/$Year}" -f [DateTime]::ParseExact($matches[1],'dd MMMM',$null)) $Output | add-member -membertype noteproperty -name "Outflow" -value "" }else { $Output | add-member -membertype noteproperty -name "Inflow" -value "" $Output | add-member -membertype noteproperty -name "Payee" -value $matches[3] $ProcessedDateString="{0:dd/MM/$Year}" -f [DateTime]::ParseExact($matches[2],'dd MMMM',$null) $Output.Memo="Processed Date : $ProcessedDateString" $Output | add-member -membertype noteproperty -name "Date" -Value "" $Output.Date=("{0:dd/MM/$Year}" -f [DateTime]::ParseExact($matches[1],'dd MMMM',$null)) $Output | add-member -membertype noteproperty -name "Outflow" -value ([float]$matches[4]) } $Matches.Clear() }else { #For AMEX PDFs : Mostly #Format (3 Char Month) (1-2 Char Day) Space (3 Char Month) (1-2 Char Day) Space String Decimal (Optional String) #This corresponds to Date, Processed Date, Payee, Amount and whether it's a credit or not $FilteredString -match "^([A-Za-z]{3} \d{1,2})\s([A-Za-z]{3} \d{1,2})(.*)(\s\d+\,?\d*\.\d+.*)" 1>$null if ($Matches[4] -like "*CR*") { #If there's a CR at the end of the line then it should be a credit and the value should go into inflow $Output | add-member -membertype noteproperty -name "Inflow" -value ([float]$Matches[4].Split("CR")[0]) $Output | add-member -membertype noteproperty -name "Payee" -value $matches[3] $ProcessedDateString="{0:dd/MM/$Year}" -f [datetime]$Matches[2] $Output.Memo="Processed Date : $ProcessedDateString" #Formats the date using $Year as the year $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/$Year}" -f [datetime]$Matches[1]) $Output | add-member -membertype noteproperty -name "Outflow" -value "" }else { #Otherwise put the value in outflow $Output | add-member -membertype noteproperty -name "Inflow" -value "" $Output | add-member -membertype noteproperty -name "Payee" -value $matches[3] $ProcessedDateString="{0:dd/MM/$Year}" -f [datetime]$Matches[2] $Output.Memo="Processed Date : $ProcessedDateString" $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/$Year}" -f [datetime]$Matches[1]) $Output | add-member -membertype noteproperty -name "Outflow" -value ([float]$matches[4]) } $Matches.Clear() } #Write to a list of objects $List=$List+$Output } #Output the list of objects to a file $List | Export-CSV -Path $ExportFile -NoTypeInformation -Force } }
The main changes from the original are as follows.
First I modified it so it would take pipeline input, test the paths passed to it and set a few variables I’ll use in the script;
param ( [Parameter( Mandatory=$true, ValueFromPipeline=$True, ValueFromPipelineByPropertyName=$True ) ] [Alias("FullName")] [ValidateScript({Test-Path $_ -IsValid})] [string]$Path ) BEGIN { $Year="15" $ExportExtension=".csv" $SourceExtension=".txt"
The pipeline input is useful as I can process an entire folder of TXT statements in one go via;
GCI *.txt | .\Process-CreditCardStatement.ps1
The script also dynamically generates the output filename from the source (it just changes the extension). Again, that means I can pipe multiple files to it for processing.
Next I broadened the filter at the beginning; it now looks for a range of patterns and filters them into $FilteredFile.
[array]$FilteredFile=$File | ?{($_ -match "^([A-Za-z]{3} \d{1,2})\s([A-Za-z]{3} \d{1,2})") -or ($_ -match "(^\d+\,?\d*\.\d+)(.*)([A-Za-z]{3}\s\d{1,2})\s([A-Za-z]{3}\s\d{1,2}.*)") -or ($_ -match "^(\d{2}\s[A-Za-z]+)\s(\d{2}\s[A-Za-z]+)\s(.*)\s(\d+.*)") -or ($_ -match "^\s*CR")}
This does look pretty horrendous but all it’s doing is going through the TXT file and making sure relevant data is copied out, leaving all the cruft.
Some examples of patterns matched are;
- Aug 24 Aug 24 Steam 35.00
- 28 JUNE 29 JUNE MCDONALDS 17.74
A good reference for regular expressions is here but here’s an example breakdown of one of the expressions;
“^([A-Za-z]{3} \d{1,2})\s([A-Za-z]{3} \d{1,2})”
^ : Start the match from the beginning of the line; this prevents the string being matched anywhere in the text.
[A-Za-z] : Matches any letter of any case
{3} : 3 occurrences of the previous match. So in this case it would match any string of 3 letters.
\d{1,2} : \d is a digit and {1,2} matches either 1 or 2 occurrences.
\s : A whitespace character
The matching pattern is then repeated. This matches successfully to lines where there are two dates such as;
Mar 01 Mar 03
After that the script checks if the data has spilled onto the next line of the TXT file. Normally it’s only the final “CR” (indicating the transaction is a credit) or the amount that’s on the following line so it only checks for either of those patterns;
($FilteredFile[$i+1] -match "^\d+\,?\d?\.\d+\s?$") -or ($FilteredFile[$i+1] -match "^\s*CR")
In the final part of the script it generates an object with Inflow, Payee, Date, Outflow, Memo and Category entries. I noticed that some programs are sensitive to the order so the object attributes are always added in the same order. These match the CSV import requirements for YNAB.
The script looks for three types of transaction line;
[Amount][String1][Date][Date][String2]
Where Amount is a number in currency format, Date is in the format “MMM DD”, String1 is the Payee and String2 is an optional occurrence of “CR” for a credit.
[Date][Date][String1][Amount][String2]
Where Amount is a number in currency format, Date is in the format “DD MMM”, String1 is the Payee and String2 is an optional occurrence of “CR” for a credit.
[Date][Date][String1][Amount][String2]
Where Amount is a number in currency format, Date is in the format “MMM DD”, String1 is the Payee and String2 is an optional occurrence of “CR” for a credit.
Each part of the matched string is encased in brackets which means each matching component is stored in a separate array entry in $Matches. The script then stores each $Matches entry in a different attribute of the output object. If there is a final “CR” in the entry then the amount is stored within “Inflow” (as a payment) or “Outflow” otherwise (as a charge).
Finally all the processed objects are written to a CSV.