Converting American Express PDF Statements to CSV with Powershell

Gah.  Importing accounts into accountancy programs can be tedious.  What makes it more tedious is when banks make life even more difficult.  American Express does at least let you download statements in a nice CSV format, though.

Yay!

Except if the statement is older than 6 months, in which case you have to download it as a non-formatted PDF.

Boo.

Last time I did this manually by grabbing elements from the PDF, pasting them into a CSV file one block at a time and then massaging the data to try and get it into my accounts package.  This was tedious, time-consuming and mostly inaccurate.

So this time, I exported the PDFs to .TXT and then wrote some Powershell to produce a formatted CSV from the non-formatted export.  This was interesting, time-consuming and mostly accurate.  So that’s a win in my book.

I used this as an opportunity to learn regular expression processing in Powershell, so there are two ways of doing it in the following code; the second is what I tried first and is inelegant and overly complex.  Both work though (for a certain definition of ‘work’, see below).
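To make the two approaches concrete: the first uses one pattern with capture groups to pull every field out in a single match; the second runs a separate match per field.  Here's a minimal sketch of both using Python's re module (the regex ideas carry straight over to Powershell's -match); the sample line is invented and only mimics the shape of the exported text.

```python
import re

# A made-up line mimicking the shape of the exported statement text
line = "Nov 02 Nov 03 SOME SHOP LONDON 12.34"

# The nicer way: one pattern with capture groups extracts every field at once
m = re.match(r"^([A-Za-z]{3} \d{1,2}) [A-Za-z]{3} \d{1,2} (.*?) (\d+,?\d*\.\d+)$", line)
date, payee, amount = m.groups()   # "Nov 02", "SOME SHOP LONDON", "12.34"

# The less nice way: a separate match per field of interest
date2 = re.match(r"^[A-Za-z]{3} \d{1,2}", line).group(0)
amount2 = re.search(r"\d+,?\d*\.\d+$", line).group(0)
```

The single-pattern version wins because the anchored groups keep each other honest; the per-field version has to re-describe the line's shape three times, which is where my first attempt went wrong.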

I still had to check the data and manually fix a few things but as the source was mostly unstructured text I was amazed it worked as well as it did.

The script is below; I’ve added some comments but it’s mostly as-is from when I bashed it out.

UPDATE:  I’ve revisited this with an updated and expanded script HERE.

#Export the American Express statement to .TXT then throw the script at it
#Source file is the txt export of the pdf
$FilePath="d:\temp\Statement_Dec 2013.txt"
#Where to put the output csv
$ExportFile="D:\temp\output.csv"
$File=Get-Content $FilePath
#Import the text file.  We're only interested in lines that start with 3 letters, a space and then a number (ie Nov 02), or lines that start with an amount
$FilteredFile1=$File | ?{($_ -match "^[A-Za-z]{3} \d") -or ($_ -match "^\d+\,?\d*\.\d+")}
$List=@()
for($i=0;$i -lt $FilteredFile1.Count;$i++)
{
    $FilteredString=""
    #If the entry starts with a date
    if ($FilteredFile1[$i] -match "^\w{3}\s\d")
    {
        #Sometimes the entry spills onto the following line;  the following checks for that (a number at the start of the line).  If found the two lines are merged.
        if($FilteredFile1[$i+1] -match "^\d+\,?\d*\.\d+\s?$")
        {
            $FilteredString= $FilteredFile1[$i]+" "+ ($FilteredFile1[$i+1]).Trim()
        }else
        {
            $FilteredString=$FilteredFile1[$i]
        }
    #Otherwise check it's in the format amount immediately followed by text (these are the credit/inflow lines);
    }elseif ($FilteredFile1[$i] -match "^\d+\,?\d*\.\d+[A-Za-z]+")
    {
        $FilteredString=$FilteredFile1[$i]
    }
    if ($FilteredString -ne "")
    {
        #Create the output object first, then add the empty Category and Memo columns to it
        $Output=new-object psobject
        $Output | add-member -membertype noteproperty -name "Category" -value ""
        $Output | add-member -membertype noteproperty -name "Memo" -value ""
        #Numbers sometimes use the format where 1000's are separated by commas
        if($FilteredString -match "^\d+\,?\d*\.\d+")
        {
            #The nicer way of doing it;
            #These lines start with an amount ("^\d+\,?\d*\.\d+"), followed by the payee and then a 3-letter month and a day
            #Every bracketed part of the regular expression is output as a separate entry in the $Matches array ([0] is always the entire match)
            #So we're interested in  (Inflow) (Payee) (Date)
            $FilteredString -match "(^\d+\,?\d*\.\d+)(.*)([A-Za-z]{3} \d{1,2}\s)" 1>$null
            $Output | add-member -membertype noteproperty -name "Outflow" -value ""
            $Output | add-member -membertype noteproperty -name "Inflow" -value ([float]$matches[1])
            $Output | add-member -membertype noteproperty -name "Payee" -value $matches[2]
            #Formats the date using 2013 as the year
            $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/13}" -f [datetime]$Matches[3])

        }else
        {
            #It doesn't start with a number;  these seem to be outflows
            #The less nice way that I wrote first; a separate comparison for each part we're interested in.
            $Output | add-member -membertype noteproperty -name "Inflow" -value ""
            #The lazy (.*?) stops the payee capture from swallowing the trailing amount
            $FilteredString -match "^([A-Za-z]{3} \d{1,2}\s){2}(.*?)\s\d+\,?\d*\.\d+\s?$" 1>$null
            $Output | add-member -membertype noteproperty -name "Payee" -value $matches[2]
            $Matches.Clear()
            $FilteredString -match "^[A-Za-z]{3} \d{1,2}" 1>$null
            #Formats the date using 2013 as the year
            $Output | add-member -membertype noteproperty -name "Date" -value ("{0:dd/MM/13}" -f [datetime]$Matches[0])
            $Matches.Clear()
            $Filteredstring -match "\d+\,?\d*\.\d+\s?$" 1>$null
            $Output | add-member -membertype noteproperty -name "Outflow" -value ([float]$matches[0])
            $Matches.Clear()

        }
        #Write to a list of objects
        $List=$List+$Output
    }
}
#Output the list of objects to a file;  this only needs to happen once, after the loop
$List | Export-CSV -Path $ExportFile -NoTypeInformation
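For anyone who wants to see the overall flow without firing up Powershell, here's the same pipeline compressed into a short Python sketch: filter the exported lines, merge amounts that spilled onto the following line, then classify each entry as an inflow or an outflow.  The sample lines are invented; they only mimic the shape of an AMEX text export.

```python
import re

def parse(lines):
    # Keep lines that start with a date (e.g. "Nov 02") or with an amount
    kept = [l for l in lines
            if re.match(r"^[A-Za-z]{3} \d", l) or re.match(r"^\d+,?\d*\.\d+", l)]
    rows = []
    i = 0
    while i < len(kept):
        line = kept[i]
        # An amount sitting alone on the next line belongs to this entry
        if (re.match(r"^[A-Za-z]{3} \d", line) and i + 1 < len(kept)
                and re.match(r"^\d+,?\d*\.\d+\s?$", kept[i + 1])):
            line = line + " " + kept[i + 1].strip()
            i += 1
        if re.match(r"^\d", line):
            # Amount first: an inflow, laid out as (amount)(payee)(date)
            m = re.match(r"^(\d+,?\d*\.\d+)(.*?)\s*([A-Za-z]{3} \d{1,2})", line)
            rows.append({"Inflow": m.group(1), "Payee": m.group(2), "Date": m.group(3)})
        else:
            # Date first: an outflow; pull the date and the trailing amount separately
            date = re.match(r"^[A-Za-z]{3} \d{1,2}", line).group(0)
            amount = re.search(r"\d+,?\d*\.\d+\s?$", line).group(0).strip()
            rows.append({"Outflow": amount, "Date": date})
        i += 1
    return rows

rows = parse([
    "Account summary",                     # dropped by the filter
    "Nov 02 Nov 03 SOME SHOP LONDON",      # amount spilled onto the next line
    "12.34",
    "1,000.00CrPAYMENT RECEIVED Nov 05",   # inflow: amount first, date last
])
```

The real script does the same thing, just with Powershell's -match and $Matches in place of re.match and .group(), plus the date reformatting and empty Category/Memo columns.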

3 Replies to “Converting American Express PDF Statements to CSV with Powershell”

  1. Hi h3rring,

    Thank you so much for taking the time to write this script and openly share it. I can’t believe AMEX is so archaic that something like this is still required. I’m actually quite surprised you haven’t had more interest or at least some comments on your post about this given how much of a headache AMEX’s system would be to many, many people.

    In any case, I’m not a programmer/script writer by any means and I’m completely new to Powershell but I usually can figure these things out relatively quick so I’ve been trying to get your script to work for my AMEX statements.

    When I first ran the script as is (just with my import and export file name changes) I got the following error:
    Unexpected token ‘1’ in expression or statement.
    At F:\Google Drive\Search Paladin\Tax\AMEX PDF to CSV script\amexpdftocsv_original.ps1:41 char:84
    + $FilteredString -match “(^\d+\,?\d*\.\d+)(.*)([A-Za-z]{3} \d{1,2}\s)” 1 <<<$null
    + CategoryInfo : ParserError: (1:String) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : UnexpectedToken

    Do you know what the problem might be? I am from Australia so many of my transactions have data showing the USD and AUD currency and therefore the format of the exported .txt might be a bit different to yours.

    However, with some modifications to the PDF exported to .txt file and modifications to the script (in particular removing all mentions of 1>$null) I managed to get a .csv file export with all correct dates and transactions (even the transactions in the correct currency). This is brilliant! What didn’t export/populate was any data in the inflow, payee, category or memo columns. I presume it was, but I am not sure if your script was designed to extract that information? The payee and category data will obviously be of value to extract. See the exported .csv data below:
    "Inflow","Payee","Date","Outflow","Category","Memo"
    "",,"15/04/16","37.66","",""
    "",,"16/04/16","10.05","",""
    "",,"18/04/16","13.41","",""
    "",,"28/04/16","338.66","",""
    "",,"29/04/16","17.57","",""
    "",,"30/04/16","948.45","",""
    "",,"04/05/16","10.11","",""
    "",,"07/05/16","40.37","",""
    "",,"15/05/16","117.43",,

    Any help in modifying the script to work would be very much appreciated. I can provide you with the modified scripts I’ve made and raw .txt data file if that will make it easier to understand my particular AMEX export and why the script as is isn’t working like it did for you.

    Please feel free to remove this comment if it is too long and cluttering and move any correspondence to email if you would like. I would be grateful for any advice.

    Kind Regards,
    James
