Payment matching done right

I use ledger to keep an accurate record of my finances. All those financial actions of receiving, spending, saving, investing or moving around money will be reflected by transactions in my ledger file. If you have ever tried to do anything similar, you might think that maintaining the ledger is cumbersome, time consuming and error-prone. In reality, nothing could be further from the truth. You need proper tooling, though.

Any half-decent bank will provide access to your account history as some kind of downloadable CSV (or maybe an Excel file that can be converted to CSV fairly easily). Each entry in the CSV has fields for the transaction identifier, a date (or maybe two, if they separate the value date from the transaction date), the amount and some text which aims to identify what the transaction was or who the counterparty is.

As it turns out, the only real question when converting all this into a ledger entry is finding the correct counter- (or payee-) account of each transaction. I say counter-account, because the account itself is known: it is the representation (typically an Asset in accounting lingo) of the very account that the history pertains to.

Since I started keeping all my financial records in double entry bookkeeping, I tried a few different solutions. First I used GnuCash, but that proved to be a horrible experience when importing my account statements. I basically had to choose the payee account from drop-down menus in the import wizard on a per transaction basis. I also disapproved of the XML file which looked unnecessarily complicated and got quite large.

Then I migrated my books to ledger and never looked back. Finally I felt in control of the whole flow: a simple text file is all I need to maintain. So I adapted my importer scripts (by this time I had a few, initially to convert CSV/XML to QIF that I could import into GnuCash). However, the problem of finding the right account to write the bookkeeping remained unsolved. I tried a few alternatives (ledger --xact, reckon, etc.) but none seemed to work reliably enough.

This got me thinking about the underlying problem. What is the best algorithm for matching a bank account transaction to a bookkeeping account?

The matching algorithm

Assuming you have any sort of financial history in your ledger, you have most likely already invented a number of bookkeeping accounts to label the sources and sinks of your money. For example Income:Salary is where your salary comes from, and Expenses:Rent is where each month’s rent payment is booked. When adding new transactions, they will most likely fall in one of the existing categories, since they will be, for the most part, new instances of recurring transactions you already have in your books (like paying the rent, or shopping once again in your favourite grocery store). When considering a new transaction to be added to your ledger, the fundamental question is this: which of these bookkeeping accounts should be chosen as the counterparty?

Preferably, this should be decided with no up-front configuration such as regex rules or anything like that. I also don’t want the algorithm to ask me any questions. I will adjust the generated entries anyway, fixing anything I might not like, so guessing is fine as long as the guesses are reasonably good. What’s more, I want the tool to automatically adapt to any changes I might enact on the ledger. So if I decide that from now on, the rent is bookkept under Expenses:Housing (and run an auto-search-replace for that in my whole ledger file), this change should be picked up by the algorithm.

So how to do this? I use the following naive Bayesian inference algorithm.

Step 1

Parse the ledger file. For each account, maintain a table of {token→N} entries, increasing the tally for all the tokens from the descriptor text in the table for that account.

Also, add each token to the set of all tokens T and each account to the set of accounts A, so we have these sets ready for the subsequent steps where we will iterate over them.

Let N(t,a) be the number of occurrences of token t in account a.

Step 2

When parsing a new transaction, scan the descriptor into tokens. Based on the presence of each token, a vector of probabilities of the transaction belonging to each possible account is calculated. Let P_belong(t,a) be the probability that a transaction with the token t belongs to account a.

We define P_belong(t,a) as:

Then, after we have P_belong(t,a) for all tokens in the descriptor for all accounts, we use the Bayes theorem to calculate the combined probability of the transaction to belong to the accounts.

Let P(a) be the probability of the descriptor to correspond to account a. For each account, P(a) is defined as:

Then, the account with the greatest P(a) should be chosen.

Note that step 1 does not depend on the transaction to be decided on, just the current ledger file. The results of reading the ledger file may thus be precomputed and then only the remaining part of the algorithm needs to be carried out on a per transaction basis.

The implementation

My implementation is done in Clojure, which I consider one of the most promising new languages. Clojure is my personal favourite among these. (My other favourite languages, Erlang, Common Lisp and C, are not exactly new.)

The repository is among my public repos, called banks2ledger. Read the repository README for general instructions and usage.

One interesting point I want to make here is the way tokenizing is done. It is super simple but improves the quality of the matches quite a bit. When breaking the transaction’s description string into tokens, by default every substring separated by spaces (i.e. conventional words) will be a token. (The tokens are not case sensitive, so everything is converted to uppercase.) The comma , and the slash / are also taken as token separators.

What’s more, different forms of dates are replaced with degraded forms. This is because the same recurring transaction will show up with different dates in the ledger file; we want all these, and even future dates, to weigh towards the same payee account (in terms of Bayesian inference). So a substring like “20160317” will be replaced by the generic (degraded) token “YYYYMMDD”. The same goes for dates specified as “YY-MM-DD”. In practice, it is these two formats pertaining to card transactions that I have come across.

One final quirk: when editing the newly added transactions in my ledger file, I will sometimes add some notes after the end of the original descriptor string. I separate these notes with a pipe | character from the original content that came from the bank account CSV. The notes contain additional information concerning the transaction that I want to be able to look up later. But since this part does not come from the original CSV, it is not useful to parse it when tokenizing. So when descriptor lines are read from the ledger file, any pipe character terminates the line.

Driving the flow

Each month I download the account history CSV (or XLS) from all the banks I have accounts with. I save all these files according to simple naming conventions (i.e. bankname_YYYY_MM.csv) and archive them indefinitely for further reference.

After downloading the account files for the recently closed month, I run make which will drive the conversion of all the account files into ledger format. These files will be saved with the same names as the originals, changing the extension to .dat.

Below is the Makefile I currently use. (Note that in two cases I download the file as XLS or even XLSX and convert that to CSV via ssconvert, a useful command line tool that comes with Gnumeric). The JAR file run by make is the result of running lein uberjar in the banks2ledger repo. It is a standalone executable version of my Clojure application.

dats := $(patsubst %.xls,%.dat,$(wildcard *.xls))
dats += $(patsubst %.csv,%.dat,$(wildcard *.csv))
dats += $(patsubst %.xlsx,%.dat,$(wildcard *.xlsx))

all : $(dats)

LEDGER ?= "/home/tom/doc/ledger/ledger.dat"
B2LJAR ?= "/home/tom/bin/banks2ledger-1.0.0-standalone.jar"

bp_%.dat: bp_%.csv
	java -jar $(B2LJAR) -l $(LEDGER) -f $< -sa 3 -sz 2 -D 'yyyy/MM/dd' \
	-r 3 -m 4 -t '%9!%1 %6 %7 %8' -a 'Assets:Budapest Bank' -c HUF > $@

seb_%.dat: seb_%.csv
	java -jar $(B2LJAR) -l $(LEDGER) -f $< -sa 5 -r 2 -m 4 -t '%3' \
	-a 'Assets:SEB Privatkonto' > $@

ica_%.dat: ica_%.csv
	java -jar $(B2LJAR) -l $(LEDGER) -f $< -F ';' -sa 1 -m 4 -t '%1' \
	-a 'Assets:ICA Bank' > $@

bp_%.csv: bp_%.xls
	ssconvert $< $@

seb_%.csv: seb_%.xlsx
	ssconvert $< $@

After the conversion is done, I go through the resulting .dat files and manually edit them as I see fit. When that is done, I append them to my main ledger file and eventually, if everything looks fine, commit the changes to that file. The .dat files are also archived along with the sources.

Executing this flow takes about half an hour each month – including logging in to all webbanks and downloading the account history files, running make which in turn runs banks2ledger on all of them, and finally reviewing and editing the results, committing the updated master ledger to git. In return, I get detailed records of all my money movements, which I find to be an indispensable resource.