Finding a good historical data source


In the last article, we introduced the idea of building robust trading systems for foreign exchange, and compared some of the characteristics of the FX and Futures markets, with FX having its own unique, but also non-random, behaviour.  This article now explores the first major challenge of actually building a system, namely, building a reliable historical database:

If we were discussing futures markets, this would be relatively straightforward, as there is only one price traded at any given time with a specific volume, which is readily available, direct from almost all of the relevant futures exchanges as well as third parties.  The FX market is rather unique though:

While being by far the most liquid market in the world, it’s also the most fragmented.  With no central exchange, each bank makes its own price for each currency pair.  Therefore, at any given moment, EURUSD may theoretically be quoted as 1.3340/42 at one bank, 1.3339/41 at another and 1.3341/43 at a third, each on its own white-labelled or proprietary electronic trading platform, otherwise known as an ECN (Electronic Communication Network).  There are also a growing number of ECNs competing for liquidity, where ‘buy side’ counterparties can submit their own prices into the systems.  This makes it impossible to get a truly complete, clean and accurate picture of intraday FX prices.  However, even the current, fragmented, electronic market is a quantum leap forward from how the market operated only relatively recently:

In the beginning – Voice Brokers

Before ECNs existed, most FX trading was done over the phone, with a trader sitting on a ‘spot’ desk, as the author once was, with half a dozen ‘broker boxes’, all shouting out prices.  For example a ‘Dollar Mark’ (US Dollar v German Deutsche Mark) spot dealer  (the author pre-dates the Euro) might have one broker box calling out, “thirty, thirty-five, in five” and another, “thirty, thirty-four, three by five” etc., with the ‘three by five’ denoting the size, in millions, that the price was good in and the ‘big figure’ not quoted as that was known by all involved.

Each trader, for each currency pair, would have a number of boxes shouting out similar prices and hence the classic image of a bank’s trading floor, being a cacophony of sound.  The reality is much different these days, with the voice brokers having been almost entirely replaced by ECNs, particularly in the major currency pairs.

In the days of voice brokers, part of the spot trader’s art was to recognise the voice of the broker with the best price, good in the size he wanted to execute.  For a junior dealer this was probably the hardest skill to master, particularly as the broker at each institution wouldn’t always be the same person: they would need to go to lunch, be away on holiday, or just step off the desk for a few moments.  A junior on a desk would usually cover several dealers when they similarly stepped off the desk, so may have had over twenty voices to recognise, remembering which ‘box’ each was on.  All the deals were also entered manually, unlike today’s ECNs, where the deals automatically go into the trading ‘blotter’.  On this occasion it is probably fair to say that junior traders today really do have it easy by comparison.

If an order was too large to execute with just one counterparty, a ‘call out’ would be made, where the dealer would stand up and shout, “Get me calls!”  Every other dealer on the desk would then call up several banks, either on ‘The Reuters’ (an inter-bank chat system) or by telephone.  Each dealer would then shout out the prices he was being made and the dealer initiating the activity would make hand signals and shout “yours” or “mine”, to indicate whether he wanted to sell or buy.  There was a great deal of ‘spoofing’ that went on, which was part of the art of good execution and of mastering spot trading:

For example, if a dealer at one bank took a ‘call’ from another, and found they were a seller, he might also sell, believing a large order was going through and expecting the price to fall as the other bank continued to execute its order.  This meant that one would often buy from the first few ‘calls’, hoping this would prompt the other banks to believe you were a buyer and drive the price up by quoting higher prices, into which you could then sell.  Hence it was always a game of bluff, counter-bluff and spoof.

One anecdote worth recounting, in which the author was involved, is a spot desk of a first tier bank, making a huge return in the space of a few minutes, solely by a simple, but beautifully executed spoof:

The bank was known to be one that had a good relationship with the Bank of Japan (BoJ), and through which the BoJ had intervened in the market before to strengthen its currency, occasionally coming into the market and selling a colossal, market-moving amount of USDJPY and DEMJPY.  This always kept dealers wary of being the other way around, lest they got caught the wrong way on an intervention, and hence kept the Yen supported.

Therefore the chief dealer and his number two, the Yen trader, knew that if the bank was to be seen selling a huge amount of DEMJPY and USDJPY, the market may well think that the BoJ was intervening and would then also start selling, to capture the pending move down.   One day they stood up and shouted “Get me calls!”, which in itself wasn’t unusual, as this happened on most large orders:

As each of the other traders, and assistants, all started getting prices from banks and shouting them out, they shouted, “yours!” together with the hand gesture of pushing an open hand down and away from the body (for the avoidance of any doubt as to the instruction) until they’d sold literally several hundred million US Dollars and German Deutsche Marks, against the Yen.

Nobody knew what was going on, but everyone did his or her job and got the order executed.  The sales desk was asking what was happening, as customers called up to ask what the reason was for the big move, as everybody saw and heard the huge commotion coming from the spot desk and the inevitable rumour spread that it was ‘BoJ’ intervention.  Nobody on the desk said a word to confirm or deny the rumour, as nobody else on the desk, knew what was really going on.  Just tallying the total amount sold and reconciling the now huge position the desk had, was not an easy task.

As the rumour spread and speculation mounted, USDJPY and DEMJPY continued to fall rapidly.  Then came the second wave, or so everybody thought.  Again the Chief Dealer shouted, “Get me calls!” and started to sell USDJPY and DEMJPY again.  The market thought it was the start of a second wave of selling by the BoJ, as this was their typical style, and accordingly marked their prices much lower and again sold themselves.  Then came the stroke of genius: they started to buy, and buy everything, shouting, “Mine, mine, mine…” with the accompanying hand gesture of bringing the palm of the hand up towards the shoulder, into the still falling prices, as other banks initially thought it was just part of a ‘spoof’ to sell into.

Before the market realized what was going on, they’d covered the entire position and locked in a massive profit, literally in the space of a few minutes.  Everybody on the desk was given a slice of the pie for a job very well done, and it’s the type of trading that we are unlikely to see again; such were the days before the dominance of ECNs.

There is of course a point to this anecdote, other than to record it for posterity:

Although a huge amount of transactions went through in those few minutes, none were recorded by exact time.  The author himself probably executed trades, with more than half a dozen banks, but the most that would have been recorded was either a conversation on ‘Reuters’ or a hurried scribble on a deal ticket after a phone transaction, later reconciled with the counterparty.

Therefore, although an extreme example, it illustrates the point very well; there simply isn’t a completely reliable source of accurate, historic FX data available before the dominance of ECNs and the situation hasn’t improved significantly since:

The Advent of Electronic Trading Platforms

As electronic platforms began to dominate more and more of the volume, so accurate data has become more readily available, as computers are easily able to capture the exact time, price and volume of every trade. However, there is still no central ECN and rather than one becoming the dominant player, as some expected, the market has continued to fragment.  This means that at every minute of the day, each currency pair is trading at different prices, bid/ask spreads and volume. 

Only if one could aggregate all of the prices made on every ECN and by every bank and broker, could a truly accurate record be built.  Even then though, a bank may provide a rate on several ECNs, good in $10mio, but as soon as one of its prices is hit, it will immediately ‘pull’ that rate from the other ECNs.  Therefore, even though a 40 bid may appear to be good in $50mio, if one could aggregate all of the prices at a given moment, the reality is that this may well not be the case if you tried to execute a trade of that size.
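The gap between displayed and executable size can be illustrated with a minimal sketch.  The venue names, bank names and sizes below are entirely hypothetical, and the rule that each provider’s liquidity counts only once is a simplifying assumption rather than a description of how any particular ECN behaves:

```python
from collections import defaultdict

# Hypothetical snapshot of bids at 1.3340 (the "40 bid") across several ECNs.
# Each tuple: (venue, liquidity provider, bid, size in USD millions).
QUOTES = [
    ("ECN_A", "Bank1", 1.3340, 10),
    ("ECN_B", "Bank1", 1.3340, 10),
    ("ECN_C", "Bank1", 1.3340, 10),
    ("ECN_A", "Bank2", 1.3340, 10),
    ("ECN_B", "Bank2", 1.3340, 10),
]


def apparent_size(quotes, price):
    """Naive aggregation: add up every displayed size at the price."""
    return sum(size for _, _, bid, size in quotes if bid == price)


def executable_size(quotes, price):
    """Count each provider once, assuming it pulls its other quotes when hit."""
    per_provider = defaultdict(int)
    for _, provider, bid, size in quotes:
        if bid == price:
            per_provider[provider] = max(per_provider[provider], size)
    return sum(per_provider.values())


print(apparent_size(QUOTES, 1.3340))    # 50
print(executable_size(QUOTES, 1.3340))  # 20
```

In this toy snapshot the aggregated screen shows $50mio on the bid, yet only $20mio comes from distinct providers.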

Trading the Crosses

If someone wanted to sell the Swiss Franc against the Japanese Yen, as it’s not a commonly quoted pair, it has relatively little liquidity on the electronic platforms and as a consequence has a wider price.  However, USDCHF and USDJPY are more actively traded, so a professional trader would go ‘through the legs’ or ‘components’, buying USDCHF and selling USDJPY, with the USD amounts netting out to zero, leaving a CHFJPY position.  This means the trader actually traded CHFJPY, but no price may actually have traded on any ECN or with any broker ‘direct’ in CHFJPY.
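As a rough sketch of the arithmetic, the implied CHFJPY price can be derived from the two legs.  The function name and the quotes used are hypothetical; the only assumption is that a seller of CHFJPY buys USDCHF at the offer and sells USDJPY at the bid, and a buyer does the reverse:

```python
def chfjpy_from_legs(usdchf_bid, usdchf_ask, usdjpy_bid, usdjpy_ask):
    """Return the implied (bid, ask) for CHFJPY from the USDCHF and USDJPY legs."""
    # Selling CHFJPY: buy USDCHF at the offer, sell USDJPY at the bid.
    bid = usdjpy_bid / usdchf_ask
    # Buying CHFJPY: sell USDCHF at the bid, buy USDJPY at the offer.
    ask = usdjpy_ask / usdchf_bid
    return bid, ask


# Hypothetical quotes: USDCHF 0.9840/43 and USDJPY 98.94/96.
bid, ask = chfjpy_from_legs(0.9840, 0.9843, 98.94, 98.96)
print(f"Implied CHFJPY: {bid:.3f}/{ask:.3f}")
```

The implied spread is the combination of the two leg spreads, which in liquid conditions may still be tighter than the price quoted directly in the cross.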

Gaps and Spikes

Although FX is by far the most liquid market, there are still times when no prices are recorded for periods of time, particularly during the less liquid Asian session and, as we have seen above, particularly in the less liquid crosses.  This means that not only do genuine gaps occur in historic data, but there are also often times when a certain pair traded on one electronic platform, but not on others.  These gaps in the price data need to be ‘filled’, which can be done using a simple algorithm, otherwise any indicator, even a simple moving average, would have an input of zero for the price at that time, which would of course create a hugely incorrect reading, which may well trigger an erroneous trading signal in an historic simulation.
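One simple way to fill such gaps, and only one of several possible choices, is to carry the last known price forward into each missing interval.  The sketch below assumes the data is a time-sorted list of (timestamp, close) pairs with missing bars simply absent; the function name and sample prices are hypothetical:

```python
from datetime import datetime, timedelta


def fill_gaps(bars, step):
    """Carry the last close forward so there is one bar per `step`.

    bars: list of (datetime, close) tuples sorted by time, gaps absent.
    """
    if not bars:
        return []
    filled = [bars[0]]
    for ts, close in bars[1:]:
        prev_ts, prev_close = filled[-1]
        # Insert a synthetic bar for every interval with no recorded price.
        while prev_ts + step < ts:
            prev_ts += step
            filled.append((prev_ts, prev_close))
        filled.append((ts, close))
    return filled


# Hypothetical hourly closes with 02:00 and 03:00 missing.
bars = [
    (datetime(2013, 1, 7, 0), 100.10),
    (datetime(2013, 1, 7, 1), 100.20),
    (datetime(2013, 1, 7, 4), 100.50),
]
for ts, close in fill_gaps(bars, timedelta(hours=1)):
    print(ts, close)
```

Carrying the last price forward avoids feeding a zero into an indicator, at the cost of showing a flat market where there was in fact no market at all.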

Conversely, not only are there times when there is no price, there are times when a spike in the data appears:

This can be due to a number of factors, but often where somebody has left an offer to take profit at, for example, 1.2580 overnight.  If somebody else has a stop order to buy if 1.2520 is traded and there are no other prices in the system until the 1.2580 offer, then that would be the next price dealt.  It’s market practice to cancel these deals the following morning, when an obviously ‘off market’ rate was traded, but nonetheless, it will still often appear in the historical data made available and there is a ‘grey’ area where it is questionable whether the rate dealt was ‘off market’, or fair given the time of day and liquidity.

One of the challenges of using a simple algorithm to clean the data is that a genuine market move can look a lot like a ‘spike’ in a fast market, when some news or economic data has just been released.  A way around this is to confirm the rate via the components.  For example, a ‘spike’ in AUDJPY could be checked by looking at AUDUSD and USDJPY at the same time.
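A minimal sketch of that cross-check follows.  The tolerance of 0.2pct and the sample rates are arbitrary, hypothetical values, and in practice the threshold would need tuning to the pair and the time of day:

```python
def is_suspect_spike(audjpy, audusd, usdjpy, tolerance=0.002):
    """Flag an AUDJPY print that strays too far from its components.

    tolerance is the permitted relative deviation from the implied
    rate (0.002 = 0.2pct), a purely illustrative default.
    """
    implied = audusd * usdjpy
    return abs(audjpy - implied) / implied > tolerance


# Hypothetical prints with AUDUSD at 0.9000 and USDJPY at 100.00,
# which imply an AUDJPY rate of about 90.00.
print(is_suspect_spike(90.05, 0.9000, 100.00))  # False
print(is_suspect_spike(91.00, 0.9000, 100.00))  # True
```

If the components moved as well, the print is more likely a genuine fast market than an ‘off market’ rate.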

Highs and Lows

One of the most commonly asked questions in FX trading is where the highs and lows were, as this is where queries occur and money is lost and made on orders.  If an order to buy was placed at 0.9840 and the low was 0.9839 offered, then the order would be filled.  If the low price quoted was 0.9840/43 but was never traded, or ‘given’ at 0.9840, then the order would not have been filled. As it’s often hard enough to determine in a real trading situation whether an order should have been filled, it’s impossible to be certain with a historic simulation.  In fact, if a large buy order had been placed at 0.9840, this could affect the price action itself, with market makers buying ahead of the 0.9840 bid, knowing the market will be supported there.

With the market so fragmented, and with no central exchange to determine the definitive highs, lows and the volume they traded in, order fills remain a cause of much debate, on a daily basis, in the FX market.
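In a historic simulation, one cautious way to handle this uncertainty is to treat a buy order as filled only when the recorded low is below the order level, not merely equal to it.  The sketch below is an assumption about how a back test might encode that rule, not a statement of how any broker actually fills orders:

```python
def buy_limit_filled(limit, bar_low, require_trade_through=True):
    """Decide whether a resting buy order is treated as filled in a back test.

    With require_trade_through=True the low must be below the limit,
    which is the more conservative assumption.  With False, a low
    equal to the limit also counts as a fill.
    """
    if require_trade_through:
        return bar_low < limit
    return bar_low <= limit


print(buy_limit_filled(0.9840, 0.9839))  # True
print(buy_limit_filled(0.9840, 0.9840))  # False
print(buy_limit_filled(0.9840, 0.9840, require_trade_through=False))  # True
```

Running the same simulation under both settings gives a feel for how sensitive the results are to fills at the extremes.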

Predictive Pricing

As there is no central price for a currency pair, a bank or broker is free to make whatever price it wants to their customers and the customer is equally free to trade on that price, or trade elsewhere.

Some traders are very predictable in their trading behaviour and only trade with one counterparty.  This leaves them open to ‘predictive pricing’ algorithms.  For example, if some traders sold USDJPY earlier in the day, then it’s likely that their next trade in USDJPY will be to cover that position and buy.  Some ECNs therefore have the ability to show each customer a different price.

Therefore while a neutral price in USDJPY may be 98.94/96, one customer’s ECN might show a price always marked a point higher at 98.95/97, until they have closed their short position, when it will then go back to a neutral price, earning the bank an extra pip on that trade and the customer believing he’s being shown a relatively tight two point price all day.

The author has first hand experience of such pricing engines, with one of his former colleagues having built just such an engine, for a first tier investment bank. It’s a perfectly legitimate practice, as the customer has the freedom to trade on the price, or not, but the phrase, ‘Caveat Emptor’, is just as true in today’s FX market, as it was in Roman times, when the phrase was first coined:

Consider a system which generated a trading signal in USDJPY just once a day, for 252 trading days a year.  Giving one point away on each trade may well result in an otherwise profitable system recording a net loss.  Without knowing why the losses were occurring, the trader may believe a perfectly robust system was no longer performing and, even worse, if he were to run a simulation on that year’s data, he might see that he should have made a profit, still not knowing where the 252pt ‘loss’ was made.  This highlights how critical efficient execution is, no matter how robust the back testing and how clean and reliable the historical data; something that we’ll look into in much more depth in a future article.
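The arithmetic is simple enough to sketch.  The gross figure of 200 points is a hypothetical simulated result chosen for illustration; only the 252 trades and the one point per trade come from the example above:

```python
TRADES_PER_YEAR = 252          # one signal per trading day
POINTS_GIVEN_AWAY = 1.0        # execution cost per trade, in points


def net_points(gross_points, trades=TRADES_PER_YEAR, cost=POINTS_GIVEN_AWAY):
    """Simulated gross result less the points given away on execution."""
    return gross_points - trades * cost


# A hypothetical system that the simulation says made 200 points.
print(net_points(200))  # -52.0
```

A simulation run on neutral prices would report the 200 point gain, while the account would show the 52 point loss.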

Interest Rates

One extraneous factor we have to take into account, when dealing with FX, which Futures traders do not have to account for, is the interest rate differential.  As each currency yields a certain rate of interest, then one earns interest in the purchased (long) currency and pays interest in the sold (short) currency.  This means that if a position was held long “Kiwi Yen” (NZDJPY) then the interest rate, or ‘carry’ would be approximately 3pct per annum, at current rates.  That is to say, if the exchange rate and interest rates remained the same in one year’s time, then the trade would yield a 3pct return, being the interest rate differential earned by holding the New Zealand Dollar vs. that paid borrowing Japanese Yen.

That difference is accounted for by ‘rolling’ the position every night, or ‘tom/next’ as it’s called.  When a trade is rolled, it’s closed out at an agreed rate at the end of the day (called a ‘reval.’ being short for ‘revaluation’) and re-instated with a small adjustment made in the price, to account for the roll (the difference in the interest rates).

For an intra-day trading system, this isn’t a factor; if the positions are flat overnight, then there is no ‘roll’.  For a longer term trading system, which holds trades overnight, then the interest rate differential has to be taken into consideration, to correctly calculate the results.  With some historic interest rate differentials being very large, this can make a dramatic difference, and again be the difference between a system being profitable or otherwise, hence the ‘carry trades’ which seek to exploit exactly those differentials.

However, although the central bank rates may be fixed and known, the counterparty will usually charge a small mark-up on the ‘tom/next’.  Sometimes, this can be as much as several percent.  Therefore it’s important to know both the interest rates and the mark-up from the broker, to negotiate them as low as possible and factor them into any simulations.
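A simplified sketch of how the carry might be factored into a simulation is shown below.  It assumes simple (non-compounded) interest, a 365 day count and interest expressed in the currency of the notional; the 3.5pct and 0.5pct rates are hypothetical values chosen to give the 3pct differential, and the mark-ups are likewise invented:

```python
def carry(notional, rate_long, rate_short, markup=0.0, nights=1, day_count=365):
    """Approximate interest earned (or paid, if negative) on a held position.

    rate_long:  annual rate earned on the purchased currency, e.g. 0.035
    rate_short: annual rate paid on the sold currency, e.g. 0.005
    markup:     annual mark-up charged by the counterparty on the roll
    """
    net_rate = rate_long - rate_short - markup
    return notional * net_rate * nights / day_count


# Hypothetical 1,000,000 position with a 3pct differential.
print(round(carry(1_000_000, 0.035, 0.005), 2))               # one night, no mark-up
print(round(carry(1_000_000, 0.035, 0.005, markup=0.01), 2))  # with a 1pct mark-up
print(round(carry(1_000_000, 0.035, 0.005, markup=0.04), 2))  # mark-up exceeds the carry
```

The last case shows how a large enough mark-up turns a positive carry into a cost, which is why the mark-up belongs in the simulation alongside the interest rates themselves.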

Time Zones

Probably the most overlooked factor when dealing with FX data is that Europe, the US and Asia, all operate on different time zones.  If we wanted to code an ‘opening range break out’ system for the London open, which is one of the most liquid times of day, then we should use local time in London and not GMT.

Although most data is provided in GMT, traders and therefore market behaviour, operate on local time, so daylight savings need to be taken into account.  Unfortunately the US has slightly different dates when they observe DST and most of Asia doesn’t observe daylight savings at all, so it’s impossible to make one universal adjustment for local time across all the sessions and days of the year.

Therefore one either has to adjust the data, to local time, for the session one is interested in trading, or write an adjustment into the code, dependent on both the time of day, and date that the order is being executed.  For example, if the data is in GMT, then closing a position at the close of the day in London, at 5pm, would still be 5pm in November as local time is GMT.  However, if the same trade were done in June, closing at 5pm local time in London, would be 4pm according to the time stamp of the data, as Daylight Savings would have been in effect.
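The second approach, writing the adjustment into the code, can be sketched with Python’s standard `zoneinfo` module, which applies the daylight savings rules for a named time zone.  The function name and the dates are hypothetical, and the sketch assumes time zone data is available on the machine running it:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")


def london_time_in_gmt(year, month, day, hour=17):
    """Return the GMT timestamp matching a given local hour in London."""
    local = datetime(year, month, day, hour, tzinfo=LONDON)
    return local.astimezone(timezone.utc)


# 5pm London close on a winter date and on a summer date.
print(london_time_in_gmt(2023, 11, 15).isoformat())  # 2023-11-15T17:00:00+00:00
print(london_time_in_gmt(2023, 6, 15).isoformat())   # 2023-06-15T16:00:00+00:00
```

The same function with a different zone name, for example "America/New_York", would handle the different US changeover dates without any hand-maintained table.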

Synthetic Prices

We know that the major currency pairs, being those traded against the Dollar and the Euro, are the most liquid and have the better pricing.  Therefore, if we had the data for those, then we could derive the ‘cross rates’, such as CHFJPY, GBPCAD etc.

The one challenge here of course is that if we had 60 minute OHLC (Open, High, Low, Close) data, for each hour of the day, and calculated the implied ‘crosses’, then we would only know the open and close accurately for those hours, as we have no way of knowing that the high or low of each component occurred (and almost certainly didn’t) at the same time, within the hour.

However, it’s certainly one viable method to create a reliable database.  If we had hourly data for the seven major currency pairs i.e. 8 currencies, then we could calculate a synthetic price from those ‘components’ for the other 21 ‘crosses’. For example, GBPJPY is the GBPUSD rate multiplied by the USDJPY rate etc.

This creates a relatively clean set of data for the crosses, but only a line chart, as the crosses would not contain accurate highs and lows. i.e. one could not plot a bar chart, which would require the Highs and Lows of each hour.
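The calculation can be sketched for a single timestamp as follows.  The choice of the eight currencies, the closing rates and the ordering used to name each cross are assumptions made for illustration; the ordering roughly follows market convention:

```python
from itertools import combinations

# Hypothetical closes for the seven USD majors at one timestamp.
MAJORS = {
    "EURUSD": 1.3340, "GBPUSD": 1.5500, "AUDUSD": 0.9000, "NZDUSD": 0.8000,
    "USDCAD": 1.0300, "USDCHF": 0.9840, "USDJPY": 98.95,
}

# Naming priority for the non-USD currencies: earlier is the base currency.
CURRENCIES = ["EUR", "GBP", "AUD", "NZD", "CAD", "CHF", "JPY"]


def usd_value(ccy, majors):
    """Value of one unit of ccy in USD, whichever way round the major is quoted."""
    if ccy + "USD" in majors:
        return majors[ccy + "USD"]
    return 1.0 / majors["USD" + ccy]


def synthetic_crosses(majors):
    """Derive every non-USD cross from the majors (closes only)."""
    return {
        base + quote: usd_value(base, majors) / usd_value(quote, majors)
        for base, quote in combinations(CURRENCIES, 2)
    }


crosses = synthetic_crosses(MAJORS)
print(len(crosses))  # 21
print(round(crosses["GBPJPY"], 2), round(crosses["CHFJPY"], 2))
```

Applying the same function to the open of each bar gives the synthetic open; applying it to the highs and lows would not be valid, for the reason given above.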


Historic FX data is an absolute prerequisite, before even attempting to build a robust FX trading system, but it can only ever be an approximation, unlike the futures markets, which have a central exchange and no interest rates or ‘rolls’ to take into account and where the data is almost always supplied in local exchange time.

The FX market is simply too fragmented to have one universally agreed set of historic data and the trend is for the market to become more fragmented and not less so, with new electronic platforms being released each year, some carving a niche in certain currency pairs, or time zones. 

Historic price data before the dominance of ECNs is much less accurate than more recent data and is, at best, an average rate traded for a certain time period.  Accurate Open, High, Low and Close (OHLC) data simply cannot, and does not, exist before the days of ECNs and since then (approximately late 1990s onwards) it is far more accurate and more readily available, but can still only be an approximation. (Daily data is much more accurate as the ‘OHLC’ rates for a given currency pair on a certain date are generally agreed, particularly for the ‘majors’).

As an algorithmic FX trader, the best solution is to find a good source of data and then ‘clean’ it as much as possible, cross referencing the crosses and majors, filling in any gaps and cleaning out any spikes.  Then the data must either be offset to account for ‘daylight savings’ in the time zone one is interested in trading, if the system has any time input, or it can be written into the code of a system itself.

Finally, if it’s a system that holds many positions for a number of days, or frequently overnight, then the rolls must be factored into the simulations.  There are many pieces of software available for analysing futures markets that can be adapted for FX data but none ‘off the shelf’ to date provide, as far as the author is aware, the functionality required to account for these unique nuances of FX.

All of these challenges probably contribute to the relative lack of successful systematic traders in FX, given its huge liquidity and clear capacity for systematic trading.

Better software and data will certainly be more readily available in the future, as FX continues to grow as an investment class.   The author himself is currently involved in beta-testing a number of software packages and working with one software company to provide the unique functionality needed to test FX systems, ‘off the shelf’, so it’s certainly something that will be available, in the near future.

Caspar Marney