How we migrated 1.8 billion rows from MongoDB to Timescale with zero downtime.

Luke describes how we managed our migration from MongoDB to Timescale with no downtime and the issues we faced while doing so.


Overview

So in this blog we're going to go through the following:

  • Deciding on a database
    • Which one meets our needs and why
    • How I picked the wrong one, started migrating and then broke everything
    • Proceeding to go with a database that is far superior, and finishing migration.
  • The downfalls I noticed as we started migration
    • How storage almost became an issue
    • How I configured the database to remedy a lot of issues.
    • How the migration went
  • Benchmarks between Mongo and Timescale
    • Getting bot data
    • Getting historical bot data
    • How we store it
  • The new fetcher and how much more reliable it is

Anyway, let's get into the nitty gritty.

Deciding on a database

When I went looking at different databases, I specifically looked for time-series databases. Due to the nature of our data, I wanted to make sure it was stored in the most efficient way possible.

This led me to initially look at InfluxDB, Timescale and VictoriaMetrics. I first looked into InfluxDB and did a migration run on it, only for it to break.

This was because InfluxDB, after about 120,000,000 rows (out of the total of 2.2 billion we needed to insert), would basically give up because it ran out of memory. After trying to configure it in a way that wouldn't fail, I decided it was time to look at Timescale instead, as its docs were significantly better than VictoriaMetrics' and it's built on PostgreSQL, which is where our primary entity storage for bots already lives.


Starting the migration

So, the migration started with basically no hitches once I'd updated the migration script to insert into Timescale rather than Influx.

It gave me an estimate of around 13 days. This wasn't ideal, but we were inserting 1,000 rows per query, and each bot needed around 70 queries on average to insert all of its data. Eventually it finished, having only taken around 8 days. This was nice. Though we then ran into our next issue... storage use.
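
For context, the batched inserts looked roughly like this. This is a sketch rather than the actual migration script, and the table and column names (counts, bot_id, server_count) are assumptions:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Insert up to 1,000 rows in a single multi-row INSERT statement.
async function insertCountBatch(rows: { botId: string; time: Date; serverCount: number }[]) {
  const values: unknown[] = [];
  const placeholders = rows.map((row, i) => {
    values.push(row.botId, row.time, row.serverCount);
    const n = i * 3;
    return `($${n + 1}, $${n + 2}, $${n + 3})`;
  });

  await pool.query(
    `INSERT INTO counts (bot_id, time, server_count) VALUES ${placeholders.join(", ")}`,
    values
  );
}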

As of writing this, it took 8 days to migrate:

  • 903,860,360 counts
  • 839,522,573 rankings

That's because we only migrated non-deleted bots. This is the count so far, and it's growing every hour.


Timescale and compression

First I need to explain something: we essentially doubled our row count, as we're now storing ranking values and count values in different tables. This is because rankings only go back to when we first wrote DBL Statistics as you know it now, due to Marco not having stored ranking changes.
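
For a rough idea of the shape of those two tables (illustrative column names only, not our exact schema), both are plain Timescale hypertables partitioned on time:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Counts and rankings live in separate hypertables.
async function createTables() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS counts (
      time         TIMESTAMPTZ NOT NULL,
      bot_id       BIGINT      NOT NULL,
      server_count INTEGER
    )`);
  await pool.query(`SELECT create_hypertable('counts', 'time', if_not_exists => TRUE)`);

  await pool.query(`
    CREATE TABLE IF NOT EXISTS rankings (
      time   TIMESTAMPTZ NOT NULL,
      bot_id BIGINT      NOT NULL,
      rank   INTEGER
    )`);
  await pool.query(`SELECT create_hypertable('rankings', 'time', if_not_exists => TRUE)`);
}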

Once the migration had finished I noticed that the storage use from Timescale was rather bad... This was going to be a major issue if we wanted to continue adding data to it.

luke@topstats ~/topstats/timescale $ du -sh ./
525G    .

So after some discussion I decided to look into Timescale a bit more. I found that there was a way to not only compress data, but to compress it on an interval (there's a rough sketch of the setup after the list). I proceeded to configure it so that:

  • Data for 3 months is left uncompressed
  • Any data after 3 months is compressed
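
A minimal sketch of that setup, assuming the hypertables above and segmenting compressed chunks by bot_id (on older Timescale versions the policy function is add_compress_chunks_policy instead):

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function enableCompression(table: string) {
  // Turn on native compression, grouping compressed chunks per bot
  // and ordering rows by time so per-bot reads stay cheap.
  await pool.query(`
    ALTER TABLE ${table} SET (
      timescaledb.compress,
      timescaledb.compress_segmentby = 'bot_id',
      timescaledb.compress_orderby   = 'time DESC'
    )`);

  // Compress chunks once they're older than 3 months; newer data stays raw.
  await pool.query(`SELECT add_compression_policy('${table}', INTERVAL '3 months')`);
}

await enableCompression("counts");
await enableCompression("rankings");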

This left us with an, I must say, insane improvement.

luke@topstats ~/topstats/timescale $ du -sh ./
88G   	.

That's more like it. Time to look into benchmarks.


Before the benchmarks, some setup

I created 2 materialized views.

  • One calculating the monthly values for each month for each bot.
  • One calculating the rank for each month.

These are set up as continuous aggregates that refresh each hour after a fetch to keep them up to date.
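
The counts view looks roughly like this (a sketch using the assumed schema from earlier; the rank view follows the same pattern). Note that older Timescale versions only allow fixed-width buckets in continuous aggregates, so '1 month' may need to be something like '30 days':

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Continuous aggregate: one row per bot per month.
await pool.query(`
  CREATE MATERIALIZED VIEW counts_monthly
  WITH (timescaledb.continuous) AS
  SELECT bot_id,
         time_bucket(INTERVAL '1 month', time) AS month,
         max(server_count) AS server_count
  FROM counts
  GROUP BY bot_id, month
  WITH NO DATA`);

// Keep it refreshed every hour, shortly after each fetch.
await pool.query(`
  SELECT add_continuous_aggregate_policy('counts_monthly',
    start_offset      => INTERVAL '3 months',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour')`);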

Right. Time for the benchmarks.

Fetching 3 months of data
This is a good test, since this is where the compression interval ends.

  • Mongo: 8.49s
  • Timescale: 193ms

Fetching 6 months of data
See how Timescale competes even with decompression as a handicap.

  • Mongo: 10.6s
  • Timescale: 470ms

Fetching 3 months of data for all bots
Note: Worth mentioning that the only reason it's close is that Timescale has to decompress all the compressed data to read it.

  • Mongo: 68.3s
  • Timescale: 42.2s

Fetching all of a bot's historical data
Note: This is purely down to how the aggregates work in Timescale with our views (sketched below). Mongo has indexes, but they're not nearly as efficient.

  • Mongo: 12.2s
  • Timescale: 150ms
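
For that last one, the query is basically just reading the pre-computed monthly rows out of the continuous aggregate rather than scanning raw data, along these lines (again using the assumed view name from earlier):

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Full monthly history for one bot, served from the continuous aggregate.
async function monthlyHistory(botId: string) {
  const { rows } = await pool.query(
    `SELECT month, server_count
     FROM counts_monthly
     WHERE bot_id = $1
     ORDER BY month`,
    [botId]
  );
  return rows;
}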

Honestly, I think with the query times we're getting here, we're going to be good with Timescale for a very, very long time.


The new fetcher, what I learnt from my past fetcher and where it's better.

The way we fetch data for the bots is by making an API request to Top.gg, pulling every single bot from their site and storing the useful analytical data we need for displaying graphs and such on our site.

The old fetcher was written in such a way that it was multi-threaded and efficient enough, but it wouldn't handle the API going down very well, at times knocking out the entire system and causing missed data and more.

One of the things I wanted to implement in this fetcher was reliable retries and ratelimit handling. Both go hand in hand, but make it significantly more reliable.

I wrote the fetcher with a central websocket as a single point of contact to tell the processes when to proceed with fetching. This allows the fetcher to make a request, then wait the needed time before making the next one, abiding by the rate limits set in place by Top.gg.

Top.gg has a nasty habit of serving up 502 errors. If we receive one of these we wait 5s and try again. This has increased stability tenfold and has also increased the accuracy of our data.
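
Roughly, the retry logic looks something like this (a simplified sketch; the endpoint, token handling and the 429 branch are placeholders rather than the exact fetcher code):

// Wait helper used between retries.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url: string, token: string): Promise<unknown> {
  for (;;) {
    const res = await fetch(url, { headers: { Authorization: token } });

    if (res.status === 502) {
      // Top.gg served up a 502: wait 5s and try again.
      await sleep(5_000);
      continue;
    }

    if (res.status === 429) {
      // Rate limited: honour Retry-After before the next attempt.
      await sleep(Number(res.headers.get("retry-after") ?? "5") * 1000);
      continue;
    }

    if (!res.ok) throw new Error(`Unexpected ${res.status} from ${url}`);
    return res.json();
  }
}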

The scaler now also scales completely automatically, making the whole thing maintenance-free unless Top.gg updates their API in some way.


That's it

That's how DBL Statistics went from MongoDB, with over 181GB of disk space used, to Timescale, where it uses 88GB and is significantly faster.

Anyway,

Thanks for reading about the last (at least) two weeks of my life.

Fun fact: I did all this and rode my motorcycle 680 miles to Germany from the UK whilst the migration happened.

Have a good day!