Build your own bot to harvest market data

Trading algorithms that use machine learning require a lot of market data. In the cryptocurrency space, you can download a few thousand lines of market data from the exchanges via their APIs, or you can find some free datasets on the web. But neither source provides enough data to give machine learning models a good chance of performing well.

How can I get A LOT MORE cryptocurrency historical data for my machine learning modelling?

One answer, which avoids spending a lot of money buying data, is to create your own datastore and harvest the data you need from the exchanges over time.

This post outlines, at a high level, how we built a simple data harvesting bot using Python and MongoDB. It assumes you have some knowledge of coding and are comfortable on the command line.

The code is available on GitHub.

Cryptodata — a data harvesting bot

This is a very simple bot that does just one job — no more, no less.

What it does

  1. The cryptodata bot wakes up every period (currently configured to be every 1 minute).
  2. It then collects the current market data from a number of exchanges at the same time using their REST APIs.
  3. The data collected is then saved to a MongoDB database in a fairly consistent structure.
  4. The bot then goes back to sleep until it wakes up again at the start of the next minute.
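The wake-and-sleep cycle above can be sketched in a few lines. This is a minimal illustration rather than the bot's actual code: the names PERIOD_SECONDS, seconds_until_next_period, run_forever and harvest are assumptions made for this example.

```python
import time

PERIOD_SECONDS = 60  # assumed to match the one-minute period described above


def seconds_until_next_period(now, period=PERIOD_SECONDS):
    """Return how long to sleep so the next wake-up lands on a period boundary."""
    return period - (now % period)


def run_forever(harvest, period=PERIOD_SECONDS):
    """Call harvest(timestamp) once per period, aligned to the start of each period."""
    while True:
        time.sleep(seconds_until_next_period(time.time(), period))
        harvest(time.time())
```

Sleeping until the next boundary, rather than for a fixed 60 seconds, stops the time spent collecting and saving data from slowly shifting the schedule.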

What data do we collect?

  • latest market prices
  • current open order books
  • recent trade history

From this we can then later construct candles, technical analysis indicators or do any other feature engineering we like. We do this by querying the MongoDB database.
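As an illustration of this kind of feature engineering, the sketch below groups a series of (unixtime, price) pairs into open/high/low/close candles. It assumes the prices have already been queried from the database and extracted from each stored response; that extraction depends on the exchange and is not shown. The function name build_candles and the five-minute default are choices made for this example.

```python
from collections import OrderedDict


def build_candles(ticks, period=300):
    """Group (unixtime, price) pairs into OHLC candles of `period` seconds."""
    candles = OrderedDict()
    for unixtime, price in sorted(ticks):
        start = int(unixtime // period) * period  # start of this candle's window
        candle = candles.get(start)
        if candle is None:
            # First price seen in this window opens the candle.
            candles[start] = {"start": start, "open": price, "high": price,
                              "low": price, "close": price}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # the latest price seen so far closes the candle
    return list(candles.values())
```

Windows that contain no data simply produce no candle, so gaps in the harvest stay visible instead of being filled in silently.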

Structuring the MongoDB database

MongoDB and other document databases are a great choice for data collection tasks because they enable you to easily store data from different sources that may not have a consistent structure.

Being less strict about the structure of your database of course means more work when it comes to querying the database and extracting the data later, but it saves a vast amount of time in getting started.

We use the mongoengine package in Python, as it helps provide more structure and validation around MongoDB. For this task we set up the database to store the ticker information, order book and recent trades in three different collections (the MongoDB equivalent of tables). Here is the code that defines the schema:

from mongoengine import Document, connect, FloatField, StringField, DictField, ListField
from config import DATABASE

connect(DATABASE, host='localhost', port=27017)


class Ticker(Document):
    """Document to store ticker information"""
    unixtime = FloatField(required=True)
    exchange = StringField(required=True, max_length=20)
    pair = StringField(required=True, max_length=10)
    content = DictField(required=True)


class Orderbook(Document):
    """Document to store orderbook information"""
    unixtime = FloatField(required=True)
    exchange = StringField(required=True, max_length=20)
    pair = StringField(required=True, max_length=10)
    content = DictField(required=True)


class Trades(Document):
    """Document to store recent trades information"""
    unixtime = FloatField(required=True)
    exchange = StringField(required=True, max_length=20)
    pair = StringField(required=True, max_length=10)
    content = ListField(required=True)
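To show how a response might be stored under this schema, here is a rough sketch of a helper that builds the fields shared by all three documents. The helper make_record and the sample values are hypothetical; the length check only mirrors the max_length limits in the schema so that problems surface before the save.

```python
import time


def make_record(exchange, pair, content, unixtime=None):
    """Build the keyword arguments shared by Ticker, Orderbook and Trades."""
    if len(exchange) > 20 or len(pair) > 10:
        raise ValueError("exchange or pair exceeds the schema's max_length")
    return {
        "unixtime": time.time() if unixtime is None else float(unixtime),
        "exchange": exchange,
        "pair": pair,
        "content": content,
    }


# With the schema imported and the Mongo server running, saving would look like:
# Ticker(**make_record("binance", "eth-btc", {"last": 0.07})).save()
```

Using one timestamp for every record collected in the same cycle makes it easier to join the three collections later.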

Some more technical details


You can run the cryptodata bot on a small machine or on a virtual machine in the cloud.

We have the bot running on a low-spec Ubuntu machine that is constantly connected to the internet. But anything that is very reliable and always connected to the internet can work.

Cloud hosting is also an option. Installing and running the cryptodata bot on a small virtual machine will do the job (a $5 to $10 per month VM from DigitalOcean, for example), but be aware that you may need to spend more on cloud disks as the datastore grows.

Installation and packages

We require the following software and packages to be installed:

  1. MongoDB installed and with the Mongo server running
  2. A Python 3.6 virtual environment for the bot with the packages you need (details are in the requirements.txt file)

Collecting the data

Each cryptocurrency market API works slightly differently and provides a slightly different response.

The Python API wrappers that you can use to connect to each of these exchanges also have different formats and responses.

This could add a lot of complexity to a bot that collects data from many exchanges. However, the cryptotik Python package tries to unify the requests and responses across cryptocurrency exchanges. It does a good job so far, although the coverage of exchanges is still a bit patchy.

For each exchange and pair, a single call collects the current ticker with api.get_market_ticker(pair), the order book with api.get_market_orders(pair), or the recent trades with api.get_market_trade_history(pair).
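The three calls can be wrapped into one snapshot per exchange and pair, roughly as follows. This sketch only assumes an api object that exposes the three methods named above. FakeExchange is a hypothetical stand-in with made-up responses so the example runs offline; it is not part of cryptotik, and real responses will look different.

```python
import time


def collect_snapshot(api, exchange, pair):
    """Fetch ticker, order book and recent trades for one pair from one exchange."""
    return {
        "unixtime": time.time(),
        "exchange": exchange,
        "pair": pair,
        "ticker": api.get_market_ticker(pair),
        "orderbook": api.get_market_orders(pair),
        "trades": api.get_market_trade_history(pair),
    }


class FakeExchange:
    """Hypothetical stand-in for an exchange wrapper, used only to run the sketch."""

    def get_market_ticker(self, pair):
        return {"last": 0.07}

    def get_market_orders(self, pair):
        return {"bids": [[0.069, 1.0]], "asks": [[0.071, 2.0]]}

    def get_market_trade_history(self, pair):
        return [{"price": 0.07, "amount": 0.5}]


snapshot = collect_snapshot(FakeExchange(), "fake", "eth-btc")
```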

You can find out more about the cryptotik package and support its development.


As the bot requests data from multiple exchanges, the requests need to run in parallel in Python. Otherwise there is a lag of a few seconds between the different pieces of data collected at each time point, which is probably not acceptable in fast-moving markets.

Using threading in Python, it is pretty simple to parallelise the process so that the bot sends the requests to all the exchanges at the same time. The bot then waits for the responses from the exchanges, and has plenty of time afterwards to process these responses and save them to the datastore before going back to sleep.
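One way to send the requests at the same time is a thread pool from the standard library, sketched below. This is an assumption about how it could be done, not necessarily how the bot does it. The names fetch_all and slow_fetch are invented for this example, and slow_fetch only simulates network latency with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetchers):
    """Run every fetcher in its own thread and return results keyed by exchange name."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        # Submit everything first so all requests are in flight together.
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        results = {}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:  # one failing exchange should not stop the rest
                results[name] = exc
        return results


def slow_fetch(name, delay=0.2):
    """Hypothetical fetcher that simulates a slow exchange API."""
    def fetch():
        time.sleep(delay)
        return {"exchange": name}
    return fetch


started = time.monotonic()
results = fetch_all({name: slow_fetch(name) for name in ("a", "b", "c")})
elapsed = time.monotonic() - started
```

Because the three simulated requests overlap, the total wait is close to the slowest single request rather than the sum of all three. Threads suit this job because the work is dominated by waiting on the network.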

How much data might we get?

Running just for 2–3 weeks, the bot has been harvesting data from five exchanges every minute. This means that we now have well over 100,000 data rows for analysis, and well over 3GB of data to start mining.
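As a back-of-the-envelope check on the row count, the arithmetic can be written out as below. It assumes one pair per exchange, one row per exchange per minute in each collection, and no missed minutes; the bot's real configuration may differ.

```python
EXCHANGES = 5
SNAPSHOTS_PER_DAY = 24 * 60  # one snapshot per minute


def expected_rows(days, exchanges=EXCHANGES):
    """Rows per collection, assuming one pair per exchange and no missed minutes."""
    return exchanges * SNAPSHOTS_PER_DAY * days


two_weeks = expected_rows(14)    # 100800
three_weeks = expected_rows(21)  # 151200
```

Under these assumptions, two to three weeks of harvesting gives roughly 100,000 to 150,000 rows in each of the three collections.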

GitHub repository

The full code is available on GitHub.

If you want to make improvements to this package or to the cryptotik package, please do. Pull requests are welcomed.
