Intro and Motivations

Why Feat?

Research and prototyping are often performed in scripting languages such as Python and R, but these languages are designed for fast iteration, not for high-performance data ingestion and processing. Processing large quantities of data, such as market microstructure data, is a bottleneck that slows down both research and production predictions, because generating the final samples from scratch can take hours. While many vendors provide aggregated price and volume data in the form of traditional OHLC candlestick bars, these so-called “time bars” are known to have less desirable statistical properties than some of the alternatives.

Additionally, many vendors who provide programmatic access to the source data offer only limited history. Feat is designed to always capture the latest data, so that aged-out data remains available even when the vendor’s window of available history shifts forward. It supports pulling such data to a local machine through data providers’ APIs, as well as further refining it into samples useful for analysis.

Feat can also be configured to work on data from providers it has not been precoded to support, such as a historical tick data CSV with arbitrary field names. In the future, we want to expand Feat to rapidly generate data that is useful for forecasting models but can be slow to compute, such as autocorrelation, bubble tests, triple barrier labels, and more.

Benchmarks

Fast is easy to claim. It’s harder to back up with data.

Here is one example of Feat generating dollar bars from a BTCUSDT Binance trade data CSV from CryptoArchive. This test was executed on a t3a.xlarge instance (4 vCPUs and 16 GB of memory) on AWS, with the source file on the standard root block device (EBS). A new bar is emitted for every $7,000,000 of traded value.

$ feat bars dollar \
    BTCUSDT \
    --timestamp_index 3 \
    --last_index 1 \
    --volume_index 2 \
    --delimiter "|"
INFO feat::bars: Processing ticks into bars out_dir_path="bars/BTCUSDT" in_dir_path="ticks/BTCUSDT" symbol="BTCUSDT"
INFO feat::bars: Sampling dollar bars out_file="bars/BTCUSDT/dollar-2021-09-12-17-11-39.csv"
INFO feat: Finished all seconds=399

The source file is about 43 GB, and Feat takes 399 seconds (about 6.7 minutes) to produce the dollar bars. For comparison, running wc -l on the same tick file takes ~4.5 minutes.
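
For readers unfamiliar with dollar bars, the idea behind the command above can be sketched in a few lines of Python. This is only an illustration of the sampling rule, not Feat's implementation: the function name dollar_bars, the Bar fields, the sample rows, and the choice to drop a trailing partial bar are all assumptions made for the example. The column positions mirror the flags in the command (price in column 1, volume in column 2, timestamp in column 3) and are treated as zero-based here, which is also an assumption.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float
    dollars: float  # cumulative price * volume inside the bar
    end_ts: str     # timestamp of the tick that closed the bar


def dollar_bars(rows, threshold, ts_idx, price_idx, vol_idx):
    """Yield a Bar each time cumulative traded value reaches `threshold`.

    Hypothetical helper for illustration; indices are zero-based.
    """
    bar = None
    for row in rows:
        price = float(row[price_idx])
        volume = float(row[vol_idx])
        if bar is None:
            # The first tick after a closed bar opens the next one.
            bar = Bar(price, price, price, price, 0.0, 0.0, row[ts_idx])
        bar.high = max(bar.high, price)
        bar.low = min(bar.low, price)
        bar.close = price
        bar.volume += volume
        bar.dollars += price * volume
        bar.end_ts = row[ts_idx]
        if bar.dollars >= threshold:
            yield bar
            bar = None
    # A trailing partial bar is dropped in this sketch.


# Made-up ticks in a "|"-delimited layout: id|price|volume|timestamp
sample = io.StringIO(
    "1|100.0|2.0|1000\n"
    "2|101.0|3.0|1001\n"
    "3|99.0|1.0|1002\n"
    "4|102.0|5.0|1003\n"
    "5|103.0|1.0|1004\n"
)
reader = csv.reader(sample, delimiter="|")
bars = list(dollar_bars(reader, threshold=500.0, ts_idx=3, price_idx=1, vol_idx=2))

for b in bars:
    print(b)
```

Because the input is consumed one row at a time, memory use stays flat regardless of file size. Unlike time bars, each bar closes after a fixed amount of traded value rather than a fixed amount of time, so busy periods produce more bars than quiet ones.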