A single-store feature store from Splice Machine
Dual-store architecture is not the only viable option for a feature store. You may rely on one HTAP database to rule them all.
A feature store consists of at least two parts: an online store for low-latency online serving, and an offline store for large-scale offline training or inference.
Usually, the online and offline stores live in different data systems. But Splice Machine presents a new option: one HTAP database to rule them all.
Most feature stores today share a similar dual-store architecture: a KV store as the online store, and a data warehouse or data lake as the offline store, as shown below. For example, Uber picks Cassandra and Hive; Tecton picks DynamoDB and S3; Feast defines the online/offline store protocol and allows users to choose/implement whatever store they want.
Dual-store architecture is good because each store is good at its own job. However, keeping data in sync between two very different stores is challenging. In Uber's case, batch features added to Hive are copied to Cassandra, and real-time features added to Cassandra are ETL-ed to Hive. This all happens automatically, but only thanks to a great deal of difficult software engineering.
Even if your team manages to get it right, you now have a 3x more complex infrastructure to maintain, as you need a compute engine (such as Spark) to move data and a workflow orchestrator (such as Airflow) to schedule the data moving jobs.
Even worse, data governance in the system becomes very hard given so many moving parts.
Given the downsides of the dual-store architecture, Splice Machine takes a different path: a single store rather than two, as shown below.
While it eliminates the need to sync data, the team runs into a new challenge: how to build a store that serves both low-latency lookups and OLAP queries?
The Splice Machine team is well placed to answer this, as their DB product (also named Splice Machine) has offered HTAP (Hybrid Transactional/Analytical Processing) capabilities to customers since 2017. In 2021, they built a feature store largely on top of that HTAP DB. Below are some design highlights.
A cost-based optimizer sits between SQL queries and the execution backends. It chooses the backend based on the query workload: HBase for key-based lookups, Spark for OLAP queries.
Two types of tables support the feature store: feature set and feature set history.
A feature set table stores the latest values of a feature set, and it serves online feature retrieval. Schema:
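To make the routing idea concrete, here is a deliberately simplified sketch. It is not Splice Machine's optimizer: the real one is cost-based, while this toy only pattern-matches a primary-key equality lookup. The function name `choose_backend`, the regex, and the table names in the usage lines are all assumptions made for illustration.

```python
import re

# Toy stand-in for the optimizer's decision. The real optimizer is cost-based;
# this one only checks whether the whole query is a primary-key equality lookup.
KEY_LOOKUP = re.compile(
    r"^\s*select\s+.+\s+from\s+\w+\s+where\s+primary_key\s*=\s*\S+\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def choose_backend(sql: str) -> str:
    """Return "hbase" for key-based lookups and "spark" for everything else."""
    return "hbase" if KEY_LOOKUP.match(sql) else "spark"


lookup = choose_backend("SELECT * FROM feature_set WHERE primary_key = 42")
scan = choose_backend(
    "SELECT primary_key, AVG(feature_1) FROM feature_set_history GROUP BY primary_key"
)
```

Here `lookup` is `"hbase"` and `scan` is `"spark"`. A real optimizer would look at statistics and estimated cost rather than the query text, so treat this only as a picture of the dispatch step.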
primary_key, last_update_ts, feature_1, feature_2, ...
In contrast, a feature set history table stores historical values of a feature set, and it serves offline training and inference. Schema:
primary_key, asof_ts, until_ts, feature_1, feature_2, ...
A feature set history table is a CDC (Change Data Capture) table of the corresponding feature set table. Every time there is an INSERT or UPDATE to the feature set table, a DB trigger fires and INSERTs a record into the feature set history table, where asof_ts = feature_set.last_update_ts and until_ts = NOW().
The most innovative part of the feature store is how it runs predictions. Here is a typical prediction workflow:
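The trigger mechanism can be sketched with SQLite from Python's standard library. This is a minimal illustration, not Splice Machine's DDL. It assumes integer timestamps, two features, and a trigger named `feature_set_to_history`. It handles only UPDATE, archiving the outgoing values, and it uses the new row's `last_update_ts` as a deterministic stand-in for NOW().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_set (
    primary_key    INTEGER PRIMARY KEY,
    last_update_ts INTEGER NOT NULL,
    feature_1      REAL,
    feature_2      REAL
);
CREATE TABLE feature_set_history (
    primary_key INTEGER NOT NULL,
    asof_ts     INTEGER NOT NULL,
    until_ts    INTEGER NOT NULL,
    feature_1   REAL,
    feature_2   REAL
);
-- When a row is overwritten, archive the outgoing values with their validity
-- window. NEW.last_update_ts stands in for NOW() (assumption of this sketch).
CREATE TRIGGER feature_set_to_history
AFTER UPDATE ON feature_set
BEGIN
    INSERT INTO feature_set_history
    VALUES (OLD.primary_key, OLD.last_update_ts, NEW.last_update_ts,
            OLD.feature_1, OLD.feature_2);
END;
""")

conn.execute("INSERT INTO feature_set VALUES (1, 100, 0.5, 10.0)")
conn.execute(
    "UPDATE feature_set SET last_update_ts = 200, feature_1 = 0.7 WHERE primary_key = 1"
)
conn.execute(
    "UPDATE feature_set SET last_update_ts = 300, feature_2 = 12.0 WHERE primary_key = 1"
)

history = conn.execute(
    "SELECT primary_key, asof_ts, until_ts, feature_1, feature_2 "
    "FROM feature_set_history ORDER BY asof_ts"
).fetchall()
latest = conn.execute("SELECT * FROM feature_set").fetchall()
```

After the two updates, `latest` holds the single current row `(1, 300, 0.7, 12.0)`, while `history` holds the two superseded versions with windows `[100, 200)` and `[200, 300)`.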
- A program (could be a Spark job, or just a piece of Python code) gets features from a feature store.
- Feed the feature vector into a model served somewhere, which could be the program memory or a dedicated k8s pod in the case of Seldon Core.
- Wait for the prediction results.
- Save the (primary_key, feature_vector, prediction) tuple in another store.
Splice Machine does it differently, leveraging prediction tables. Schema:
model_id, primary_key, feature_1, feature_2, ..., prediction NULLABLE
Whenever predictions are needed, whether a point or batch prediction, the caller simply inserts (model_id, primary_key, feature_vector) rows into the table. The insert fires another DB trigger, which populates the prediction column.
According to the schema, each model in the table should expect the same set of features. Is that correct?
With both feature and prediction data in one store, feature and model governance have never been easier. Basically, you can answer the following questions with simple SQL queries:
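The insert-then-trigger flow can be imitated with SQLite and a Python function registered as a SQL function. Everything specific here is an assumption for illustration: the `predict` function is a hypothetical fixed linear score rather than a deployed model, and the trigger name `run_model` and model id `churn_v1` are made up.

```python
import sqlite3


def predict(model_id, feature_1, feature_2):
    """Hypothetical stand-in for a deployed model: a fixed linear score."""
    return 2.0 * feature_1 + 0.5 * feature_2


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL so the trigger can call it.
conn.create_function("predict", 3, predict)
conn.executescript("""
CREATE TABLE predictions (
    model_id    TEXT NOT NULL,
    primary_key INTEGER NOT NULL,
    feature_1   REAL,
    feature_2   REAL,
    prediction  REAL  -- nullable: filled in by the trigger
);
-- Every inserted feature vector gets its prediction populated in place.
CREATE TRIGGER run_model
AFTER INSERT ON predictions
BEGIN
    UPDATE predictions
    SET prediction = predict(NEW.model_id, NEW.feature_1, NEW.feature_2)
    WHERE rowid = NEW.rowid;
END;
""")

# A batch prediction is just a multi-row insert without the prediction column.
conn.executemany(
    "INSERT INTO predictions (model_id, primary_key, feature_1, feature_2) "
    "VALUES (?, ?, ?, ?)",
    [("churn_v1", 1, 1.0, 4.0), ("churn_v1", 2, 0.5, 2.0)],
)

rows = conn.execute(
    "SELECT primary_key, prediction FROM predictions ORDER BY primary_key"
).fetchall()
```

The caller never invokes the model directly: `rows` comes back as `[(1, 4.0), (2, 2.0)]`, and the features used for each prediction sit in the same row as the result.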
- Is a certain feature drifting? Just compare the statistics of the trained features vs. actual features.
- Is the model making reasonable predictions? Ditto.
- Given a feature, what models are using it? Check some metadata tables.
- Want to re-train the model? The training dataset has already been collected in the prediction tables, provided labels are backfilled properly.
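As a rough illustration of the drift question in the list above, the comparison can be a single query once training and serving features sit in the same database. The sketch below uses SQLite, with a hypothetical `training_features` table, a model id `m1`, and the mean as the only statistic. A real check would likely compare richer statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training_features (primary_key INTEGER, feature_1 REAL);
CREATE TABLE predictions (
    model_id TEXT, primary_key INTEGER, feature_1 REAL, prediction REAL
);
""")
conn.executemany(
    "INSERT INTO training_features VALUES (?, ?)",
    [(1, 1.0), (2, 2.0), (3, 3.0)],
)
conn.executemany(
    "INSERT INTO predictions VALUES ('m1', ?, ?, NULL)",
    [(4, 4.0), (5, 5.0), (6, 6.0)],
)

# Compare the feature's mean at training time with its mean at serving time.
train_mean, served_mean = conn.execute("""
    SELECT (SELECT AVG(feature_1) FROM training_features),
           (SELECT AVG(feature_1) FROM predictions WHERE model_id = 'm1')
""").fetchone()
mean_shift = served_mean - train_mean
```

With this toy data, `train_mean` is 2.0, `served_mean` is 5.0, and `mean_shift` is 3.0. How large a shift counts as drift is a judgment the source does not specify.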
The list can go on. See the post for more details.
Splice Machine Feature Store also implements time-travel queries (aka point-in-time joins) and feature backfill.
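A point-in-time join can be sketched against the history schema shown earlier: for each labeled event, pick the feature values whose validity window contains the event timestamp. This is a generic illustration in SQLite, not Splice Machine's implementation. The `labels` table, the integer timestamps, and the half-open window `[asof_ts, until_ts)` are assumptions, and it only consults the history table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_set_history (
    primary_key INTEGER, asof_ts INTEGER, until_ts INTEGER, feature_1 REAL
);
CREATE TABLE labels (primary_key INTEGER, event_ts INTEGER, label INTEGER);
""")
conn.executemany(
    "INSERT INTO feature_set_history VALUES (?, ?, ?, ?)",
    [(1, 100, 200, 0.5), (1, 200, 300, 0.7), (2, 100, 300, 9.0)],
)
conn.executemany(
    "INSERT INTO labels VALUES (?, ?, ?)",
    [(1, 150, 0), (1, 250, 1), (2, 299, 1)],
)

# For each label, join the feature version that was valid at event_ts.
training_rows = conn.execute("""
    SELECT l.primary_key, l.event_ts, h.feature_1, l.label
    FROM labels AS l
    JOIN feature_set_history AS h
      ON h.primary_key = l.primary_key
     AND h.asof_ts <= l.event_ts
     AND l.event_ts < h.until_ts
    ORDER BY l.primary_key, l.event_ts
""").fetchall()
```

The event at timestamp 150 picks up `feature_1 = 0.5` and the one at 250 picks up `0.7`, so no training row sees a feature value from its own future.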
I am very impressed by the simple architecture of the Splice Machine Feature Store.
The only concern I have is performance, especially when compared with KV store competitors such as Redis, as machine learning can be very latency-sensitive in some scenarios. For example, in a recommendation system, it is common to have to retrieve hundreds of candidates (that is, look up the feature store with hundreds of primary keys) within milliseconds.
Besides the architecture itself, I am also curious about Splice Machine's HTAP implementation. Sadly, it is closed-source and I have no further information about it.
I want to thank Monte Zweben and Jack Ploshnick for giving the feature store presentation at the Data+AI Summit 2021, and Jim Dowling for pointing me to Splice Machine when I asked about single-store architecture in the great MLOps.community Slack workspace.
If you find it interesting, discuss on Twitter!
© Yik San Chan.