Dolt + Flock Safety: Versioned Feature Store

Dolt is a version controlled SQL database. How would you use such a thing?

Are you building novel machine learning models using structured data? Are you worried about model reproducibility? Are you worried about model explainability? Are data scientists and ML engineers stepping on each other's toes cleaning or selecting the right data to train models?

Flock Safety chose Dolt to solve model reproducibility, explainability, and collaboration for their machine learning models. Flock is one of Dolt's first customers. In this blog we'll tell their story.

Flock Safety

Based in Atlanta, Georgia, Flock Safety (from here on out I'll call them "Flock" for brevity) was founded in 2017. Flock builds a full-service, end-to-end safety solution for neighborhoods, properties, schools, and businesses. Flock works with its clients to install network-connected, infrastructure-free cameras and audio detection devices on premises.

Flock Safety Camera

Flock uses machine learning to decode data captured from these devices into actionable evidence. Using these machine learning models, Flock can send an alert when a known suspect vehicle drives past a camera, or let investigators search a suspect vehicle's history. Law enforcement uses this information to identify, apprehend, and prosecute the suspect.

Flock's approach works. It was nice to see Paul Graham repost this infographic, highlighting a recent study concluding that Flock technology helps solve 10% of reported crime in the United States.

Flock Safety Tweet

10% of reported US crime is a lot. Here at DoltHub, we're proud to help Flock progress on this mission.

Machine Learning at Flock

Machine learning is a key part of Flock's product. Machine learning models power the automated decoding of evidence captured from Flock’s devices. Richard Taylor, VP of Machine Learning at Flock, says "Machine learning is a core part of Flock Safety's products. Building best-in-class models is necessary for our customers’ success and safety."

Flock maintains vision and audio machine learning models trained on a vast amount of data. As Flock continues to extend and improve its ML, and as changes occur with vehicle and license plate styles, more data needs to be collected, labeled, and incorporated into model training and testing.

Handling these data updates in a version controlled database gives Flock several benefits. A commit of labeled data can target a specific change, which goes through QA and testing before being incorporated into the full model dataset. Schema- and model-breaking data changes can be tested on a Dolt branch without affecting production use of the ML pipelines. Release tags provide an easy way to specify the data used in a specific model, supporting reproducibility of R&D and providing a lineage of changes.

At Flock, Machine Learning infrastructure is a priority. Flock chooses best-of-breed tools to manage and optimize their training and deployment infrastructure. Dolt is one of these tools.

How Dolt Helps Flock

Flock has been a Dolt customer since April 2022. Flock uses Dolt as a versioned feature store. Data used to test and train models is stored in Dolt.

Flock became a Dolt customer to achieve model reproducibility via tags. Since then, Flock has taken advantage of more Dolt features like data diffs for model explainability and data branches for collaboration.

Standard SQL

Flock's data is structured into tables. Columns representing different features and labels populate the tables. Structured data allows Flock to use standard SQL to inspect and update data. Models are trained using data retrieved using SQL. Dolt is a standard, MySQL-compatible SQL database with unique versioning features. This was a good fit for Flock's data.

Moreover, SQL is a common skill among data and software engineers. The learning curve to adopt Dolt was low. New engineers feel comfortable in a standard SQL environment. Very little training is required to get an engineer ramped up to use Dolt at Flock.

Model Reproducibility

Model reproducibility is the ability to regenerate a model using the same inputs. Model reproducibility is analogous to software reproducibility. Software reproducibility relies on codified build processes and source code version control. Model reproducibility relies on codified build processes and data version control. Dolt is the world's only version controlled SQL database.

Flock was originally interested in Dolt only for model reproducibility. Flock had structured data. They wanted to permanently store a version of their training data every time they trained a model. They used Dolt tags to label the data version they used to train. A tag is a named, immutable reference to the state of the database at a specific commit.

When Flock wants to reproduce a model, they look up the data by tag and are sure they are using the same database that was used during the original training run. Dolt uses Git's model of version control. Git tags are also used in source code to accomplish the same goal. This process is very familiar to engineers at Flock.
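
As a rough illustration of the tag workflow described above, the sketch below builds the two SQL statements involved: one that tags the current commit with Dolt's DOLT_TAG procedure, and one that reads a table at that tag with AS OF. The tag name, message, and table name are hypothetical and are not taken from Flock's setup. The statements are returned as (sql, params) pairs for a MySQL client that uses %s placeholders with client-side parameter substitution, which is an assumption about the driver.

```python
import re

# Table names cannot be bound as query parameters, so validate them instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def tag_training_data(tag, message):
    """Statement that tags the current HEAD commit with a name and a message."""
    return "CALL DOLT_TAG('-m', %s, %s)", (message, tag)


def read_training_data(table, tag):
    """Query that reads a table exactly as it existed at the given tag."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return f"SELECT * FROM `{table}` AS OF %s", (tag,)


# Hypothetical names: one tag per training run and a table of labeled examples.
tag_sql, tag_params = tag_training_data("model-2023-06", "data for June training run")
read_sql, read_params = read_training_data("training_labels", "model-2023-06")
```

Each pair can be passed to a DB-API cursor as cursor.execute(sql, params). Re-running the read query with the same tag should return the same rows as the original training run, regardless of how the main branch has changed since.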

Model Explainability

Model explainability is the ability to understand why different models output different results. Most technology in this space helps explain how model training parameters affect model performance. However, the data used to train and test models is as important as the training parameters, if not more so.

Dolt allows for fast, queryable computation of data differences or "diffs". Dolt's unique storage engine powers this capability. Fast data diff is the backbone of data version control. Data diffs are a powerful tool for model explainability. What changed in the data between model version X and Y to generate this result? Dolt answers this question for Flock.
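
To make the "what changed between model version X and Y" question concrete, here is a minimal sketch that builds a query against Dolt's DOLT_DIFF table function and then counts the returned rows by their diff_type column (added, modified, or removed). The tag names, table name, and the label columns in the sample rows are hypothetical. The sample rows are hand-written stand-ins for a result set, not real data, and the %s placeholders again assume client-side parameter substitution.

```python
from collections import Counter


def diff_between_tags(table, from_tag, to_tag):
    """Query returning one row per changed row between two revisions of a table."""
    return "SELECT * FROM DOLT_DIFF(%s, %s, %s)", (from_tag, to_tag, table)


def summarize_diff(rows):
    """Count changed rows by kind, given diff rows as dictionaries."""
    return Counter(row["diff_type"] for row in rows)


# Hypothetical tags and table name.
diff_sql, diff_params = diff_between_tags("training_labels", "model-x", "model-y")

# Hand-written rows standing in for a real result set (illustrative only).
sample_rows = [
    {"diff_type": "added", "from_label": None, "to_label": "a"},
    {"diff_type": "modified", "from_label": "a", "to_label": "b"},
    {"diff_type": "added", "from_label": None, "to_label": "c"},
]
summary = summarize_diff(sample_rows)
```

Because the diff is an ordinary result set, it can also be filtered or aggregated in SQL before it reaches Python, for example by adding a WHERE clause on diff_type.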

Efficient Collaboration

Flock uses data branches for more efficient collaboration between teams and projects. At Flock, one team may be working on a long-running project to add a new feature to the model. Another may be testing a new set of labels. These projects happen while the core deployed model continues to be tuned and optimized. New data is being ingested and labeled every day.

Dolt allows each of these teams to work on its own isolated branch, so no team overwrites another's changes. As new data comes into the main branch, it can be synced as needed to the development branches using Dolt's merge functionality. This same merge functionality allows a successful project's data to be merged into the main branch when it's ready.

Data branches prevent Flock engineers from stomping on each other's work and have allowed the team to scale more efficiently.
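
The branch-and-merge flow described in this section might look roughly like the following sequence of Dolt stored procedure calls, assembled here as a plain list of statements. The branch name and commit message are hypothetical, the main branch is assumed to be named main, and all statements are assumed to run in a single SQL session, since DOLT_CHECKOUT changes the active branch for that session. This is a sketch of one possible workflow, not Flock's actual process.

```python
def start_branch(branch):
    """Create a new branch from the current one and switch to it."""
    return [("CALL DOLT_CHECKOUT('-b', %s)", (branch,))]


def commit_work(message):
    """Stage every modified table and commit it on the active branch."""
    return [("CALL DOLT_COMMIT('-a', '-m', %s)", (message,))]


def sync_from_main(branch):
    """Bring new data from main into a development branch."""
    return [("CALL DOLT_CHECKOUT(%s)", (branch,)), ("CALL DOLT_MERGE('main')", ())]


def land_on_main(branch):
    """Merge a finished development branch back into main."""
    return [("CALL DOLT_CHECKOUT('main')", ()), ("CALL DOLT_MERGE(%s)", (branch,))]


# Hypothetical project: a team testing a new set of labels.
workflow = (
    start_branch("new-labels")
    + commit_work("relabel a batch of examples")
    + sync_from_main("new-labels")
    + land_on_main("new-labels")
)
```

Each (sql, params) pair can be executed in order with a DB-API cursor. A merge can report conflicts when both branches changed the same rows, so a real pipeline would check the result of each DOLT_MERGE call before continuing.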

Flock Chose Hosted Dolt

Once Flock knew Dolt was the right fit for their feature store, they had a few Dolt flavors to choose from. Flock wanted a deployed instance of Dolt that behaved like a standard MySQL database. Flock chose Hosted Dolt so they don't have to run Dolt themselves on their own infrastructure.

Flock appreciates the stability provided by Hosted Dolt's 24-hour on-call support, automated backups, and continuous monitoring. Flock is also a heavy user of Hosted Dolt's built-in workbench. Flock uses the modern web-based interface to execute SQL queries. Hosted Dolt is the easiest, safest way to run Dolt in production.

Conclusion

Dolt makes a great versioned feature store for Flock Safety's Machine Learning pipeline. Dolt ensures model reproducibility and explainability and allows Flock's data engineers to efficiently collaborate, all using standard SQL.

Sound like something that could help you at your company? Join our Discord and let's discuss your use case.

Also, if working with exciting tools like Dolt and the mission of eliminating crime and shaping safer communities are interesting to you, check out https://www.flocksafety.com/careers.