Where Is The Data Catalog?

Why is there no place on the internet to get useful, maintained data? This question has puzzled me since 2013. We can rent a server. We can rent a database. Why can't we rent the data in the database? Something like that would be extremely useful. It's weird for it not to exist. Another way to state this question: how do we share data today? What are some of the limitations of those methods? Do any of those methods produce efficiency through centralization?

Surveying the data landscape, I came up with three ways data is shared today. I will order them from least sophisticated to most sophisticated.

Least Sophisticated: Email a CSV file

If the data is small enough and the schema isn't complicated, just export to CSV and email it to me. I'll load it into Excel or Google Sheets and do my thing. When I want a new copy with the updates, I'll let you know.

Moderately Sophisticated: A cloud storage bucket with JSON in it

The data is too big for email or the schema is relatively complicated, i.e. it has one-to-many mappings. We need another way. Well, let's just make a whole lot of JSON blobs and put those in a cloud storage bucket. Here are the permissions to the bucket, go download it. If the data changes, I'll let you know and you can download it again.

Most Sophisticated: A private or public API

This data is pretty valuable. I don't want to give you all of it. Just send up a key and I'll give you back the information you need. Don't store a copy because the data will be changing and you always want the latest. Plus, you could be stealing from me if you are storing it. If you have an application that needs access to the whole thing, too bad.

These methods all have one thing in common: there is a single view of the data. The data you are getting is "the truth", at least at that point in time. If the data changes, either in the source or by your hand, that is the new truth. You can either accept a single truth or fork. A fork means you maintain your own copy of the data forevermore.

At this point, I think it makes sense to introduce two general buckets of data: immutable log data and dictionary data.

Immutable log data is colloquially called "big data." It's all the information that we are collecting from our ever-increasing electronic footprint: our phones, our web browsers, our cameras. Humans generally don't modify this data. The sensor is the single source of truth. We spend our time aggregating and labeling this data to find problems in the sensor readings and predict future sensor readings. The tools and capabilities we enjoy in this space have advanced massively in the last ten years.

Dictionary data is human curated. It comes in the following form: there are one or more keys and a bunch of columns with more information about those keys. It's used to connect multiple streams of immutable log data or to add context to immutable log data. Some examples are IP to geographical location mapping, or product SKU to corporate security mapping. This data is usually about the "three Ps": people, places, or products. There has not been much innovation here since the invention and adoption of the API back in the mid-2000s.
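To make the distinction concrete, here is a minimal Python sketch of dictionary data adding context to immutable log data. Everything in it is made up for illustration: the log records, the IP addresses (taken from reserved documentation ranges), the locations, and the `enrich` and `bytes_by_country` helpers.

```python
# Hypothetical immutable log records: (timestamp, source IP, bytes sent).
# Nobody edits these by hand; the sensor is the source of truth.
log_records = [
    ("2020-01-01T00:00:00", "203.0.113.7", 512),
    ("2020-01-01T00:00:05", "198.51.100.23", 2048),
    ("2020-01-01T00:00:09", "203.0.113.7", 128),
]

# Hypothetical dictionary data: one key (the IP) plus human-curated columns.
ip_to_location = {
    "203.0.113.7": {"country": "US", "city": "Santa Monica"},
    "198.51.100.23": {"country": "DE", "city": "Berlin"},
}


def enrich(records, dictionary):
    """Attach the dictionary columns to each log record by key lookup."""
    enriched = []
    for timestamp, ip, num_bytes in records:
        # Unknown keys get empty context instead of being dropped.
        context = dictionary.get(ip, {"country": None, "city": None})
        enriched.append(
            {"timestamp": timestamp, "ip": ip, "bytes": num_bytes, **context}
        )
    return enriched


def bytes_by_country(rows):
    """Aggregate the enriched log by a column that only the dictionary knows."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["bytes"]
    return totals


rows = enrich(log_records, ip_to_location)
totals = bytes_by_country(rows)
print(totals)
```

The log never changes, but the answer to "how many bytes per country?" changes whenever someone corrects a row in the dictionary. That is why the dictionary is the part people need to collaborate on.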

Back to this place on the internet with the data I yearn for. A JSON-filled cloud storage bucket may be the best solution for sharing immutable log data. Cataloguing these buckets is what the Registry of Open Data on AWS and Google Dataset Search are trying to do. There are also some API catalogs, like RapidAPI. APIs are the best way to get dictionary-style data today. As far as I can tell, none of these have achieved the status of the place on the internet to get useful, maintained data.

I think the reason none of these solutions have caught on is that the method for sharing does not encourage collaboration. The current methods encourage either complete trust or forking. For an internet data catalog to emerge, a format that encourages internet-style collaboration must exist. This is especially true of dictionary-style data, where the truth evolves and different people can have different views of the truth.

Pre-2000, we didn't collaborate on source code much either. The pre-2000 source code world looked a lot like the data world today: large institutions asserted power over the source code they produced, and smaller players got crushed or acquired if they gained too much traction. Google is the new Microsoft. Facebook is the new Oracle.

The rest of us started to share source code because we didn't like Microsoft or Oracle. We also agreed on a format to distribute open source: first patch files, then CVS, then Git. We think by porting the semantics of version control to databases, specifically merging and branching semantics, we have a chance to create the same collaboration dynamic in data that we see in source code. We all need to band together to topple the giants and usher in an age of data collaboration so any small player with a great idea can flourish. We can only do this if we have a format built for data collaboration.

We built this with Dolt. Dolt is a database built from the engine up to encourage collaboration and sharing. Dolt is Git semantics on top of a SQL database. You can see who changed what data and why. You can branch a copy, make some writes, and still get updates from the master branch. If both sides change the same value, the merge produces a conflict for you to resolve. We also built DoltHub, a place on the internet to share these databases. With a little help from you, we think DoltHub can evolve into the place on the internet to get useful, maintained data.
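To show what merging and branching semantics can mean for a table, here is a minimal Python sketch of a cell-level three-way merge. It is not how Dolt is implemented. It is an illustration under simplifying assumptions: a table is a dict from primary key to a dict of columns, `None` means "absent", and the `three_way_merge` function and the SKU data are hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a table against their common ancestor.

    Returns (merged_table, conflicts). A conflict is recorded only when
    both sides changed the same cell to different values.
    """
    merged = {}
    conflicts = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        base_row = base.get(key, {})
        our_row = ours.get(key, {})
        their_row = theirs.get(key, {})
        row = {}
        for col in sorted(set(base_row) | set(our_row) | set(their_row)):
            b, o, t = base_row.get(col), our_row.get(col), their_row.get(col)
            if o == t:
                value = o  # both sides agree (or neither changed it)
            elif o == b:
                value = t  # only their side changed this cell
            elif t == b:
                value = o  # only our side changed this cell
            else:
                conflicts.append((key, col, o, t))
                value = o  # keep our value as a placeholder until resolved
            if value is not None:
                row[col] = value
        if row:
            merged[key] = row
    return merged, conflicts


# Common ancestor and two branches of a hypothetical product table.
base = {
    "sku-1": {"name": "Widget", "price": 10},
    "sku-2": {"name": "Gadget", "price": 20},
}
ours = {
    "sku-1": {"name": "Widget", "price": 12},  # we changed the price
    "sku-2": {"name": "Gadget", "price": 20},
}
theirs = {
    "sku-1": {"name": "Widget Pro", "price": 10},  # they changed the name
    "sku-2": {"name": "Gadget", "price": 20},
    "sku-3": {"name": "Gizmo", "price": 5},  # they added a row
}

merged, conflicts = three_way_merge(base, ours, theirs)

# If their branch had also changed the price of sku-1, the cell conflicts.
theirs_conflicting = {**theirs, "sku-1": {"name": "Widget", "price": 11}}
_, price_conflicts = three_way_merge(base, ours, theirs_conflicting)
```

In the first merge, edits to different cells of the same row combine cleanly and the new row comes along for free. In the second, both branches touched the same cell, so the merge reports a conflict instead of silently picking a winner. That is the difference between collaborating on a shared truth and forking it.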


