Announcing the $10,000 Menus Data Bounty

If you aren’t yet familiar with Dolt and DoltHub, Dolt is a MySQL-compatible SQL database that you can branch, diff and merge. DoltHub hosts those databases on the web. We are currently running data bounties, a series of collaborative database projects in which participants get paid to contribute data.

Today we are launching a $10,000 data bounty to pay contributors to build the only free, open source menus database on the internet.

Background

In 2010, Congress enacted the federal menu labeling law. As of May 7, 2018, the FDA requires restaurant chains with 20 or more locations to provide menu labeling for consumers. Enforcement of this law began May 7, 2019. Legal reasons aside, restaurants and other businesses also use menu data to set competitive offerings and prices. Whether you analyze law, nutrition or restaurant strategy, this is data we think should be open source and free for all to inspect, contribute to and use.

Schema

We toyed with this schema for a few days. We wondered whether adding pricing data would widen the scope of the bounty too much. We ultimately couldn't pick between prices and nutrition, and since nutritional data and prices are often listed together on the internet, it made sense to allow contestants to scrape nutritional information and prices at once. That is, however, not a requirement. We will accept submissions that contain menu items with nutritional information, prices, or both.

The menu_items table has a three-part composite primary key: (name, restaurant_name, identifier). The identifier is where we will store regional data, where applicable. The database README.md has more explicit information on how to build the primary key.

In short, the identifier is either "NATIONAL" or comma-delimited regional information, for example "Santa Monica, CA" or "NULL, CA". For food delivery service partners, we make menu items unique by also prepending the service provider name, as in "Postmates, NATIONAL", "Postmates, NULL, CA" or "Postmates, Santa Monica, CA". Using the identifier in this format means we can parse out useful regional information later. At the same time, it allows inserts of distinct menu items, including those whose prices vary across locations or between direct and third-party menus.
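
The identifier format above can be sketched in a few lines of Python. This is only an illustration of the convention described in this post, not code from the bounty repository: the function names build_identifier and parse_identifier are hypothetical, and the database README.md remains the authoritative guide for building the key.

```python
from typing import Dict, Optional

NATIONAL = "NATIONAL"
NULL = "NULL"


def build_identifier(city: Optional[str] = None,
                     state: Optional[str] = None,
                     provider: Optional[str] = None) -> str:
    """Build an identifier string (hypothetical helper, see lead-in)."""
    if city is None and state is None:
        region = NATIONAL
    else:
        # A missing city or state is written as the literal text NULL.
        region = f"{city or NULL}, {state or NULL}"
    # Delivery providers are prepended to the regional part.
    return f"{provider}, {region}" if provider else region


def parse_identifier(identifier: str) -> Dict[str, Optional[str]]:
    """Split an identifier back into provider, city and state."""
    parts = [part.strip() for part in identifier.split(",")]
    provider = None
    # Three parts, or two parts ending in NATIONAL, carry a provider.
    if len(parts) == 3 or (len(parts) == 2 and parts[1] == NATIONAL):
        provider, parts = parts[0], parts[1:]
    if parts == [NATIONAL]:
        return {"provider": provider, "city": None, "state": None}
    if len(parts) != 2:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    city, state = parts
    return {
        "provider": provider,
        "city": None if city == NULL else city,
        "state": None if state == NULL else state,
    }


print(build_identifier(state="CA", provider="Postmates"))
print(parse_identifier("Postmates, Santa Monica, CA"))
```

Note that this sketch assumes city and provider names never contain commas; a real scraper would need to handle that case.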

menus> describe menu_items;
+-----------------+---------------+------+-----+---------+-------+
| Field           | Type          | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+-------+
| name            | varchar(255)  | NO   | PRI |         |       |
| restaurant_name | varchar(100)  | NO   | PRI |         |       |
| identifier      | varchar(255)  | NO   | PRI |         |       |
| calories        | int           | YES  |     | NULL    |       |
| fat_g           | decimal(6,2)  | YES  |     | NULL    |       |
| carbohydrates_g | int           | YES  |     | NULL    |       |
| protein_g       | int           | YES  |     | NULL    |       |
| sodium_mg       | int           | YES  |     | NULL    |       |
| price_usd       | decimal(10,2) | YES  |     |         |       |
| cholesterol_mg  | int           | YES  |     | NULL    |       |
| fiber_g         | int           | YES  |     | NULL    |       |
| sugars_g        | int           | YES  |     | NULL    |       |
+-----------------+---------------+------+-----+---------+-------+

Here is an example of McDonald’s data, which we used to seed the database:

menus> select * from menu_items limit 5;
+-----------------------------------------+-----------------+------------+----------+-------+-----------------+-----------+-----------+-----------+----------------+---------+----------+
| name                                    | restaurant_name | identifier | calories | fat_g | carbohydrates_g | protein_g | sodium_mg | price_usd | cholesterol_mg | fiber_g | sugars_g |
+-----------------------------------------+-----------------+------------+----------+-------+-----------------+-----------+-----------+-----------+----------------+---------+----------+
| APPLE SLICES                            | MCDONALD'S      | NATIONAL   | 15       | 0.00  | 4               | 0         | NULL      | NULL      | 0              | 0       | 3        |
| BACON BUFFALO RANCH MCCHICKEN           | MCDONALD'S      | NATIONAL   | 430      | 21.00 | 41              | 20        | 850       | 1.00      | 50             | 2       | 6        |
| BACON CHEDDAR MCCHICKEN                 | MCDONALD'S      | NATIONAL   | 480      | 24.00 | 43              | 22        | 650       | 1.00      | 65             | 2       | 6        |
| BACON CLUBHOUSE BURGER                  | MCDONALD'S      | NATIONAL   | 720      | 40.00 | 51              | 39        | 1470      | 4.49      | 115            | 4       | 14       |
| BACON CLUBHOUSE CRISPY CHICKEN SANDWICH | MCDONALD'S      | NATIONAL   | 750      | 38.00 | 65              | 36        | NULL      | NULL      | 90             | 4       | 16       |
+-----------------------------------------+-----------------+------------+----------+-------+-----------------+-----------+-----------+-----------+----------------+---------+----------+
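
To see how the composite primary key behaves, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Dolt. It assumes a trimmed-down version of the table with only a few columns, and the Santa Monica row and its price are invented purely for illustration. The same item can be inserted under different identifiers, while an exact duplicate of all three key columns is rejected.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Dolt (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE menu_items (
        name            TEXT NOT NULL,
        restaurant_name TEXT NOT NULL,
        identifier      TEXT NOT NULL,
        calories        INTEGER,
        price_usd       NUMERIC,
        PRIMARY KEY (name, restaurant_name, identifier)
    )
""")

rows = [
    # Taken from the seed data shown above.
    ("BACON CLUBHOUSE BURGER", "MCDONALD'S", "NATIONAL", 720, 4.49),
    # Hypothetical regional row with a made-up price.
    ("BACON CLUBHOUSE BURGER", "MCDONALD'S", "Santa Monica, CA", 720, 4.99),
]
conn.executemany("INSERT INTO menu_items VALUES (?, ?, ?, ?, ?)", rows)

# Re-inserting a row with the same three key values violates the key.
try:
    conn.execute("INSERT INTO menu_items VALUES (?, ?, ?, ?, ?)", rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM menu_items").fetchone()[0]
print(row_count, duplicate_rejected)
```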

Data Validation

We are excited to release a public GitHub repository intended to mirror and validate the DoltHub menus database. This GitHub repository contains a data validation script that users will run locally to identify issues and make any necessary corrections to submissions. This will also be used by the reviewer to ensure submissions pass minimum formatting requirements. Although GitHub contributions are not a paid part of the bounty, contestants are welcome to contribute to the data validation script by submitting a PR on GitHub.

The validation script will continue to evolve over the course of the bounty as we discover more edge cases, creating a cycle of improvement between our scripts and submissions. We imagine scripts like this forming the basis for continuous integration on DoltHub, and have started to map out what we need to build an MVP for DoltHub CI. So far we think that includes a combination of webhooks for DoltHub pull requests and GitHub Actions.

To run the script, first clone the GitHub repository of the same name: dolthub/menus. Install python3 if you don't already have it, as well as Dolt. Then locate menus/validation/validate_menu_items.py and fill in the relative path to your Dolt directory:

relative_path_to_dolt_directory = "FILL ME IN"
db = doltcli.Dolt(relative_path_to_dolt_directory)

Finally, cd into menus/validation and run the script:

python3 validate_menu_items.py

Users will be prompted to run SQL queries to make corrections if any issues are found.
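
To give a feel for the kind of checks a validation script might run, here is a rough sketch of a row-level check. It is not the contents of validate_menu_items.py: the function find_issues and the specific rules are assumptions, based only on the schema and submission requirements described in this post.

```python
from typing import Any, Dict, List

KEY_COLUMNS = ("name", "restaurant_name", "identifier")
VALUE_COLUMNS = (
    "calories", "fat_g", "carbohydrates_g", "protein_g", "sodium_mg",
    "price_usd", "cholesterol_mg", "fiber_g", "sugars_g",
)


def find_issues(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one menu_items row (hypothetical)."""
    issues = []
    # Primary key columns are NOT NULL in the schema.
    for col in KEY_COLUMNS:
        value = row.get(col)
        if not isinstance(value, str) or not value.strip():
            issues.append(f"{col} must be a non-empty string")
    # A submission needs nutritional information, a price, or both.
    present = [col for col in VALUE_COLUMNS if row.get(col) is not None]
    if not present:
        issues.append("row has neither nutritional information nor a price")
    for col in present:
        value = row[col]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            issues.append(f"{col} must be a number")
        elif value < 0:
            issues.append(f"{col} must not be negative")
    return issues


good_row = {
    "name": "BACON CHEDDAR MCCHICKEN",
    "restaurant_name": "MCDONALD'S",
    "identifier": "NATIONAL",
    "calories": 480,
    "price_usd": 1.00,
}
print(find_issues(good_row))
print(find_issues({"name": "APPLE SLICES", "restaurant_name": "", "identifier": "NATIONAL"}))
```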

Next steps

The menus bounty will be running for six weeks, until August 18, 2021. Get started scraping menu data, fork the database and submit a pull request to start earning money! If you need guidance or inspiration to start web scraping, pick a restaurant from this list of restaurants that we scraped in the research phases of this bounty. We've also open sourced the web scrapers we used to acquire McDonald's data.

Conclusion

Stay tuned for more product development regarding continuous integration testing. Come chat with us on Discord if you find data validation interesting. Our #data-bounties channel would love to meet you.
