Collaborative GPT-3 Dataset


Dolt is Git for data. Recently, we've been thinking a lot about what could be Dolt's Linux. A reader of that blog post had a suggestion: an open GPT-3 dataset.

Dolt really shines as a collaborative database where many users are making distributed edits. These edits can be reviewed in pull requests, like code, so you can accept or reject edits. But each user has their own forked copy of the database so acceptance or rejection into the master dataset is less important.

Recently, we've become excited about building these types of open, collaborative datasets, as seen in the open resume initiative. This blog post introduces our shot at creating a sample of text generated by GPT-3.

What is GPT-3

GPT-3 stands for Generative Pre-trained Transformer 3. GPT-3 was created by OpenAI, a company focused on Artificial Intelligence. It's a machine learning model whose purpose is to generate text that is indistinguishable from what a human would generate. It was released in May 2020 and was put in beta testing last month, July 2020.

GPT-3 accepts a prompt, along with some other parameters, and then spits out coherent text in response to the prompt. It's notable because it's the biggest generative text model ever trained by about 10X and it produces pretty incredible results.

Here's an example prompt and response:

Prompt: "I am a startup CEO in my basement playing with an artificial intelligence in the middle of a pandemic."

Response: "You don't need to tell me what I can and cannot do, because you know that's all I ever wanted to be. And if it helps, your company is also in the basement."

I supplied the prompt and the AI generated the response. That prompt/response pair was cherry-picked for the sick burn the AI dropped on me at the end. It really is indistinguishable from most humans I interact with.

AI Dungeon

The easiest way to get access to GPT-3 is to buy a subscription to AI Dungeon. That's where we started. If this dataset catches on, we hope we can get direct access to the GPT-3 API but until then we built a simple AI Dungeon Scraper.

AI Dungeon is really cool and I highly recommend it for those interested in Dungeons and Dragons or Artificial Intelligence. In my experience, the circles in the Venn diagram for those two interests overlap some. In AI Dungeon, you create your own adventure story by entering prompts, similar to early text-based dungeon crawlers, except now you can enter more detailed prompts than "grab bag" or "fight orc". The computer generates a response and you proceed with your adventure story.

The closest thing to raw GPT-3 is to use AI Dungeon in "custom" mode under the for-pay "Dragon" AI mode. That strips out a lot of extra bells and whistles that keep the AI thinking D&D. It's still not raw GPT-3 but it's close after the first response. According to the above, the second response onward is GPT-3.

An Open GPT-3 Dataset

It would be interesting and useful to have an open set of GPT-3 prompt/response pairs. The set could be used to build a model that detects whether text was generated by GPT-3, or to find places where GPT-3 makes errors and see if they can be fixed in GPT-4. The prompts could also be used to test the quality of an alternative generative text model against GPT-3.

Dolt is an ideal place to store and distribute this type of dataset. Many people all around the world can contribute their prompts and responses without stomping on each other's edits. The data is schemaed and accessible via SQL. You can view the repository on DoltHub and even run SQL queries there.

With the above in mind we created the ai-dungeon Dolt repository. It has two tables, one for settings information and the other for prompt/response pairs. A prompt can have multiple responses if the user hits the retry button. Also the AI remembers previous prompts and responses so each prompt has a sequence number.

Here's the query that generates data for a session. If you want to see the settings used in that session, this is the query. As you can see, I'm using Dragon mode. As a team, we've been playing with it for a few hours, so it's currently a small sample.
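To make the two-table layout more concrete, here's a minimal sketch of how settings, retries, and sequence numbers could fit together. Every table and column name below is our own illustrative assumption, not the actual schema of the ai-dungeon repository, and it uses Python's built-in sqlite3 as a stand-in for Dolt's SQL interface. The rows are made-up placeholders, not data from the repository.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# NOT the real schema of the ai-dungeon Dolt repository.
SCHEMA = """
CREATE TABLE settings (
    session_id TEXT PRIMARY KEY,
    ai_mode    TEXT NOT NULL
);
CREATE TABLE prompt_responses (
    session_id TEXT    NOT NULL,
    sequence   INTEGER NOT NULL,  -- position of the prompt within the session
    attempt    INTEGER NOT NULL,  -- 0 for the first response, 1+ for retries
    prompt     TEXT    NOT NULL,
    response   TEXT,
    PRIMARY KEY (session_id, sequence, attempt)
);
"""


def session_rows(conn, session_id):
    """Return (sequence, attempt, prompt, response) rows for one session, in order."""
    return conn.execute(
        "SELECT sequence, attempt, prompt, response FROM prompt_responses "
        "WHERE session_id = ? ORDER BY sequence, attempt",
        (session_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO settings VALUES (?, ?)", ("demo", "dragon"))
# Placeholder rows, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO prompt_responses VALUES (?, ?, ?, ?, ?)",
    [
        ("demo", 1, 0, "second prompt", "reply"),
        ("demo", 0, 0, "first prompt", "reply"),
        ("demo", 0, 1, "first prompt", "retried reply"),
    ],
)
rows = session_rows(conn, "demo")
ai_mode = conn.execute(
    "SELECT ai_mode FROM settings WHERE session_id = ?", ("demo",)
).fetchone()[0]
```

Keying each row on session, sequence number, and attempt is one way to let a retried prompt hold several responses while still keeping the conversation in order.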

Now, this data might be a little hard to enter yourself using SQL, so we built an open source AI Dungeon scraper to gather session information using the command line. It's a primitive first pass but it gets the job done. If there's interest we can improve it. It's open source after all. The scraper navigates a session via the command line, collects the data, and then inserts the new data into a cloned ai-dungeon Dolt repository. After that it's up to you to commit the changes if you'd like. We suggest you do this on its own branch, ask us for permission to submit to Dolt, and then submit a Pull Request. Once we have forks, you won't need to ask us for permission. You can just fork and we'll update the script to take an organization name. But this will work for now.
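The step where collected prompts and responses become SQL could look roughly like the sketch below. This is not the scraper's actual code: the function names, the table name, the column order, and the choice to store a missing response as NULL are all assumptions made for illustration. A real implementation would probably prefer parameterized queries over hand-built string literals.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (sketch only, not hardened)."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    # Escape backslashes first, then double any single quotes.
    escaped = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def build_inserts(session_id, turns, table="prompt_responses"):
    """Turn [(prompt, [responses...]), ...] into INSERT statements.

    The table name and column order are hypothetical. A prompt that got no
    response still produces one row, with a NULL response.
    """
    statements = []
    for sequence, (prompt, responses) in enumerate(turns):
        for attempt, response in enumerate(responses or [None]):
            row = (session_id, sequence, attempt, prompt, response)
            values = ", ".join(sql_literal(v) for v in row)
            statements.append(f"INSERT INTO {table} VALUES ({values});")
    return statements


# Placeholder session: the first prompt got no reply, the second was retried once.
statements = build_inserts("demo", [("It's a test", []), ("Hello", ["Hi", "Hey"])])
```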

Here's a sample transcript of the scraper.

ai-dungeon-scraper $ ./ai-dungeon-scraper.py --email tim@liquidata.co --password <redacted>
Logging into AI Dungeon
Collecting Settings Information.
Starting Session.
Enter your prompt. Enter 'exit' to exit:
I am testing the AI Dungeon command line scraper so I can publish the transcript in the blog I am writing about it.
Please wait for reply...
ERROR: AI Dungeon did not respond with a response
Enter your prompt. Enter 'exit' to exit:
That was unexpected. I wonder why the AI could not understand that prompt.
Please wait for reply...
 It is a bit strange, but then I guess it would be if you were using an old version of the program.
Enter your prompt. Enter 'exit' to exit:
It must be the fact that it is using GPT-2 on the initial response. Now we're in GPT-3 land. It's much cooler here.
Please wait for reply...
 We've left GPT-3 Central behind and the sky is a deep purple now. The paper cranes are flying in formation around us.
Enter your prompt. Enter 'exit' to exit:
That got real meta real quick. What is a paper crane? What is its significance?
Please wait for reply...
 I feel like it's a particular trigger for me, but I can't place it.
It's almost like it's a meme for when you're using GPT-2.
Enter your prompt. Enter 'exit' to exit:
I love memes especially ones with cats.
Please wait for reply...
Who are you?
I don't know.
What is GPT-2?
It's a compiler. It compiles thoughts into beliefs into emotions into actions.
Hey this isn't working.
Enter your prompt. Enter 'exit' to exit:
exit
Preparing SQL...
Writing to Dolt repo.

Everything up-to-date

Rows inserted: 6 Rows updated: 0 Rows deleted: 0
Data written to local ai-dungeon Dolt repository. Add, commit, and push if you would like to contribute it.
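The add, commit, and push steps the scraper leaves to you can be sketched as a small Python helper around the Dolt command line. The branch name, commit message, repository directory, and the remote name `origin` are assumptions for illustration. By default the helper only prints the commands instead of running them, so nothing changes in your repository until you opt in.

```python
import subprocess


def contribution_commands(branch, message, remote="origin"):
    """Build the Dolt CLI calls for contributing a session on its own branch.

    The remote name "origin" is an assumption about how the clone is set up.
    """
    return [
        ["dolt", "checkout", "-b", branch],
        ["dolt", "add", "."],
        ["dolt", "commit", "-m", message],
        ["dolt", "push", remote, branch],
    ]


def contribute(repo_dir, branch, message, dry_run=True):
    """Print the commands (dry run) or execute them inside repo_dir."""
    commands = contribution_commands(branch, message)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(argv, cwd=repo_dir, check=True)
    return commands


# Hypothetical directory and branch names; dry run only.
commands = contribute("ai-dungeon", "my-session", "Add one AI Dungeon session")
```

Once the branch is pushed, the Pull Request itself is opened on DoltHub.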

Conclusion

We're not sure if collaborative open data Dolt use cases like this will get any traction. We've always hoped Dolt could spur an open data revolution similar to what we see in open source. Maybe GPT-3 data will be Dolt's Linux? We'll have to wait and see.
