WordNet in Dolt

The Princeton WordNet database is on DoltHub. This post covers how it got there and how to use it.

WordNet is distributed natively from Princeton as a compilable custom database. You can also download just the database files, but they are in a funky format. There is a project on SourceForge where someone has imported the database into SQLite. I tried, unsuccessfully, to load the distributed file in the SQLite shell:

sqlite> .open ~/Downloads/sqlite-31.db
Error: unable to open database "~/Downloads/sqlite-31.db": unable to open database file

This illustrates one of the problems Dolt is trying to solve. Distributing data in SQL format on the internet is hard. There are multiple conflicting formats and database versions. Instead, the data community distributes unstructured data as CSVs or JSON and lets the consumer do the import work.

Contrast this with Dolt. You can procure a SQL database in one command!

$ dolt clone dolthub/word-net
cloning https://doltremoteapi.dolthub.com/dolthub/word-net
6,141 of 6,141 chunks complete. 0 chunks being downloaded currently.

$ cd word-net/

$ dolt sql -q "show tables"
+-----------------+
| tables          |
+-----------------+
| lexs            |
| pointers        |
| synset_pointers |
| synset_types    |
| synsets         |
| words           |
| words_synsets   |
+-----------------+

$ dolt sql -q "describe words"
+--------------------+---------------+------+-----+---------+-------+
| Field              | Type          | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+-------+
| word               | varchar(1024) | NO   | PRI | NULL    |       |
| lex_id             | varchar(1024) | NO   | PRI | NULL    |       |
| syntactic_category | varchar(1024) | YES  |     | NULL    |       |
+--------------------+---------------+------+-----+---------+-------+

$ dolt sql -q "select * from words where word='dog'"
+------+--------+--------------------+
| word | lex_id | syntactic_category |
+------+--------+--------------------+
| dog  | 0      | n                  |
| dog  | 1      | n                  |
| dog  | 2      | n                  |
+------+--------+--------------------+

Moreover, you can get an idea of what you're getting from DoltHub before you download it. You can explore the schema, see some sample data for each table, see when the data was last updated, and examine the diff of the last update. Dolt and DoltHub are designed for data distribution.

Getting the data from the WordNet data files into Dolt wasn't easy, but our intent is for Dolt import job code to be public so users can inspect and debug it. Writing this code took me about two work days, mostly spent trying to understand the source format and design the output schema. Taking data from the web and putting it into structured form is hard. With Dolt, someone only has to write the import job once.

Now for a little tour of WordNet. WordNet is a collection of synonym sets (or "synsets") and their relationships to each other. Synsets are modeled in the synsets table. A synset can contain multiple words, and the words_synsets table records which words belong to which synset. The words themselves are stored in the words table. A word is identified by the word itself plus a lexical id (lex_id), which is used for disambiguation.

$ dolt sql -q "select * from words_synsets where word='dog'"
+------+--------+-----------+-------------+----------+
| word | lex_id | synset_id | synset_type | word_num |
+------+--------+-----------+-------------+----------+
| dog  | 0      | 02005890  | v           | 7        |
| dog  | 0      | 02086723  | n           | 1        |
| dog  | 0      | 03907626  | n           | 4        |
| dog  | 0      | 10042764  | n           | 1        |
| dog  | 1      | 02712903  | n           | 3        |
| dog  | 1      | 07692347  | n           | 5        |
| dog  | 1      | 10133978  | n           | 2        |
| dog  | 2      | 09905672  | n           | 4        |
+------+--------+-----------+-------------+----------+

$ dolt sql -q "select * from synsets where synset_id='02086723' and synset_type='n'"
+-----------+-------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| synset_id | synset_type | lex_num | gloss                                                                                                                                                                              |
+-----------+-------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 02086723  | n           | 05      | a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night" |
+-----------+-------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

A synset is identified by the combination of synset_id and synset_type. The synset_id is derived from the synset's byte offset in the WordNet data file for its synset_type, so an id is only meaningful together with its type. To retrieve a synset, you need both the id and the type.

If you want to see all the synonyms in that synset, you query the words_synsets table.

$ dolt sql -q "select * from words_synsets where synset_id='02086723' and synset_type='n'"
+------------------+--------+-----------+-------------+----------+
| word             | lex_id | synset_id | synset_type | word_num |
+------------------+--------+-----------+-------------+----------+
| Canis familiaris | 0      | 02086723  | n           | 3        |
| dog              | 0      | 02086723  | n           | 1        |
| domestic dog     | 0      | 02086723  | n           | 2        |
+------------------+--------+-----------+-------------+----------+
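
The same lookup can also start from a word rather than a synset id, using a self-join on words_synsets. The sketch below is a stand-in rather than the Dolt workflow itself: it uses Python's built-in sqlite3 module with an in-memory table that holds only a few rows copied from the output above, and the column types are simplified assumptions. The SQL is plain enough that it should also work when passed to dolt sql -q inside the cloned repository, but that is untested here.

```python
import sqlite3

# In-memory stand-in for the Dolt table; column types are simplified assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE words_synsets (
        word TEXT, lex_id TEXT, synset_id TEXT, synset_type TEXT, word_num INTEGER
    )
    """
)

# Rows copied from the query output above (a small subset of the real table).
rows = [
    ("dog", "0", "02005890", "v", 7),
    ("dog", "0", "02086723", "n", 1),
    ("Canis familiaris", "0", "02086723", "n", 3),
    ("domestic dog", "0", "02086723", "n", 2),
]
conn.executemany("INSERT INTO words_synsets VALUES (?, ?, ?, ?, ?)", rows)

# Self-join: find every other word that shares a synset with the given word sense.
# Both synset_id and synset_type take part in the join, since both identify a synset.
SYNONYMS_SQL = """
    SELECT other.word, other.synset_id, other.synset_type
    FROM words_synsets AS me
    JOIN words_synsets AS other
      ON other.synset_id = me.synset_id
     AND other.synset_type = me.synset_type
    WHERE me.word = ? AND me.lex_id = ?
      AND NOT (other.word = me.word AND other.lex_id = me.lex_id)
    ORDER BY other.synset_id, other.word_num
"""

synonyms = conn.execute(SYNONYMS_SQL, ("dog", "0")).fetchall()
for word, synset_id, synset_type in synonyms:
    print(f"{word} ({synset_id}, {synset_type})")
```

With only these sample rows loaded, the verb sense of "dog" has no other members, so the query returns just the two noun synonyms. Against the full table it would return synonyms for every sense of the word.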

Synsets are related to each other through pointers. Each pointer has a symbol, as explained by the pointers table. The pointers between synsets are modeled in the synset_pointers table. The pointer I'll show off is the hypernym pointer (i.e., a pointer to a less specific synset). A noun hypernym is represented by an "@" pointer symbol.

$ dolt sql -q "select * from synset_pointers where from_synset_id='02086723' and from_synset_type='n' and pointer_type_symbol='@'"
+----------------+------------------+--------------+----------------+---------------------+------------------+-----------------+---------------+-------------+
| from_synset_id | from_synset_type | to_synset_id | to_synset_type | pointer_type_symbol | semantic_pointer | lexical_pointer | from_word_num | to_word_num |
+----------------+------------------+--------------+----------------+---------------------+------------------+-----------------+---------------+-------------+
| 02086723       | n                | 01320032     | n              | @                   | true             | false           | 0             | 0           |
| 02086723       | n                | 02085998     | n              | @                   | true             | false           | 0             | 0           |
+----------------+------------------+--------------+----------------+---------------------+------------------+-----------------+---------------+-------------+

$ dolt sql -q "select * from synsets where synset_id='01320032' and synset_type='n'"
+-----------+-------------+---------+----------------------------------------------------------------------------------+
| synset_id | synset_type | lex_num | gloss                                                                            |
+-----------+-------------+---------+----------------------------------------------------------------------------------+
| 01320032  | n           | 05      | any of various animals that have been tamed and made fit for a human environment |
+-----------+-------------+---------+----------------------------------------------------------------------------------+

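Repeating the same pointer lookup starting from 01320032 leads further up the hierarchy, to an even more general synset:

$ dolt sql -q "select * from synsets where synset_id='00015568' and synset_type='n'"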
+-----------+-------------+---------+-------------------------------------------------------+
| synset_id | synset_type | lex_num | gloss                                                 |
+-----------+-------------+---------+-------------------------------------------------------+
| 00015568  | n           | 03      | a living organism characterized by voluntary movement |
+-----------+-------------+---------+-------------------------------------------------------+

You can navigate synsets through these relationships and find more and less specific words, antonyms, and other semantic and lexical relationships. WordNet is a powerful tool for any problem that involves machine interpretation of text. Dolt makes WordNet more accessible.
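
Walking the hypernym chain by hand, one query per level, can be collapsed into a single recursive query. The following is a minimal sketch under stated assumptions. It uses Python's built-in sqlite3 module with in-memory tables as a stand-in for the Dolt database. The tables hold only a handful of columns and rows taken from the output above. The edge from 01320032 to 00015568 is not shown in this post and is assumed here from the order of the queries. Whether the same recursive query runs unchanged under dolt sql depends on your Dolt version, so treat that as an assumption too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the Dolt tables (only the columns needed here).
conn.execute(
    """
    CREATE TABLE synset_pointers (
        from_synset_id TEXT, from_synset_type TEXT,
        to_synset_id TEXT, to_synset_type TEXT,
        pointer_type_symbol TEXT
    )
    """
)
conn.execute("CREATE TABLE synsets (synset_id TEXT, synset_type TEXT, gloss TEXT)")

conn.executemany(
    "INSERT INTO synset_pointers VALUES (?, ?, ?, ?, ?)",
    [
        ("02086723", "n", "01320032", "n", "@"),  # from the output above
        ("02086723", "n", "02085998", "n", "@"),  # from the output above
        ("01320032", "n", "00015568", "n", "@"),  # ASSUMED edge, not shown in the post
    ],
)
conn.executemany(
    "INSERT INTO synsets VALUES (?, ?, ?)",
    [
        ("01320032", "n", "any of various animals that have been tamed "
                          "and made fit for a human environment"),
        ("00015568", "n", "a living organism characterized by voluntary movement"),
        # 02085998 is deliberately left out: its gloss is not shown in the post.
    ],
)

# Recursive CTE: start at one synset and follow '@' (hypernym) pointers upward.
# Assumes the hypernym graph has no cycles; otherwise the recursion would not end.
HYPERNYMS_SQL = """
    WITH RECURSIVE ancestors(synset_id, synset_type, depth) AS (
        SELECT to_synset_id, to_synset_type, 1
        FROM synset_pointers
        WHERE from_synset_id = ? AND from_synset_type = ?
          AND pointer_type_symbol = '@'
        UNION ALL
        SELECT p.to_synset_id, p.to_synset_type, a.depth + 1
        FROM synset_pointers AS p
        JOIN ancestors AS a
          ON p.from_synset_id = a.synset_id
         AND p.from_synset_type = a.synset_type
        WHERE p.pointer_type_symbol = '@'
    )
    SELECT a.depth, a.synset_id, s.gloss
    FROM ancestors AS a
    LEFT JOIN synsets AS s
      ON s.synset_id = a.synset_id AND s.synset_type = a.synset_type
    ORDER BY a.depth, a.synset_id
"""

chain = conn.execute(HYPERNYMS_SQL, ("02086723", "n")).fetchall()
for depth, synset_id, gloss in chain:
    print(depth, synset_id, gloss)
```

The LEFT JOIN keeps hypernyms whose gloss is missing from the sample, so 02085998 appears with a gloss of None. Against the full database every synset in the chain would have a gloss.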

Moreover, if you find WordNet lacking in some way, you can make edits to it and have those edits tracked with Dolt. If the WordNet source is ever updated, you can merge your changes with the new release, and Dolt will surface any merge conflicts. Dolt allows you to rely on WordNet, make changes to it, and still get updates to it when it changes.

WordNet is a great example of the power of Dolt and DoltHub. Clone it and tell us what you think.
