Maintained Wikipedia ngrams dataset in Dolt

6 min read

Wikipedia is the largest and most popular general reference work on the internet, making it a powerful tool for predictive language modeling. Wikipedia releases a dump of all its articles and pages twice a month, and we created a dataset of ngrams using these dumps.

The construction process

We’re using Airflow to create an ETL workflow that runs every time a new dump is released. Not only does this automate the whole process, but it allows users to see the logic behind how the data is extracted and transformed.

Transforming the raw XML from the Wikipedia dump into a structure that can be easily consumed by a Dolt database required some design work. With the help of wikiextractor, each article was sentence-tokenized using nltk (to include sentence padding for bigrams and trigrams - represented by _START_ and _END_), processed to remove unwanted punctuation and whitespace, and split into unigrams, bigrams, and trigrams, adding each to its relative word and article count. The bigram and trigram counts were too large to fit in memory, so I had to shard the work and aggregate the results before importing to Dolt. You can see the Python code for this process here.

Accessing this dataset

In one step you can clone this dataset and start querying using SQL:

$ dolt clone dolthub/wikipedia-ngrams

Getting the top 20 unigrams by total count:

$ dolt sql -q 'select * from unigram_counts order by total_count desc limit 20'
+---------+-------------+---------------+
| unigram | total_count | article_count |
+---------+-------------+---------------+
| the     | 155259841   | 8909083       |
| of      | 72985199    | 5038180       |
| in      | 63129117    | 7436319       |
| and     | 62643059    | 4631776       |
| a       | 45591516    | 6295708       |
| to      | 44389821    | 4337393       |
| was     | 27174519    | 3881191       |
| is      | 21247935    | 4639424       |
| for     | 18356114    | 3694266       |
| on      | 18041482    | 4016161       |
| as      | 17806695    | 3953147       |
| with    | 15700702    | 3360073       |
| by      | 15454532    | 3570537       |
| he      | 13841101    | 3017920       |
| that    | 12476262    | 2382706       |
| at      | 12077889    | 3519426       |
| from    | 11541898    | 3298925       |
| his     | 10868621    | 2334352       |
| it      | 10243530    | 3958561       |
| an      | 8502023     | 3041339       |
+---------+-------------+---------------+

Calculating the frequency of unigrams holiday and winter based on their total count and article count from the July 20, 2019 dump:

$ dolt sql -q 'select u.unigram, cast(u.total_count as decimal) / cast(t.total_word_count as decimal) as word_frequency, cast(u.article_count as decimal) / cast(t.total_article_count as decimal) as article_frequency from unigram_counts u join total_counts t on t.dump_date="7-20-19" where u.unigram="winter" or u.unigram="holiday"'
+---------+------------------------+----------------------+
| unigram | word_frequency         | article_frequency    |
+---------+------------------------+----------------------+
| holiday | 2.6414831850765796e-05 | 0.006453791793413558 |
| winter  | 9.898699994484225e-05  | 0.022862929573204745 |
+---------+------------------------+----------------------+

Google Ngrams

Creating a language model for search prediction is a common use case for ngrams. Google created one of the most popular datasets of ngrams, which is based on about 8 million books over 200 years. You’re able to graph the frequencies of certain words or phrases over time.

While this impressive dataset has been widely used in a variety of scientific and cultural studies, it does have its drawbacks. Will it ever be updated again? If so, when? What if you wanted to directly compare different versions of the dataset? What if you wanted to use their code to improve what they currently have or create a Google Books-based dataset of your own?

Wikipedia Ngrams using Dolt

This is one of the problems we’re trying to solve with Dolt. I came across other ngram datasets as I was researching ngrams to create this dataset, but none were frequently updated, easily manipulated, and quickly downloadable.

With a Dolt database you can create and compare branches, so I made a branch for every dump date with both case-sensitive and case-insensitive ngrams. This allows you to easily diff between two branches and see how all or certain ngram counts have changed over time (this will be more impressive as we add more and more dumps).

Here you can see that there are 25 more articles including the unigram snow (case-insensitive) from 7/20/19 to 8/1/19:

$ dolt diff --where from_unigram='snow' 20190720/case-insensitive 20190801/case-insensitive unigram_counts

diff --dolt a/unigram_counts b/unigram_counts
--- a/unigram_counts @ nqah97epk1ornuf5f6a6d9fkchrungsg
+++ b/unigram_counts @ vdea4mdqj0qcsjpsmvk1jfc7oqcli5ka
+-----+---------+-------------+---------------+
|     | unigram | total_count | article_count |
+-----+---------+-------------+---------------+
|  <  | snow    | 66374       | 35245         |
|  >  | snow    | 66332       | 35220         |
+-----+---------+-------------+---------------+

This ngrams dataset can be used for search prediction using a SQL join. For every unigram or bigram, you can predict the next word by joining with the bigrams or trigrams table. Here are some examples.

Predicting the probability of the next word using unigrams and bigrams (in descending order by probability):

$ dolt sql -q 'select u.unigram, b.bigram, cast(b.total_count as decimal) / cast(u.total_count as decimal) as probability from unigram_counts u inner join bigram_counts b on b.bigram like concat(u.unigram, " %") where u.unigram="winter" order by probability desc limit 15'
+----------+--------------------+----------------+
| unigram  | bigram             | probability    |
+----------+--------------------+----------------+
| winter   | winter olympics    | 0.17874229559  |
| winter   | winter _END_       | 0.08015938010  |
| winter   | winter of          | 0.07336369601  |
| winter   | winter and         | 0.04579304532  |
| winter   | winter months      | 0.03154557514  |
| winter   | winter games       | 0.02047325546  |
| winter   | winter in          | 0.01653664354  |
| winter   | winter paralympics | 0.01612478389  |
| winter   | winter the         | 0.01427141550  |
| winter   | winter olympic     | 0.01283948489  |
| winter   | winter season      | 0.01271018012  |
| winter   | winter sports      | 0.00978406103  |
| winter   | winter break       | 0.00973138131  |
| winter   | winter war         | 0.00881188071  |
| winter   | winter is          | 0.00750925487  |
+----------+--------------------+----------------+

Predicting the probability of the next word using bigrams and trigrams (in descending order by probability):

$ dolt sql -q 'select b.bigram, t.trigram, cast(t.total_count AS decimal) / cast(b.total_count AS decimal) AS probability from bigram_counts b inner join trigram_counts t ON t.trigram like concat(b.bigram, " %") where b.bigram="winter and" order by probability desc limit 15'
+-------------+---------------------+---------------+
| bigram      | trigram             | probability   |
+-------------+---------------------+---------------+
| winter and  | winter and spring   | 0.18228404100 |
| winter and  | winter and summer   | 0.08146831207 |
| winter and  | winter and the      | 0.06714076553 |
| winter and  | winter and early    | 0.04308722025 |
| winter and  | winter and in       | 0.02311231960 |
| winter and  | winter and a        | 0.01955657812 |
| winter and  | winter and is       | 0.01223593391 |
| winter and  | winter and was      | 0.00930767622 |
| winter and  | winter and it       | 0.00669316043 |
| winter and  | winter and then     | 0.00658857980 |
| winter and  | winter and are      | 0.00554277348 |
| winter and  | winter and autumn   | 0.00512445095 |
| winter and  | winter and his      | 0.00512445095 |
| winter and  | winter and hot      | 0.00501987032 |
| winter and  | winter and during   | 0.00481070906 |
+-------------+---------------------+---------------+

Areas for improvement and conclusion

Natural language processing is an imperfect art, and therefore there will always be areas to improve and expand upon with these kinds of datasets. Unlike Google Ngrams, Wikipedia Ngrams does not include part of speech, nor does it consider the weight of page popularity. Furthermore, Wikipedia only has three months worth of dumps available for download, preventing us from backfilling dumps from the beginning of Wikipedia time. Many of the points made in this study about improving Google Ngrams reliability can also be applied to Wikipedia Ngrams.

However, Dolt provides the tools and flexibility to implement these improvements. Anyone can clone this dataset and make a branch with enhanced text processing, or add 4- and 5-gram tables, or make another dataset that merges Wikipedia Ngrams with a synonyms dataset to produce more accurate language prediction. This creates more dynamic data that will only continue to grow and improve over time.

Want to contribute to the code that constructed this dataset? Clone our DoltHub ETL jobs repo and submit a pull request with your changes and someone on our team will review it.

As excited as we are about Dolt, it is still a work in progress. SQL joins are currently experimental and we're working hard to get the performance where we want it. If you're using Dolt and want a specific feature or improved performance for a certain area, file an issue. We welcome feedback and would love to hear from you.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt