The National Oceanic and Atmospheric Administration (NOAA) publishes weather measurements taken from stations around the world. It started in 1901 with a handful of stations, and there are more than 35,000 stations today. Most of these stations provide observations of air temperature, wind speed, visibility, and more at least 24 times per day. This fundamental dataset is the basis for much of the weather data you see online, on TV, in weather apps, and in numerous commercial products. If you want to be able to forecast the weather, you will need this data to train your models. It is also vital when looking at climate change. John Oliver recently discussed this dataset and its importance on "Last Week Tonight". Check it out.
The Source Data
NOAA's Global Integrated Surface Dataset's homepage is here, where you can find the details of the dataset and a link to download the raw data.
The data is provided in a few different formats, but if you want the data going all the way back to 1901, you'll need to parse the CSVs. The CSVs are archived in annual zip files, each of which contains a CSV for every station taking measurements in that year. The NCEI provides documentation on the format of the CSVs here. This documentation is critical, as the observations are provided as whole numbers with a scaling factor, and the scaling factor differs from metric to metric.
At a high level, each row of the CSV has the date and time that the measurements were made and several metrics that were recorded at that time. Each CSV file contains a full year of a station's data. Only a few of the metrics that a station might record are mandatory, as not all stations have the same capabilities, and over the years newer stations were brought online tracking additional metrics. Each row also contains all the information on the station that made the recording, including the station's name, its GPS coordinates, its elevation, etc.
Another challenge is that some of the columns of the CSV have values which are themselves a quoted list of comma-separated values. So within each CSV row there are multiple embedded CSVs that you need to parse in order to extract the data into a meaningful format.
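To make that concrete, here is a rough Go sketch of parsing one such embedded value. It assumes the layout the NCEI documentation gives for the TMP column, a scaled whole number followed by a quality code; the helper is illustrative, not code from the import programs:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTMP parses an embedded CSV value like "+0125,1": a scaled
// whole-number temperature followed by a quality code. Dividing by the
// documented scaling factor (10 for air temperature) recovers 12.5C.
func parseTMP(field string) (celsius float64, quality string, err error) {
	parts := strings.Split(field, ",")
	if len(parts) != 2 {
		return 0, "", fmt.Errorf("unexpected TMP field: %q", field)
	}
	if parts[0] == "+9999" { // documented missing-value sentinel
		return 0, parts[1], fmt.Errorf("missing observation")
	}
	raw, err := strconv.ParseFloat(parts[0], 64)
	if err != nil {
		return 0, "", err
	}
	return raw / 10.0, parts[1], nil
}

func main() {
	c, q, err := parseTMP("+0125,1")
	fmt.Println(c, q, err) // 12.5 1 <nil>
}
```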
Another problem with this format is that in order to get the high and low temperature for a day, you would need all the rows for that day and would have to compare the values across them. Additionally, there is a lot of work required to understand the data before you can use it well.
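Once the hourly values are parsed, computing the per-day aggregates is the easy part. Here is an illustrative Go function for the high, low, average, and median that the tables described below store per day (a sketch, not the actual import code):

```go
package main

import (
	"fmt"
	"sort"
)

// dailyStats sketches the per-day aggregation: given one day's hourly
// readings for a metric (at least one), produce the high, low, average,
// and median that are stored alongside them.
func dailyStats(temps []float64) (high, low, avg, median float64) {
	sorted := append([]float64(nil), temps...)
	sort.Float64s(sorted)
	low, high = sorted[0], sorted[len(sorted)-1]
	sum := 0.0
	for _, t := range sorted {
		sum += t
	}
	avg = sum / float64(len(sorted))
	if n := len(sorted); n%2 == 1 {
		median = sorted[n/2]
	} else {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return
}

func main() {
	fmt.Println(dailyStats([]float64{12.5, 14.0, 9.5, 11.0}))
}
```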
Modeling the Data in Dolt
When modeling this data, I wanted to make it easy to answer questions about what happened on a specific day, as opposed to what happened at a specific hour. I wanted to make it easy to select date ranges in order to look at how the climate is changing. I also wanted to make it easier to look at a specific metric, as opposed to all the metrics for a day. To accomplish this, each of the mandatory observations is in its own table, and a row contains up to 24 hourly observations for that metric, plus the high, low, average, and median observation for that day. There is also a separate table for the station data, rather than duplicating that data over and over again. All tables are keyed off the station's ID. The tables are one per mandatory metric, plus the station metadata table.
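As a rough sketch of what one of these per-metric tables looks like (every name below is an assumption on my part; the actual schema ships with the repository):

```sql
-- Illustrative only: real table and column names live in the
-- dolthub/noaa repository. One row per station, keyed by station ID,
-- holding the hourly readings plus the daily aggregates.
CREATE TABLE air_temp (
    station     VARCHAR(16) NOT NULL,
    obs_0       INT,  -- hourly observations, stored as scaled whole numbers
    obs_1       INT,
    -- ... obs_2 through obs_23 ...
    temp_high   INT,
    temp_low    INT,
    temp_avg    INT,
    temp_median INT,
    PRIMARY KEY (station)
);
```

Keying on the station alone works here because, in this model, the day a row describes is identified by the commit that contains it.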
The unique thing that we can do in Dolt is model each day as a single commit. This makes it fast and easy to query a specific day by checking out the commit corresponding to that day. If you then want to see how things have changed over time, you can query the Dolt history table for the metric you are interested in. If you want to see how air temperature has changed at Melbourne International Airport in Australia between January 1st, 2000 and today, you could run a query against that metric's history table with a filter like:
WHERE station = '94866099999' AND commit_date >= '2000-01-01'
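Filled out into a complete statement, that query might look like the following. Dolt exposes a table's history as dolt_history_<table>, which includes a commit_date column; air_temp and the temp_* columns are assumed names, as above:

```sql
-- Sketch of a full history query; table and metric column names are
-- assumptions, but commit_date is a real column of Dolt history tables.
SELECT commit_date, temp_high, temp_low
FROM dolt_history_air_temp
WHERE station = '94866099999' AND commit_date >= '2000-01-01'
ORDER BY commit_date;
```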
The downside to this approach is that it makes it very difficult to make corrections to past data. An assumption being made when you query this data is that any date will appear only once in the commit history. If a mistake was made in a previous import, HEAD would need to be rewound to the problematic commit, and all the dates from that point forward would need to be re-imported. This is unlike Dolt datasets I have modeled in the past, where the date of the commit is not a core piece of the data. Another way to model this would be to include the date of the observations on each row. Modeled that way, a correction would be a simple matter of overwriting the data with the appropriate values, there would be no need to re-import everything after that point, and a diff would be able to show exactly when corrections were made.
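A sketch of that alternative, again with assumed names; with the observation date in the primary key, a correction becomes a plain UPDATE of the affected rows:

```sql
-- Alternative modeling sketch (assumed names): the date lives on the
-- row, so fixing past data is an in-place overwrite rather than a
-- rewind and re-import.
CREATE TABLE air_temp_by_date (
    station   VARCHAR(16) NOT NULL,
    obs_date  DATE NOT NULL,
    temp_high INT,
    temp_low  INT,
    PRIMARY KEY (station, obs_date)
);
```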
The Import Programs
There are two binaries used to import the data. The first processes the archive going all the way back to 1901. It was
run one time to seed the data repository with all the historic data. The second is an incremental update job that runs
daily. It attempts to download and process only new data that has become available since it was last run.
The archive import program loops over every year from 1901 to 2020 and downloads an archive containing all the data for that year. It extracts the archive and then reads through each of the per-station CSV files. It processes the data and generates a new commit for every day of the year. This commit contains the full day's data for all the mandatory metrics.
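The shape of that loop, sketched in Go with hypothetical helpers standing in for the real functions in the all_from_archive package:

```go
package main

// Hypothetical helpers; the real implementations live in all_from_archive.
func downloadAndExtract(year int) []string { return nil } // per-station CSV paths
func importDay(files []string, day string) {}             // parse and write the day's rows
func doltCommit(message string)            {}             // create the commit for the day
func daysOfYear(year int) []string         { return nil } // "1901-01-01", ...

func main() {
	for year := 1901; year <= 2020; year++ {
		files := downloadAndExtract(year)
		for _, day := range daysOfYear(year) {
			importDay(files, day) // all mandatory metrics, every station
			doltCommit(day)       // one Dolt commit per day of observations
		}
	}
}
```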
The incremental update program looks at the latest commit to the NOAA repository to find what data needs to be downloaded. It then scrapes the NOAA website to find the list of all stations. Each station is associated with a single CSV file which holds all of its observations for the year. This file is sorted by the date and time the observations were taken. We read from the end of each CSV file until we reach a date that we have already processed. After downloading all the data for all the stations, we process the data in the same manner as the archive import, generating a commit for each day.
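The stopping rule is the interesting part: because each station's file is time-ordered, walking it newest-first lets us keep only the rows we have not yet seen. A sketch with an assumed Row type; lastImported would come from the latest commit:

```go
package main

import (
	"fmt"
	"time"
)

// Row is an assumed stand-in for one parsed observation.
type Row struct {
	Date time.Time
}

// newRows walks a station's rows newest-first and collects everything
// after lastImported, stopping at the first already-processed date.
func newRows(rows []Row, lastImported time.Time) []Row {
	var out []Row
	for i := len(rows) - 1; i >= 0; i-- {
		if !rows[i].Date.After(lastImported) {
			break // everything earlier has already been imported
		}
		out = append(out, rows[i])
	}
	return out
}

func main() {
	last := time.Date(2020, 6, 1, 0, 0, 0, 0, time.UTC)
	rows := []Row{
		{time.Date(2020, 5, 31, 0, 0, 0, 0, time.UTC)},
		{time.Date(2020, 6, 1, 0, 0, 0, 0, time.UTC)},
		{time.Date(2020, 6, 2, 0, 0, 0, 0, time.UTC)},
	}
	fmt.Println(len(newRows(rows, last))) // 1
}
```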
The code is available here. There are three Go packages for these two programs.
- all_from_archive - the code for importing the entire archive of data into Dolt.
- update - the code that runs daily to update the NOAA data repository with data that has been added since it was last run.
Get the Dataset
If you have Dolt installed, you can now clone this data from DoltHub:
dolt clone dolthub/noaa
Once cloned, you can immediately start querying it using SQL. Here is a simple query to find the stations in New York:
dolt sql -q "
WHERE name LIKE '%NY US'"
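The full statement depends on the schema; assuming the station metadata table described earlier is named stations, the query passed to dolt sql -q might look like:

```sql
-- "stations" is an assumed name for the station metadata table.
SELECT *
FROM stations
WHERE name LIKE '%NY US';
```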
More to Come
Part two uses this dataset to show how air temperatures have changed between 1950 and today. I'll cover how I queried the data and generated visualizations from it using Python and PIL. On Friday I'll talk about how Dolt history queries have been optimized in order to support modeling frequently updated data using the Dolt commit graph, as was done with this dataset.