In the first part of this two-part blog I
covered NOAA's "Global Hourly Surface Data" dataset and how it is
modeled in Dolt. Dolt is git for data,
and for this dataset we model a day of observations as a single commit in the commit graph.
In this second part I use this dataset to see how the daily high air temperature readings change over time at 236 fixed
stations between the years 1950 and 2019. I put the stations and the visualization of how their temperature readings
have changed in a Google Map which you can explore. This blog covers the methodology used in accessing the data and
creating the visualizations, and explains what the visualizations are actually showing. It does not attempt to
draw any conclusions, and leaves you to interpret the data for yourself.
The Google Map has 236 stations, each with a visualization of how the readings of the daily temperature highs have
changed over time. A preview of the map and some of the visualizations can be seen below. Each visualization shows how
the percentage of days whose highest reading was over the mean highest reading has changed over time.
See the full map (Note: when you click
on a station, Google Maps will give you information on the station that you've clicked on, and a preview
of the visualization for this station. The preview is cropped, and you'll need to click on it to see the full
visualization.)
Selecting the Stations
Before we can show how station readings have changed over time, we need to decide which stations we are going
to look at, and over what period of time. Prior to 1940 worldwide station coverage was relatively poor, with
760 stations globally, but by 1950 there were around 2,550 stations with broad geographic coverage. Using
that date, I first look in the
dolt_log system table to find the commits corresponding to the start and end of our time range.
SELECT commit_hash, date
FROM dolt_log
WHERE date = '1950-01-01 00:00:00' OR date = '2019-12-31 00:00:00'
ORDER BY date DESC;
| commit_hash                      | date                          |
| -------------------------------- | ----------------------------- |
| 6euavau0131fa7g4gr62oq7radt2n0f5 | 2019-12-31 00:00:00 +0000 UTC |
| t98o66v0u8b2j0tg817dshlgl8mtb8he | 1950-01-01 00:00:00 +0000 UTC |
I'll create branches corresponding to our commits for convenience, and check out the newest commit in our timeline.
dolt branch oldest_in_range t98o66v0u8b2j0tg817dshlgl8mtb8he
dolt branch newest_in_range 6euavau0131fa7g4gr62oq7radt2n0f5
dolt checkout newest_in_range
Now I want to find stations that were around at the start of our time interval, and still exist at the end of it. To get
this I select all the stations in the current table that weren't added between our oldest and newest commits. The
ability to query how a table has changed is a feature of Dolt, provided by the
dolt_diff system tables. Check out the system table docs.
SELECT station
FROM stations
WHERE station NOT IN (
    SELECT to_station AS station
    FROM dolt_diff_stations
    WHERE diff_type = 'added' AND from_commit = 'oldest_in_range' AND to_commit = 'newest_in_range'
);
Pulling the Data for our Stations
As covered in the first part of this blog,
every new day of observation data is stored in a single commit. If you were to query the
air_temp table you would be
able to see what the air temperatures were globally for one particular day. We want to look at data for the entire 70
year time interval, so we query the history of the
air_temp table using the dolt_history_air_temp system table and
store the results in a csv. We'll be looking at the max observed temperatures for our time range in Fairbanks, Alaska
and Perth, Western Australia as we create our visualizations, so we can go ahead and pull the data for those now.
dolt sql -r csv -q "
SELECT station, commit_date, max
FROM dolt_history_air_temp
WHERE station = '70261026411' AND commit_date >= '1950-01-01 00:00:00' AND commit_date <= '2019-12-31 23:59:59'
" > fairbanks_alaska-us.csv
dolt sql -r csv -q "
SELECT station, commit_date, max
FROM dolt_history_air_temp
WHERE station = '94608099999' AND commit_date >= '1950-01-01 00:00:00' AND commit_date <= '2019-12-31 23:59:59'
" > perth_metro-au.csv
Visualizing the Data
We'll be using a Python program to take the data and produce images using PIL (the Python Imaging Library). The code is
available here with the
main flow of the program living in
The process of coming up with a meaningful visualization of the data was iterative. I began
by taking the data and producing an image showing what the daily highs looked like for a given station over time. I
created an image where each row was a year, and each pixel was a day. A column shows the temperatures for a single day
in a calendar year across multiple years. Leap days were discarded.
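The row-per-year, pixel-per-day rendering can be sketched roughly like this (a simplified stand-in for the actual program; the `highs` dict, the color ramp, and the temperature bounds are all illustrative assumptions, not taken from the real code):

```python
from PIL import Image

# Sketch: one row per year, one pixel per day of that year.
# `highs` maps (year, day_of_year) -> daily high in degrees C; gaps stay absent.
def render_station(highs, start_year=1950, end_year=2019):
    years = end_year - start_year + 1
    img = Image.new("RGB", (365, years), (0, 0, 0))  # black marks missing days
    for (year, day), temp in highs.items():
        if 1 <= day <= 365:  # leap days (day 366) are discarded
            # Map roughly -40C..40C onto a blue-to-red gradient.
            t = max(0.0, min(1.0, (temp + 40) / 80))
            color = (int(255 * t), 0, int(255 * (1 - t)))
            img.putpixel((day - 1, year - start_year), color)
    return img

# A cold January day and a warm summer day in the first row.
img = render_station({(1950, 1): -20.0, (1950, 180): 25.0})
```

Columns then line up vertically by calendar day, which is what makes multi-year comparison possible at a glance.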
This visualization certainly conveys some information, but it has a lot of problems. Looking at Fairbanks you can clearly
see the seasons. You can see some hot years and some cold years. You can also see that this particular station is very
complete, in that there are almost no gaps in its readings. Now looking at Perth, you can still see the seasons, but
there is much less differentiation between the highs in the summer, and the highs in the winter. There isn't a single
date where the daily high is less than 0°C. These differences are not surprising given the differences in climate
between our two sample locations. Additionally, you can see gaps in the readings of this station. There are instances
where the station didn't have readings for a single day, and others where multiple years are missing from this one
station's record.
If climate were simple and temperatures were always rising or always falling, you would see an image that looked like a
funnel, where the winter months would appear to shrink in duration. But climate is very complex, and seeing trends
in this visualization of the data is impossible.
Next, I decided to take a look at the average daily high temperature for each calendar day, and to see how the daily highs
have changed over time relative to it. To calculate the averages of the daily highs, I added up the daily highs for
1950-01-01, 1951-01-01, 1952-01-01, ... 2019-01-01 and divided by the number of years in our 70-year time interval. I did
that for each day in the calendar year.
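The averaging step might look something like this (a sketch, not the actual program: `highs` is assumed to be a plain dict keyed by `(year, day_of_year)` built from the csv export, and dividing by the count of observed years rather than a fixed 70 is a small variation that tolerates missing days):

```python
from collections import defaultdict

# Compute the mean daily high for each calendar day across all years.
def mean_high_per_day(highs):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (year, day), temp in highs.items():
        totals[day] += temp
        counts[day] += 1
    # Average over however many years actually reported that calendar day.
    return {day: totals[day] / counts[day] for day in totals}

means = mean_high_per_day({(1950, 1): -20.0, (1951, 1): -10.0, (1950, 2): 5.0})
```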
I took this data and generated images showing the distance from the mean high temp for a calendar year, and the observed
daily high for each date in our time range.
Now both of these graphs look completely random. I can't look at them and come to any conclusion about what has happened
over time, and I certainly can't find a trend in that data... but what if I sorted each year? Our X-axis would no longer
correspond with a calendar day, but we would be able to see the number of days where the daily high was above the mean
high temperature, and that is certainly something we can draw a trend line for.
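The sorting idea can be sketched as follows (illustrative names again; `highs` and `means` are the hypothetical dicts from the averaging step):

```python
# Within each year, sort the deviations from the per-calendar-day mean
# so the x-axis no longer tracks the calendar. The count of positive
# values in a year is the number of days above the mean high.
def sorted_deviations(highs, means):
    by_year = {}
    for (year, day), temp in highs.items():
        by_year.setdefault(year, []).append(temp - means[day])
    return {year: sorted(devs) for year, devs in by_year.items()}

devs = sorted_deviations({(1950, 2): 7.0, (1950, 1): -18.0},
                         {1: -15.0, 2: 5.0})
```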
The result is something that can be easily interpreted, and shows changes over time very clearly. The biggest problem
with this new graph is that it doesn't handle missing data very well. The trend line is based on the number of days
over the average high temperature for a day. If 5 days were missing, even if only half of them were above the mean,
the rate of warming will be skewed downward. Basing the trend line on the number of days below the mean has similar
problems skewing the rate of cooling downward. Throwing out years where even a single data point is missing would really
decrease the amount of data we are working with. As stations have gotten more reliable, this would really impact the
older data in our range more than the newer data.
The final visualization renders the percentage of days above the mean and below the mean, and discards any year with
fewer than 355 valid data points. Trend lines are only drawn on images with at least 30 years having over 355
days of observations.
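The final metric, using the 355-valid-day cutoff described above, reduces to something like this (a sketch on the same hypothetical per-year deviation lists, not the actual program):

```python
# For each year with enough observations, report the percentage of days
# whose daily high exceeded the per-calendar-day mean.
def percent_above_mean(year_devs, min_days=355):
    results = {}
    for year, devs in year_devs.items():
        if len(devs) < min_days:
            continue  # discard years with too many missing observations
        above = sum(1 for d in devs if d > 0)
        results[year] = 100.0 * above / len(devs)
    return results

pct = percent_above_mean({2000: [1.0] * 200 + [-1.0] * 160,
                          2001: [1.0] * 10})
```

Working in percentages (rather than raw day counts) keeps years with a few missing days comparable to complete ones.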
Putting it on the Map
Before generating these images for the 1,300+ stations that have been around during our entire time window, I
threw out about half of them, as the csv files generated from the query contained so little data that they wouldn't be useful.
Once I had generated all the individual images, I went through and removed ones that did not have enough data to
generate a trend line. I ended up with 236 stations which were added to the Google Map that I generated based on their
latitude, longitude, and name from the "stations" table in dolt. Finally, I painstakingly added an image for each
station on the map manually.
While I was working on the data I played with the idea of clustering stations in order to fill in missing data. I found
several places where there were 2 stations within 10 km of each other. When clustering stations, I ordered them based
on which stations had the most data, and when data was missing attempted to backfill data for those dates with observations
from the other station(s). 8 of the stations on the map are clusters (They are the blue ones in the clusters layer).
Try it for Yourself
Install Dolt and you can clone the
NOAA dataset from DoltHub completely free.