NOAA Global Hourly Surface Data

The National Oceanic and Atmospheric Administration (NOAA) publishes weather measurements taken from stations around the world. The dataset started in 1901 with a handful of stations, and today there are more than 35,000. Most of these stations record observations of air temperature, wind speed, visibility, and more at least 24 times per day. This fundamental dataset underlies much of the weather data you see online, on TV, in weather apps, and in numerous commercial products. If you want to forecast the weather, you will need this data to train your models. It is also vital when looking at climate change. John Oliver recently discussed this dataset and its importance on "Last Week Tonight". Check it out here.

The Source Data

NOAA's Global Integrated Surface Dataset's homepage is here, where you can find the details of the dataset and a link to download the raw data. The data is provided in a few different formats, but if you want data going all the way back to 1901, you'll need to parse the CSVs. The CSVs are archived in annual zip files, each containing a CSV for every station that took measurements in that year. The NCEI provides documentation on the CSV format here. This documentation is critical, as observations are provided as whole numbers with a scaling factor, and the scaling factor differs from metric to metric.

At a high level, each row of a CSV has the date and time the measurements were made and several metrics recorded at that time. Each CSV file contains a full year of a station's data. Only a few of the metrics a station might record are mandatory, as not all stations have the same capabilities, and over the years newer stations were brought online tracking additional metrics. Each row also contains all of the information on the station that made the recording, including the station's name, its GPS coordinates, its elevation, etc.

Another challenge is that some columns of the CSV hold values which are themselves quoted lists of comma-separated values. So within each CSV row there are several smaller CSVs you need to parse in order to extract the data into a meaningful format.
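
As a concrete sketch, the air temperature column packs a scaled whole number and a quality code into one field, per the NCEI documentation. Here is a minimal, hypothetical decoder for that one field (the field layout and the `9999` missing-value sentinel come from the format docs; the function name is my own):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTemp decodes a temperature field like "+0123,1": a sign-prefixed
// whole number scaled by a factor of 10, followed by a quality code.
// 9999 is the documented missing-value sentinel.
func parseTemp(field string) (float64, string, error) {
	parts := strings.Split(field, ",")
	if len(parts) != 2 {
		return 0, "", fmt.Errorf("unexpected temperature format: %q", field)
	}
	raw, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, "", err
	}
	if raw == 9999 {
		return 0, "", fmt.Errorf("missing observation")
	}
	return float64(raw) / 10.0, parts[1], nil
}

func main() {
	temp, quality, err := parseTemp("+0123,1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f°C (quality %s)\n", temp, quality) // prints "12.3°C (quality 1)"
}
```

Every metric has its own sub-format and scaling factor, so the real import needs a decoder like this per column.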

Another problem with this format is that in order to get the high and low temperature for a day, you would need to gather all the rows for that day and compare their values. There is also a fair amount of work required to understand the data before you can use it well.

Modeling the Data in Dolt

When modeling this data, I wanted to make it easy to answer questions about what happened on a specific day, as opposed to what happened on a specific hour. I wanted to make it easy to select date ranges in order to look at how the climate is changing. I also wanted to make it easier to look at a specific metric, as opposed to all the metrics for a day.

To accomplish this, each of the mandatory observations gets its own table, and a row contains up to 24 hourly observations for that metric, along with the high, low, average, and median observation for that day. There is also a separate table for the station data, rather than duplicating that data over and over again. All tables are keyed off the station's ID. The tables are:

  • stations
  • air_temp
  • visibility
  • wind_speed
  • dew_point_temp
  • sea_level_pressure
  • sky_ceiling_height

The unique thing that we can do in Dolt is model each day as a single commit. This makes it fast and easy to query a specific day by checking out the commit corresponding to that day. If you want to see how things have changed over time, you can query the Dolt history table for the metric you are interested in. If you want to see how air temperature has changed at Melbourne International Airport in Australia between January 1st, 2000 and today, you could run:

SELECT *
FROM dolt_history_air_temp
WHERE station = '94866099999' AND commit_date >= '2000-01-01'

The downside to this approach is that it makes it very difficult to correct past data. A query against this data assumes that any date appears only once in the commit history. If a mistake was made in a previous import, HEAD would need to be rewound to the problematic commit, and all the dates from that point forward would need to be re-imported. This is unlike Dolt datasets I have modeled in the past, where the date of the commit is not a core piece of the data. Another way to model this would be to include the date of the observations on each row. Modeled that way, a correction would be a simple matter of overwriting the data with the appropriate fixes, there would be no need to re-import everything after that point, and a diff would be able to show when corrections were made.

The Import Programs

There are two binaries used to import the data. The first processes the archive going all the way back to 1901. It was run one time to seed the data repository with all the historic data. The second is an incremental update job that runs daily. It attempts to download and process only new data that has become available since it was last run.

Archive Import

The archive import program loops over every year from 1901 to 2020 and downloads an archive containing all of that year's data. It extracts the archive and then reads through each of the per-station CSV files. It processes the data and generates a new commit for every day of the year. Each commit contains the full day's data for all the mandatory metrics.
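
The commit-per-day step boils down to bucketing a year's parsed rows by date and walking the dates in order. Here is a hedged sketch of that grouping, with a stripped-down row type standing in for the real parsed observations (the `observation` struct and `groupByDay` name are illustrative, not from the actual repository):

```go
package main

import (
	"fmt"
	"sort"
)

// observation is a trimmed-down stand-in for a parsed CSV row; the real
// import tracks every mandatory metric, not a single value.
type observation struct {
	station string
	date    string // "2006-01-02"
	value   float64
}

// groupByDay buckets a year's observations by date so the importer can
// write the tables and create one Dolt commit per day, in date order.
func groupByDay(obs []observation) (days []string, byDay map[string][]observation) {
	byDay = map[string][]observation{}
	for _, o := range obs {
		if _, seen := byDay[o.date]; !seen {
			days = append(days, o.date)
		}
		byDay[o.date] = append(byDay[o.date], o)
	}
	sort.Strings(days) // commits must be created oldest-first
	return days, byDay
}

func main() {
	days, byDay := groupByDay([]observation{
		{"94866099999", "2020-01-02", 21.5},
		{"94866099999", "2020-01-01", 19.0},
		{"72503014732", "2020-01-01", -3.2},
	})
	for _, d := range days {
		// one commit per date, containing that day's rows for all stations
		fmt.Println(d, len(byDay[d]))
	}
}
```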

Incremental Update

The incremental update program looks at the latest commit in the NOAA repository to determine what data needs to be downloaded. It then scrapes the NOAA website to find the list of all stations. Each station is associated with a single CSV file which holds all of its observations for the year, sorted by the date and time each observation was taken. We read from the end of each CSV file until we reach a date that we have already processed. After downloading the data for all the stations, we process it in the same manner as the archive import, generating a commit for each day.
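
Because the rows are sorted ascending by date, "read from the end until a processed date" amounts to walking backward to the first already-committed date and keeping everything after it. A minimal sketch of that cutoff logic, with an illustrative `row` type (not the repository's actual types):

```go
package main

import (
	"fmt"
)

// row is a minimal stand-in for one parsed line of a station's CSV.
type row struct {
	date string // "2006-01-02"; rows are sorted ascending by date
}

// newRows walks a station's rows from the end of the file backward, stops
// at the first date at or before the last committed date, and returns only
// the unseen rows in their original order.
func newRows(rows []row, lastCommitted string) []row {
	i := len(rows)
	for i > 0 && rows[i-1].date > lastCommitted {
		i--
	}
	return rows[i:]
}

func main() {
	rows := []row{{"2020-06-01"}, {"2020-06-02"}, {"2020-06-03"}}
	for _, r := range newRows(rows, "2020-06-01") {
		fmt.Println(r.date) // prints 2020-06-02 then 2020-06-03
	}
}
```

Comparing ISO-8601 date strings lexicographically is safe here, which keeps the cutoff check trivial.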

The code

The code is available here. There are three Go packages for these two programs.

  • all_from_archive - The code for importing the entire archive of data into Dolt.
  • update - The code that runs daily to update the NOAA data repository with data that has been added since it was last run.
  • noaaimp - Library code that is shared by both programs.

Get the Dataset

If you have Dolt installed you can now clone this data from Dolthub.

dolt clone dolthub/noaa

Once cloned, you can immediately start querying it using SQL. Here is a simple query to find the stations in New York state.

dolt sql -q "
    SELECT * 
    FROM stations 
    WHERE name LIKE '%NY US'"

More to Come

Part two uses this dataset to show how air temperatures have changed between 1950 and today. I'll cover how I queried the data and generated visualizations from it using Python and PIL. On Friday I'll talk about how Dolt history queries have been optimized to support modeling frequently updated data with the Dolt commit graph, as was done for this dataset.
