The US National Oceanic and Atmospheric Administration (NOAA) collects climate data across the globe. NOAA provides location- and date-based climate records for thousands of weather stations around the world in the form of Global Historical Climatology Network (GHCN) data. This data includes:
- Minimum temperature,
- Maximum temperature,
- Precipitation,
- Snowfall,
- Wind,
- and much more.
Consider the following scenario: A friend in Berlin, Germany, asks you if they should buy an air conditioner. You can use climate data from NOAA to build a simple prediction tool that uses machine learning (ML) regression.
For this tutorial, you will use the Global Historical Climatology Network – Daily (GHCND) dataset. It comes in two flavors:
- Data arranged per location (weather station):
  Each record in a station data file contains a pre-set collection of data points (high/low temperature, wind speed, precipitation, etc.) for one day. This makes the data easy to review and understand.

  An initial approach would be to use historical data from a single station. The source data records the weather for every day of the last century or so. To simplify things, you could focus on Berlin and use only the data from the last couple of years. However, you will find that the relevant Berlin weather station shut down in the last two years, together with the airport (Tegel) that housed it. There are also gaps in the data (probably due to the world wars that took place in the area) that might need special consideration.
- Super GHCND data, which is the entire historical, global dataset in one file:
  In this large dataset (12 GB gzipped, ~103 GB uncompressed), each row contains one data point about one weather station per day. It includes station_id, date, data point, and value fields (e.g. the station in Potsdam, Germany, January 9, 1960, maximum temperature, 15 °C), followed by metadata about the observation. While this format is harder to read, it offers a foundation for collecting data in increments from this point on, as NOAA provides a daily diff file. Therefore, to maintain the data on your own server, you can download the daily updates and changes (which will also be used to demonstrate Domino’s model monitoring capability later on).

  The Super GHCND dataset contains a daily-updated file with the full history of all stations, called superghcnd_full_<creation date>.csv.gz (e.g. https://www.ncei.noaa.gov/pub/data/ghcn/daily/superghcnd/superghcnd_full_20230122.csv.gz), available from the GHCND index page, along with several metadata files:
  - ghcnd-countries – the list of country codes.
  - ghcnd-states – the list of states and provinces.
  - ghcnd-stations – the weather station names, codes, and location information.
  - ghcnd-inventory – an inventory listing the availability of data points for each weather station. For example, a station may offer daily high and low temperatures from 1929 to the current day, but wind speed is only available from 1944.
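To make the difference between the two flavors concrete, here is a minimal sketch of how the one-data-point-per-row layout of the Super GHCND file could be regrouped into one record per station and day. It rests on several assumptions that you should check against NOAA's documentation: that the columns appear in the order station ID, date (YYYYMMDD), element code, value, followed by flag and metadata columns; that element codes such as TMAX, TMIN, and PRCP are used; and that temperatures are stored in tenths of a degree Celsius. The station ID and the values below are made up for illustration and are not real NOAA data, and `pivot_rows` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows in the ASSUMED column order:
# station_id, date (YYYYMMDD), element, value, then flag/metadata columns.
# The station ID and values are invented for illustration only.
SAMPLE_ROWS = """\
XX000000001,19600109,TMAX,150,,,E,
XX000000001,19600109,TMIN,-20,,,E,
XX000000001,19600109,PRCP,0,,,E,
XX000000001,19600110,TMAX,120,,,E,
"""

# Assumption: these elements are stored in tenths of a degree Celsius.
TENTHS_OF_DEGREE = {"TMAX", "TMIN"}


def pivot_rows(lines):
    """Group one-data-point-per-row records into one dict per (station, day)."""
    days = defaultdict(dict)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or truncated lines
        station_id, raw_date, element, raw_value = row[:4]
        day = datetime.strptime(raw_date, "%Y%m%d").date()
        value = int(raw_value)
        if element in TENTHS_OF_DEGREE:
            value = value / 10  # convert to degrees Celsius
        # Other elements keep their raw integer value and original units.
        days[(station_id, day)][element] = value
    return dict(days)


wide = pivot_rows(io.StringIO(SAMPLE_ROWS))
for (station_id, day), elements in sorted(wide.items()):
    print(station_id, day.isoformat(), elements)
```

Because `pivot_rows` accepts any iterable of text lines, the same function could be fed a handle from `gzip.open(path, "rt")` instead of the in-memory sample. For the full file you would likely want to filter by station ID while streaming rather than hold every station in memory.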
-