Step 5: Develop Your Model

As you develop a model, you’ll want to be able to quickly execute code, see outputs, and make iterative improvements. If you have not done so, please complete Step 3 to start a MATLAB workspace.

In this section, we will use MATLAB to load, explore, and transform some data. After the data has been prepared, we will train a model.

Attention

Be sure to occasionally save your work by clicking the Save & Push All button in Domino.




Step 5.1: Load and Explore the Dataset

  1. Be sure you have downloaded and saved the tegel.csv dataset from Step 4 in this walkthrough. You should see the file in the Current Folder browser in MATLAB.

Current Folder MATLAB

  1. Use the New menu in the MATLAB desktop to create a new MATLAB Live Script. Save it as ber_hot_weather.mlx.

Tegel file

  1. Examine the file by typing the following command into your Live Script:

    opts = detectImportOptions('tegel.csv');
    
  2. Next, load tegel.csv into MATLAB. The file has four columns we want to keep (the rest we can ignore):

    • DATE – date the temperature was read
    • PRCP – total precipitation for the day
    • TMIN – lowest temperature measured that day
    • TMAX – highest temperature measured that day

Run the following commands to select these columns and assign data types to them. The last two lines load the data with the readtable() function, passing the import options object as the second argument, and preview the first rows:

opts.SelectedVariableNames = {'DATE', 'PRCP', 'TMIN', 'TMAX'};
opts = setvartype(opts, {'DATE','PRCP','TMIN','TMAX'},{'datetime','double', 'double', 'double'});
berWeatherTbl = readtable("tegel.csv", opts);
head(berWeatherTbl)

The result of running this command should look similar to the following:

Result of loading
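To see how the import options control which columns are loaded, here is a minimal sketch on a small stand-in file. The file name, its contents, and the sample* variable names are hypothetical and are not part of the walkthrough data.

```matlab
% Hypothetical three-row file written to a temporary folder, for illustration only
sampleFile = fullfile(tempdir, 'sample_weather.csv');
sampleTbl = table(["2000-01-01"; "2000-01-02"; "2000-01-03"], [0; 12; 5], ...
    [-15; 3; 10], [31; 58; 44], [1; 2; 3], ...
    'VariableNames', {'DATE', 'PRCP', 'TMIN', 'TMAX', 'EXTRA'});
writetable(sampleTbl, sampleFile);

% Detect the import options, then keep and type only the columns of interest
sampleOpts = detectImportOptions(sampleFile);
sampleOpts.SelectedVariableNames = {'DATE', 'TMAX'};
sampleOpts = setvartype(sampleOpts, {'DATE', 'TMAX'}, {'datetime', 'double'});
sampleLoaded = readtable(sampleFile, sampleOpts);

% Only the selected columns are returned; PRCP, TMIN and EXTRA are skipped
disp(sampleLoaded.Properties.VariableNames)
```

Selecting columns at import time keeps the table small, which matters for a file that spans many decades of daily readings.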

  1. Create a new section in your script by clicking the Section Break button.

Section Break Button

In this section, we’ll format the dates into a Year/Month/Day format and store each field as a table variable. This will help with examining the data. Run the following command:

[berWeatherTbl.year, berWeatherTbl.month, berWeatherTbl.day] = ymd(berWeatherTbl.DATE);
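As a quick check of what ymd() returns, this sketch applies it to a single date. The date and the sample* variable names are hypothetical.

```matlab
% Hypothetical single date, used only to illustrate ymd()
sampleDate = datetime(2019, 7, 30);
[sampleYear, sampleMonth, sampleDay] = ymd(sampleDate);
% sampleYear is 2019, sampleMonth is 7, sampleDay is 30
```

When ymd() is given a whole datetime column, as in the command above, each output is a numeric column of the same height.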

To speed up processing of the data, our data set will be limited to temperatures between January 2000 and December 2019 (currently, the last full year of data), inclusive. Let’s remove the rows with data outside that range.

berWeatherTbl = berWeatherTbl(berWeatherTbl.year > 1999 & berWeatherTbl.year < max(berWeatherTbl.year) , :);
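The filter keeps years after 1999 and drops the most recent year in the file, which is assumed to be incomplete. A minimal sketch on a hypothetical year column shows the effect:

```matlab
% Hypothetical year column; the last year (2020) stands in for a partial year
sampleYears = [1998; 1999; 2000; 2010; 2019; 2020];

% Same logical test as above: after 1999 and strictly before the latest year
keepRows = sampleYears > 1999 & sampleYears < max(sampleYears);
keptYears = sampleYears(keepRows);   % 2000, 2010, 2019
```

Because the upper bound comes from max(), the filter adapts to whatever the latest year in your copy of the file is.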

Notice that the values in the TMAX and TMIN columns look a bit odd (apart from entries marked NaN, or “not a number”). NOAA records temperatures in tenths of a degree Celsius, so we need to divide all temperature data by 10 to get temperatures in whole degrees Celsius. Run the following commands:

berWeatherTbl.TMAX = berWeatherTbl.TMAX/10;
berWeatherTbl.TMIN = berWeatherTbl.TMIN/10;

As with most data sources, some values may be missing, so we’ll fill the gaps with linearly interpolated values by running the following command:

berWeatherTbl = fillmissing(berWeatherTbl, 'linear');
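To illustrate what linear interpolation does to a gap, here is a small sketch on a hypothetical column of readings (the values are made up):

```matlab
% Hypothetical daily highs with one missing reading
sampleTemps = [20; NaN; 24; 25];

% 'linear' fills each gap from the nearest non-missing neighbours
filledTemps = fillmissing(sampleTemps, 'linear');
% The gap is replaced by the midpoint of its neighbours: 20, 22, 24, 25
```

Interpolation suits slowly varying series such as daily temperatures; it is less meaningful for long gaps.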

Preview the start of the table by running the head() function as follows:

head(berWeatherTbl)

The result should look like the following:

Table after formatting

  1. Create another section in your live script. In this section, we’ll calculate how many hot days have occurred in Berlin since the year 2000. To calculate this, we’ll first need to define “hot day”. We’ll use 29 degrees Celsius for our definition, and use this as our baseline threshold, as such:

    hotDayThreshold = 29;
    
  2. Now let’s figure out how many hot days have occurred since (and including) the year 2000. To do this, we’ll create a table column indexing the days with maximum temperatures (TMAX) that meet or exceed the hot day threshold.

    berWeatherTbl.HotDayFlag = berWeatherTbl.TMAX >= hotDayThreshold;
    

Next, we’ll use groupsummary() to count how many hot days were flagged in each year:

numHotDaysPerYear = groupsummary(berWeatherTbl, 'year', 'sum', 'HotDayFlag');

We will repeat the same approach to find the highest temperature of each year:

maxTempOfYear = groupsummary(berWeatherTbl, 'year', 'max', 'TMAX');

We will finally combine the two variables to create a table called annualMaxTbl:

annualMaxTbl = join(numHotDaysPerYear, maxTempOfYear);
annualMaxTbl.Properties.VariableNames = {'Year', 'daysInYear', 'hotDayCount', 'maxTemp'};
annualMaxTbl

The table should look like the following:

Annual Max Table
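The grouping and join steps can be tried on a tiny table first. This is a sketch on hypothetical data; toyTbl and the other names are illustrative only.

```matlab
% Hypothetical mini-table: two years with a few daily highs each
toyTbl = table([2000; 2000; 2001; 2001; 2001], [30; 25; 31; 29; 20], ...
    'VariableNames', {'year', 'TMAX'});
toyTbl.HotDayFlag = toyTbl.TMAX >= 29;

% One row per year: number of flagged days, then the highest TMAX
hotPerYear = groupsummary(toyTbl, 'year', 'sum', 'HotDayFlag');
maxPerYear = groupsummary(toyTbl, 'year', 'max', 'TMAX');

% join() matches rows on the variables both tables share (year and GroupCount)
toySummary = join(hotPerYear, maxPerYear);
% Variables: year, GroupCount, sum_HotDayFlag, max_TMAX
```

The four default variable names produced here are the ones that the walkthrough renames to Year, daysInYear, hotDayCount and maxTemp.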

  1. Create another section in your live script. In this section, we’ll visualize the weather data using a chart that combines a bar graph and a line graph. The chart will use two y-axes. The bar graph will represent the hot day count for each year, and the line graph will represent the highest temperature of each year in degrees Celsius. The y-axis on the left side of the chart will correspond to the hot day count, and the y-axis on the right side will correspond to the highest annual temperature.

Start with the hot day count bar graph:

figure
hold on
yyaxis left
bar(annualMaxTbl.Year,  annualMaxTbl.hotDayCount, 'FaceColor', 'b');

Add a title and labels for the x-axis and the left-side y-axis:

titleText = sprintf("%s%d%s%d%s%d", "Number of hot days (over ", ...
    hotDayThreshold,"\circC) - ", min(annualMaxTbl.Year), "-", max(annualMaxTbl.Year));
title(titleText)
ylabel("Hot days per year")
xlabel("Year")

Now draw the line plot for the highest temperature each year:

yyaxis right
ylabel("Highest Annual Temperature in \circC")

plot(annualMaxTbl.Year, annualMaxTbl.maxTemp, 'Color', 'r', "Marker","*")
hold off

Your chart should look something like this:

Number of hot days chart




Step 5.2: Generating Predictions from the Data

In this step, we will use an interactive machine-learning MATLAB application called Regression Learner to develop a model that can predict the daily maximum temperature for the coming year.

First, let’s prepare the data that we will use with Regression Learner. We’ll partition the data into two sets: one set to train the model, and a second set to test the model.

  1. Create a new section in the live script and remove the HotDayFlag column.

    berWeatherTbl.HotDayFlag = [];
    

Partition the data:

cv = cvpartition(berWeatherTbl.year, 'Holdout', 0.3);
dataTrain = berWeatherTbl(cv.training, :);
dataTest = berWeatherTbl(cv.test, :);

  1. Next, find the Regression Learner app by clicking on the Apps tab in the MATLAB toolbar. If you do not see the Regression Learner app, you may need to expand the full app list by clicking on the arrow to the far right of the list.

image1

image2

Attention

If Regression Learner still does not appear in the apps list after expanding the list, please reach out to your IT team or your MathWorks account manager for assistance.

  1. Click on the Regression Learner icon. The application will open in a new, blank window.

Tip

You can use the Domino desktop window controls to navigate between windows (in case the Regression Learner window disappears behind the MATLAB window).

image3

image4

  1. Click the New Session button and select “From Workspace” in the dropdown.

image5

  1. A “New Session” window will open. Here we can specify the input variables that should be used for prediction in our model, as well as the outputs (or “response”) you would like to predict – for us, that means the maximum temperature.

For the input variable, select dataTrain, under the “Workspace Variable” section of the window.

image6

For the output, select TMAX (maximum temperature), under the “Response” section of the window. Click the Start Session button.

image7

The Regression Learner window will refresh and display the original data set and values of TMAX.

image8

  1. Next, select the type of model that should be used for model training. For this walkthrough, we’ll use the “Fine Gaussian SVM” and “Coarse Tree” models and compare the results. Feel free to select additional models if you’d like. Alternatively, you can select all models and compare the results for the best fit for your data.

image9

Attention

Regression Learner runs best on a container with multiple cores. Multiple cores allow it to run in parallel and produce models rapidly. If you are using a single-core container, please turn off parallel processing by clicking the “Use Parallel” button in Regression Learner.

  1. Click the Train button in the Regression Learner toolbar to start the model training process. The Domino container will spin up a “parallel pool”, a set of MATLAB workers that lets several models train at the same time. Once the models finish training, the model list highlights the model with the lowest validation error (RMSE). Several visualizations are offered to show how well a model fits the data (the visualization shown below is the default visualization).

image10

Clicking the Predicted vs. Actual Plot button in the toolbar displays a chart that plots each value predicted by the model against the true value from the data. The closer the points lie to the diagonal, the more accurate the predictions.

image11

  1. In a later step, we’ll deploy a model using Domino. To do that, we first need to create a function that can be used for model deployment. We can do that with Regression Learner. Click the Generate Function button.

image12

MATLAB will generate the function in an M-file (as shown below). Save the file as trainRegressionModel.m.

image13

  1. Let’s export the model to your Domino workspace so that we can use it for predictions. Navigate back to the Regression Learner window and export the model as shown below.

image14

Give the model a name, such as weatherModel, and click “OK”.

image15

Close the Regression Learner app (confirm your decision in the pop-up) and you’ll see the trained model available to you in your workspace.

image16

You will also notice that the Command Window shows information on how to use the model to make predictions, specifically with the following line of code:

yFit  = weatherModel.predictFcn(T);

This line of code takes a table of data as input and returns the predictions as a numeric column vector, one value per input row. The input table must contain the same predictor variables that we had in berWeatherTbl: date, precipitation, minimum temperature, month, day and year. It does not need to include TMAX, since that is the value being predicted; if a TMAX column is present, it is ignored. The predicted TMAX values are returned in yFit.
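Before calling the prediction function, it can help to confirm that the input table carries every predictor column. The sketch below uses a hypothetical predictor list and a hypothetical one-row input table. The exported model may also list its predictors in a RequiredVariables field, but treat that as an assumption to verify in your MATLAB release.

```matlab
% Hypothetical predictor list, mirroring the columns of berWeatherTbl minus TMAX
requiredVars = {'DATE', 'PRCP', 'TMIN', 'year', 'month', 'day'};

% Hypothetical input table that is missing the 'day' column
inputTbl = table(datetime(2021, 6, 1), 0, 12.5, 2021, 6, ...
    'VariableNames', {'DATE', 'PRCP', 'TMIN', 'year', 'month'});

% List the required variables that the input table does not provide
missingVars = setdiff(requiredVars, inputTbl.Properties.VariableNames);
if ~isempty(missingVars)
    fprintf('Missing predictor columns: %s\n', strjoin(missingVars, ', '));
end
```

A check like this gives a clearer message than the error raised when a predictor column is absent.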

  1. Let’s test the model with the data we partitioned earlier. Create a new section in your live script. Use the model with the test data, using the function call we saw in the Command Window:

    yFit  = weatherModel.predictFcn(dataTest);
    

Now compare the results column to the actual values in the test data set:

err = yFit - dataTest.TMAX;

Finally, draw a histogram to visualize the results:

figure;
histogram(err)
xlim([-15 15])
ylabel('Number of predictions');
xlabel('Gap with actual test data')

The result should look like the following:

image17
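Alongside the histogram, the errors can be summarized in one or two numbers. This sketch computes the root-mean-square error and the mean absolute error on hypothetical vectors that stand in for yFit and dataTest.TMAX; the values are made up for illustration.

```matlab
% Hypothetical predictions and actual values (stand-ins for yFit and dataTest.TMAX)
yFitSample   = [21.0; 25.5; 30.0; 18.0];
actualSample = [20.0; 26.5; 28.0; 18.0];

errSample = yFitSample - actualSample;     % 1, -1, 2, 0
rmseValue = sqrt(mean(errSample.^2));      % root-mean-square error
maeValue  = mean(abs(errSample));          % mean absolute error
```

Both values are in degrees Celsius, so they can be read directly as a typical size of the prediction error.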

Now that we’ve seen that the model works, let’s save it for later use. In the Command Window, enter:

save weatherModel weatherModel

image18

  1. We are finally ready to use our model to predict the weather for the next year. We will generate a table with next year’s dates and add randomly selected, historical precipitation and minimum temperature data to the table (for the same dates), which we’ll need for the model to properly make predictions.

Create a new section in your live script, then create a new table with date and temperature input data:

todayDate = datetime('today');
daysIntoFuture = 365;
endDate = todayDate + days(daysIntoFuture);
predictedMaxTemps = table('Size', [daysIntoFuture+1 7], 'VariableTypes', ...
    {'datetime', 'double', 'double', 'double', 'double', 'double', 'double'}, ...
    'VariableNames', berWeatherTbl.Properties.VariableNames);
x=1;

Next, loop through each day from today to endDate and populate the table.

for i = todayDate:endDate
    [y, m, d] = ymd(i);
    % Historical readings taken on the same calendar day in earlier years
    sameDay = berWeatherTbl.month == m & berWeatherTbl.day == d;
    prcps = berWeatherTbl.PRCP(sameDay);
    minTemps = berWeatherTbl.TMIN(sameDay);
    historicalRowCount = numel(minTemps);
    % Draw a random historical minimum temperature and precipitation value
    randomRow = randi([1 historicalRowCount]);
    predictedMaxTemps.TMIN(x) = minTemps(randomRow);
    randomRow = randi([1 historicalRowCount]);
    predictedMaxTemps.PRCP(x) = prcps(randomRow);
    predictedMaxTemps.DATE(x) = i;
    predictedMaxTemps.year(x) = y;
    predictedMaxTemps.month(x) = m;
    predictedMaxTemps.day(x) = d;
    predictedMaxTemps.TMAX(x) = 0;   % placeholder; the model predicts this value
    x = x + 1;
end

head(predictedMaxTemps)

The result is a preview of a table of historical precipitation and minimum temperature values for the upcoming dates, which we can use as input for our weather predictions. The TMAX column holds 0 as a placeholder; the predicted values are returned separately when the table is run through the model.

image19
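Because the loop draws random historical rows, each run produces a different input table. If you want repeatable results, one option is to seed the random number generator before the loop. The sketch below shows the effect; the seed value is arbitrary and the variable names are hypothetical.

```matlab
% Seed the random number generator so repeated runs draw the same values
rng(42);                          % 42 is an arbitrary, hypothetical seed
firstDraw = randi([1 20], 5, 1);

rng(42);                          % reset to the same seed
secondDraw = randi([1 20], 5, 1);

sameDraws = isequal(firstDraw, secondDraw);   % true: both draws are identical
```

Leaving the generator unseeded is also reasonable here, since the random sampling is only meant to produce plausible input values.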

  1. Let’s run the model!

    yFit = weatherModel.predictFcn(predictedMaxTemps);
    result = table(predictedMaxTemps.DATE, yFit, 'VariableNames', {'Date', 'Predicted TMAX'})
    

image20

Your very own, AI-driven, weather prediction!

  1. Let’s plot the forecast and then count how many hot days it contains:

    figure
    plot(result.Date, result.("Predicted TMAX"))
    titleText = sprintf("%s%d%s", "Weather forecast for the next ", daysIntoFuture, " days in Berlin, Germany (\circC)");
    title(titleText)
    ylabel('Forecasted Daily High Temperature')
    

image21

We can also predict how many hot days will happen during the next year:

hotWeatherDaysIdx = result(result.("Predicted TMAX") >= hotDayThreshold, :);
height(hotWeatherDaysIdx)

The result on September 10, 2020 was a prediction of 17 hot days between September 2020 and September 2021. Your results may vary depending on the dates, data, and model used.
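If only the count is needed, the intermediate table can be skipped by counting the true entries of the logical test directly. This is a sketch on hypothetical predicted values; the names ending in Sample are illustrative.

```matlab
% Hypothetical predicted daily highs and the same threshold used earlier
hotDayThresholdSample = 29;
predictedSample = [27.5; 29.0; 31.2; 28.9];

% nnz() counts the nonzero (true) entries of the logical vector
hotDayCountSample = nnz(predictedSample >= hotDayThresholdSample);   % 2
```

The comparison uses >= so that a day exactly at the threshold counts as hot, matching the definition of HotDayFlag.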

To export your model, all you need to do is save it into a MAT file:

save weatherModel weatherModel

Anyone in your Domino project will be able to load it later using:

load weatherModel.mat
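The same round trip can be written with the function forms of save and load, which makes the file path explicit and avoids overwriting workspace variables on load. The sketch below uses a hypothetical stand-in struct in place of the trained model and a temporary file; none of these names come from the walkthrough.

```matlab
% Hypothetical stand-in for the trained model exported from Regression Learner
weatherModelSample = struct('predictFcn', @(t) t.TMIN + 8);

% Save the variable to a MAT-file in a temporary folder
matFile = fullfile(tempdir, 'weatherModelSample.mat');
save(matFile, 'weatherModelSample');

% load() with an output returns a struct with one field per saved variable
loaded = load(matFile);
restoredModel = loaded.weatherModelSample;

% Call the restored prediction function on a one-row hypothetical input table
samplePrediction = restoredModel.predictFcn(table(10, 'VariableNames', {'TMIN'}));
```

Note that file names are case-sensitive on Linux-based workspaces, so the name passed to load must match the saved file exactly.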