To monitor any model, it first needs to be registered with the model monitor. You can register a new model either through the UI or using the public APIs. In both cases the information about the model is captured through a Model Config JSON file.
A few things to keep in mind while registering a model:
- A data column can be only one of these column types: feature, prediction, timestamp, row identifier.
- A data column can be only one of these value types: numerical, categorical, string, datetime (datetime is only valid for the timestamp variable).
- Not all columns in the training data file need to be declared. Undeclared columns are ignored in all analysis within the model monitor. When you define the Model Config, you can ignore columns by omitting them from the variables attribute.
- Feature and prediction columns can only be of value type numerical or categorical.
- The prediction column is optional, but when declared there can be only one (in the current release).
  - When a prediction column (or row_identifier column) is not declared, Model Quality metrics cannot be monitored.
  - Data drift can still be monitored for all feature columns.
- The timestamp and row_identifier column types are optional.
  - When present, there can be only one timestamp and/or one row_identifier column.
  - Timestamp can only be of datetime value type.
  - Row_identifier can only be of string value type.
  - These two columns can be declared at the time of adding prediction data for the first time, but it is recommended that they be declared during model registration.
- The timestamp column should contain the date/time of when the prediction was made. It should be in a valid UTC format (ISO 8601).
  - When the timestamp column is not declared, the ingestion time of the prediction dataset into the Model Monitor is substituted as the timestamp of prediction.
- Row_identifier is used to uniquely identify each prediction row. It is typically referred to as a prediction ID, transaction ID, etc.
  - The row_identifier values are used to match ground truth data to the predictions to calculate model quality metrics.
  - When a row_identifier column (or prediction column) is not declared, Model Quality metrics cannot be monitored.
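As an illustration of the timestamp rule above, a UTC ISO 8601 value for the timestamp column can be produced with Python's standard library (a minimal sketch, not tied to any Model Monitor API):

```python
from datetime import datetime, timezone

# Produce a prediction timestamp in UTC ISO 8601 format,
# suitable for the timestamp column of a prediction dataset.
ts = datetime(2023, 5, 1, 14, 30, 0, tzinfo=timezone.utc).isoformat()
print(ts)  # 2023-05-01T14:30:00+00:00
```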
A Model Config JSON should capture all the information needed for registering a model. The easiest way to generate a Model Config file for a model is to use the Guided Flow in the Register Model flow. In Step 4, you can download the Model Config for future reference, for sharing with colleagues, or for offline edits.
This section provides details about the structure of the Model Config. Let's review a sample Model Config JSON to get a good idea of it.
{
"variables": [
{
"name": "age",
"valueType": "numerical",
"variableType": "feature"
},
{
"name": "y",
"valueType": "categorical",
"variableType": "prediction"
},
{
"name": "date",
"valueType": "datetime",
"variableType": "timestamp"
},
{
"name": "RowId",
"valueType": "string",
"variableType": "row_identifier"
}
],
"datasetDetails": {
"name": "BMAF-TrainingData-Webinar.csv",
"datasetType": "file",
"datasetConfig": {
"path": "BMAF-TrainingData-Webinar.csv",
"fileFormat": "csv"
},
"datasourceName": "dmm-shared-bucket",
"datasourceType": "s3"
},
"modelMetadata": {
"name": "test_psg",
"modelType": "classification",
"version": "2",
"description": "",
"author": "testadmin"
}
}
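A Model Config like the one above can be sanity-checked before registration. The checks below mirror the declaration rules listed earlier; this validator is a hypothetical helper written for illustration, not part of the Model Monitor API:

```python
import json

ALLOWED_VALUE_TYPES = {"numerical", "categorical", "string", "datetime"}

def check_model_config(config: dict) -> list:
    """Return a list of problems found in a Model Config dict."""
    problems = []
    variables = config.get("variables", [])
    for var in variables:
        if var.get("valueType") not in ALLOWED_VALUE_TYPES:
            problems.append(f"{var.get('name')}: unknown valueType")
        if var.get("variableType") in ("feature", "prediction") and \
                var.get("valueType") not in ("numerical", "categorical"):
            problems.append(f"{var.get('name')}: must be numerical or categorical")
    # At most one prediction, timestamp and row_identifier column each.
    for role in ("prediction", "timestamp", "row_identifier"):
        count = sum(1 for v in variables if v.get("variableType") == role)
        if count > 1:
            problems.append(f"more than one {role} column declared")
    return problems

config = json.loads("""{"variables": [
    {"name": "age", "valueType": "numerical", "variableType": "feature"},
    {"name": "y", "valueType": "categorical", "variableType": "prediction"}
]}""")
print(check_model_config(config))  # [] -> no problems found
```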
The key attributes of a Model Config JSON are listed below.
variables - An array of variables that declares all data columns that you want to analyze. For each member in the array, specify the 'name', 'variableType' and 'valueType'.
- name - Name of the column.
- variableType - Identifies the role of the column. Supported types are:
  - "feature": Input feature of the model. Data drift will be calculated for this data column. Needs to be declared while registering the model (along with its training data). The column needs to be present in all training and prediction datasets registered with the model.
  - "prediction": Output prediction of the model. Data drift and model quality metrics are calculated for this data column. While it is optional (model quality metrics will not be calculated for the model in this case), if declared, it has to be done while registering the model (along with its training data).
  - "timestamp": Identifies the column that contains the timestamp of when the prediction was made. If not declared, the ingestion time of the data in the Model Monitor is used as the timestamp of the prediction. Column values need to follow the ISO 8601 time format.
  - "row_identifier": Uniquely identifies each prediction. Used for matching ground truth labels to their corresponding prediction values. Model quality metrics will not be calculated if this column is not present. If used, it needs to be present in both prediction and ground truth datasets.
  - "ground_truth": Identifies the column that contains the ground truth labels in the ground truth datasets.
  - "sample_weight": Column that contains the weight to be associated with each prediction to calculate the Gini metric.
  - "prediction_probability": Column that contains the probability value for the model's prediction. Can be a single value (maps to the probability value of the positive class) or a list of values (in this case the length of the list has to match the number of unique prediction labels/classes present in the training dataset).
- valueType - Identifies the type of values in the column. Supported types are "categorical", "numerical", "datetime", or "string".
- forPredictionOutput - Used within the Ground Truth Config to specify which prediction column the ground truth variable represents.
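To illustrate the prediction_probability constraint above, the sketch below checks that a list-valued probability has one entry per unique training class (a hypothetical helper, not a Model Monitor API):

```python
def probability_shape_ok(prob, training_classes):
    """A prediction_probability value is either a single float
    (probability of the positive class) or a list whose length
    equals the number of unique training classes."""
    if isinstance(prob, (int, float)):
        return True
    return isinstance(prob, list) and len(prob) == len(set(training_classes))

classes = ["yes", "no"]
print(probability_shape_ok(0.87, classes))             # True: positive-class probability
print(probability_shape_ok([0.87, 0.13], classes))     # True: one value per class
print(probability_shape_ok([0.5, 0.3, 0.2], classes))  # False: wrong list length
```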
datasetDetails - Used to specify the data source details.
- "name": The name you want to associate with this dataset instance. You can use the same name as the file you are selecting.
- "datasetType": For this version, the only supported type is "file".
- "datasetConfig": Defines the actual location of the file.
  - "path": The name of the file.
  - "fileFormat": For the current version (4.6.0), the only supported format is "csv".
- "datasourceName": The name you provided when you created the data source.
- "datasourceType": One of the supported data source types in the current version. For this release, it is one of "s3", "gcs", "azure_blob", "azure_data_lake_gen1", "azure_data_lake_gen2", and "hdfs".
modelMetadata - Used to capture metadata related to the model. Specify the 'name', 'modelVersion', 'modelType', 'dataset', 'dateCreated', 'description' and 'author' attributes. 'dateCreated' needs to be in a valid UTC format (ISO 8601). Valid values for 'modelType' are 'classification' and 'regression'.
Bins are important in the Model Monitor for calculating probability distributions and divergence values for data drift. Binning affects not only the quality of the drift values, but also the performance of the tool. A high number of bins (greater than 20) usually causes noise/false alarms and considerably slows down computation and UI performance.
When a user has not specified a binning strategy, the Model Monitor uses the Freedman-Diaconis estimator to calculate the number of bins for numerical variables. This count is capped at 20 if the estimator returns a higher value. For numerical variables, the Model Monitor automatically adds one guard bin for values that fall outside the min-max range of the values present in the training data. For training data, this guard bin will have zero count (unless the 'binsEdges' override strategy mentioned below is used); for prediction data, values falling in it indicate that the prediction data has values outside the min-max range seen in the training data.
For categorical variables, the class values are used as bins. The Model Monitor automatically adds one guard bin, 'Untrained Classes'. For training data, this guard bin will have zero count (unless the 'binsCategories' override strategy is used); for prediction data, counts of all classes that were not present in the training data will fall in this bin. Users can use this bin to detect new classes previously unseen during training.
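The default numerical binning described above can be approximated with NumPy's Freedman-Diaconis estimator; this is a sketch of the idea, not the Model Monitor's internal code:

```python
import numpy as np

# Synthetic stand-in for a numerical training column.
rng = np.random.default_rng(0)
training_values = rng.normal(loc=50, scale=10, size=1000)

# Freedman-Diaconis bin edges for the training data.
edges = np.histogram_bin_edges(training_values, bins="fd")
n_bins = len(edges) - 1

# The Model Monitor caps the automatic bin count at 20.
n_bins = min(n_bins, 20)
print(n_bins)
```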
Users can override these defaults and fine-tune the bin creation using the following attributes in the Model Config JSON.
Note
Changing bins after a model has been successfully registered is not supported.
For numerical data columns, users can use one of two approaches:
binsNum - This takes a positive integer greater than or equal to 2 and less than 20 as input. The Model Monitor will create that many equal-sized bins for the numerical variable, using the max and min values in the training dataset to determine the bin widths. The Model Monitor will add two guard bins in addition to the user-defined bins.
Example of a valid 'binsNum':
"binsNum": 10
binsEdges - This takes an array of real numbers as input, corresponding to the actual bin edges. To create N user-defined bins, users need to provide N+1 bin edges. This is similar to the histogram_bin_edges function in NumPy. The Model Monitor will add two guard bins in addition to the user-defined bins.
- Edges can be both positive and negative decimal numbers (except Infinity).
- A minimum of 3 and a maximum of 20 numbers/edges can be provided in the array.
- They should be monotonically increasing (lowest to highest) from the start of the array to the end.
- All provided values should be unique; no duplicates.
Example of a valid 'binsEdges':
"binsEdges": [-10, -4.5, -0.25, 0, 3.2, 5.11111]
Examples of invalid 'binsEdges':
"binsEdges": [-10, 4, -0.25, 0, 3.2, 5.11111] → not monotonically increasing
"binsEdges": [-10, XYZ, -0.25, 0, 3.2, 5.11111] → string value present
"binsEdges": [1, 2] → fewer than 3 edges provided
"binsEdges": [1, 2, 2, 4, 6] → duplicates present
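The binsEdges rules above can be checked mechanically. A minimal sketch of such a validator (a hypothetical helper mirroring the listed rules, not part of the Model Monitor):

```python
import math

def valid_bins_edges(edges):
    """Check the binsEdges rules: 3-20 finite numeric edges,
    strictly increasing, with no duplicates."""
    if not (3 <= len(edges) <= 20):
        return False
    if not all(isinstance(e, (int, float)) and math.isfinite(e) for e in edges):
        return False
    # Strictly increasing also rules out duplicates.
    return all(a < b for a, b in zip(edges, edges[1:]))

print(valid_bins_edges([-10, -4.5, -0.25, 0, 3.2, 5.11111]))  # True
print(valid_bins_edges([-10, 4, -0.25, 0, 3.2, 5.11111]))     # False: not increasing
print(valid_bins_edges([1, 2]))                               # False: fewer than 3 edges
print(valid_bins_edges([1, 2, 2, 4, 6]))                      # False: duplicates
```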
For categorical data columns, users can use the following approach:
binsCategories - This takes an array of strings as input (length should be less than 100) and creates a bin for each of them. The values should ideally correspond to class values present in the data column in the training data, or class values the user expects to find in the prediction data. Counts of all other class values in the training and prediction data columns will fall in the 'Untrained Classes' guard bin. If the user has specified an 'Untrained Classes' bin as part of 'binsCategories', it will correspond to the internal 'Untrained Classes' bin.
Example of a valid 'binsCategories':
"binsCategories": ["red", "blue", "green", "white", "yellow"]
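The 'Untrained Classes' guard bin described above behaves like a catch-all bucket for class values outside the declared categories. A minimal sketch of that counting logic (an illustration, not the Model Monitor's internal implementation):

```python
from collections import Counter

bins_categories = ["red", "blue", "green", "white", "yellow"]
prediction_values = ["red", "red", "blue", "purple", "orange", "green"]

counts = Counter()
for value in prediction_values:
    # Values outside the declared categories fall into the guard bin.
    counts[value if value in bins_categories else "Untrained Classes"] += 1

print(counts["Untrained Classes"])  # 2  ("purple" and "orange")
print(counts["red"])                # 2
```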