Reference Projects

Domino Reference Projects is a collection of open-source solutions provided by Domino Data Lab. These projects are freely available to the data science/machine learning community and were built with the following goals in mind:

  • to educate practitioners on a specific data science topic

  • to demonstrate how to implement a specific analytical method or task in the Domino MLOps Platform, including any relevant best practices

  • to provide an easy way to share pre-built assets where possible, such as a Launcher, Scheduled Job, App, Endpoint, and so on

  • to facilitate onboarding of new team members, by providing end-to-end implementations that they can use to get experience with the platform

The projects follow a common baseline, where a certain use case is developed using Python or R. The data sets used by the projects are based on freely available collections of data, and are either bundled with the reference project or freely available for download from external sources.

Typically, the projects contain a Jupyter notebook, which provides background and context for the specific use case. In addition, the majority of the projects also provide the relevant scripts for the purposes of operationalisation (such as model retraining job scripts, Model API scripts, web applications, and so on). The projects and all accompanying assets are available on GitHub.

The table below lists the reference projects that are currently available. The Domino team is actively working on expanding the number of projects available.

Project Name                | Brief description                                               | GitHub Link
Credit Card Fraud Detection | Detection of credit card transaction fraud using XGBoost        | https://github.com/dominodatalab/domino-reference-project-fraud-detection
Named Entity Recognition    | Locating and classifying named entities with a BiLSTM-CRF model | https://github.com/dominodatalab/domino-reference-project-ner

Note that the GitHub repositories contain detailed instructions for using the project assets and additional information for creating a dedicated compute environment if needed.

Getting the project assets into your Domino installation can be accomplished by either importing the relevant GitHub repository into your project or by directly leveraging Git-based projects.

Credit Card Fraud Detection Reference Project

Credit card fraud is a significant problem for financial institutions, and reliable fraud detection is generally challenging. This project can be used as a template for training a machine learning model on a real-world credit card fraud dataset, and it demonstrates techniques such as oversampling and threshold moving to address class imbalance.
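The two imbalance techniques mentioned above can be illustrated in a few lines of NumPy. This is a self-contained toy sketch, not code from the project: the data is synthetic and the uniform random numbers stand in for a trained model's predicted probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 95 legitimate transactions (0), 5 fraudulent (1).
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(loc=y[:, None].astype(float), scale=1.0, size=(100, 2))

# Oversampling: resample the minority class with replacement until the
# classes are balanced (the project uses imblearn for this step).
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=90, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Threshold moving: instead of the default 0.5 cut-off on the predicted
# fraud probability, lower the threshold to trade more false positives
# for higher recall on the rare class.
probs = rng.uniform(size=20)  # stand-in for model.predict_proba output
flagged_default = (probs >= 0.5).sum()
flagged_moved = (probs >= 0.2).sum()
```

A common practice, followed here only implicitly, is to apply oversampling to the training split alone, so that evaluation still reflects the true class balance.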

The dataset used in this project has been collected as part of a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles, and the raw data can be freely downloaded from Kaggle.

The assets included in the project are:

  • FraudDetection.ipynb - a notebook that performs exploratory data analysis, data wrangling, hyperparameter optimisation, model training, and evaluation. The notebook introduces the use case and discusses the key techniques needed for implementing a classification model (such as oversampling, threshold moving, and so on).

  • model_train.py - a training script that can be operationalised to retrain the model on demand or on a schedule. The script can be used as a template; the key elements that need to be customised for other datasets are:

    • load_data - data ingestion function

    • feature_eng - data wrangling

    • xgboost_search - more specifically, the values in params, which define the grid search scope

  • model_api.py - a scoring function that exposes the persisted model as a Model API. The score function accepts as arguments all independent parameters of the dataset and uses the model to compute the fraud probability for the individual transaction.
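To make the customisation points concrete, here is a minimal Python sketch of the shape of model_train.py and model_api.py. Everything below is illustrative, not the project's actual code: the function bodies, the params grid values, and the score signature (shown with three features rather than the dataset's full set of independent variables) are assumptions.

```python
import pandas as pd

# --- model_train.py customisation points (illustrative bodies) ---

def load_data(path):
    # Data ingestion: swap this out for your own source.
    return pd.read_csv(path)

def feature_eng(df):
    # Data wrangling: e.g. drop columns the model should not see.
    return df.drop(columns=["Time"], errors="ignore")

# Grid-search scope consumed by xgboost_search; these values are placeholders.
params = {
    "max_depth": [3, 6],
    "learning_rate": [0.1, 0.3],
    "n_estimators": [100, 200],
}

# --- model_api.py scoring function (stub in place of the persisted model) ---

class _StubModel:
    def predict_proba(self, rows):
        return [[0.9, 0.1] for _ in rows]  # constant 10% fraud probability

_model = _StubModel()

def score(V1, V2, Amount):
    # The real score function takes every independent feature of the dataset.
    prob_fraud = _model.predict_proba([[V1, V2, Amount]])[0][1]
    return {"fraud_probability": prob_fraud}
```

Keeping the ingestion, wrangling, and search-scope pieces as separate functions is what makes the script reusable as a template: adapting it to a new dataset means replacing those three pieces and leaving the training loop alone.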

Note

You need to unzip the dataset/creditcard.csv.zip file before running any of the above.

This project uses two additional Python packages that are not included in the Domino standard environments - imblearn and xgboost. You can either customise a duplicate copy of the Domino Standard Environment or create a new environment with the Dockerfile instructions given in the README.md file of the project.

Named Entity Recognition

Named Entity Recognition (NER) is an NLP problem that involves locating and classifying named entities (people, places, organizations, etc.) mentioned in unstructured text. NER underpins many NLP applications, including machine translation, information retrieval, and chatbots. In this project, we fit a BiLSTM-CRF model using a freely available annotated corpus and Keras.

The dataset used in this project is the Annotated Corpus for Named Entity Recognition. This dataset is based on the GMB (Groningen Meaning Bank) corpus and has been tagged, annotated and built specifically to train a classifier to predict named entities such as name, location, etc.

The assets included in the project are:

  • ner.ipynb - a notebook that performs exploratory data analysis, data wrangling, hyperparameter optimisation, model training, and evaluation. The notebook introduces the use case and discusses the key techniques needed for implementing an NER classification model.

  • model_train.py - a training script that can be operationalised to retrain the model on demand or on a schedule. The script can be used as a template; the key elements that need to be customised for other datasets are:

    • load_data - data ingestion function

    • pre_process - data wrangling

    The majority of the remaining important parameters are controlled via command-line arguments to the script.

  • model_api.py - a scoring function that exposes the persisted model as a Model API. The score function accepts a string of plain text and outputs the tokenized version of the text with the corresponding IOB tags.
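As with the fraud project, the customisation points and the scoring interface can be sketched in a few lines. This is an illustrative outline only: the corpus format, the command-line flags, and the stub tagger (which labels every token "O") are assumptions, not the project's actual implementation.

```python
import argparse

# --- model_train.py customisation points (illustrative bodies) ---

def load_data(path):
    # Data ingestion: read the annotated corpus (line format assumed here).
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

def pre_process(lines):
    # Data wrangling: e.g. split tab-separated "token<TAB>tag" lines.
    return [tuple(line.split("\t")) for line in lines if line]

def parse_args(argv=None):
    # The remaining important parameters arrive as command-line arguments.
    p = argparse.ArgumentParser()
    p.add_argument("--epochs", type=int, default=5)
    p.add_argument("--batch-size", type=int, default=32)
    return p.parse_args(argv)

# --- model_api.py scoring function (stub in place of the BiLSTM-CRF) ---

def _predict_tags(tokens):
    return ["O"] * len(tokens)  # stub: tag every token as outside an entity

def score(text):
    # Plain text in, tokenised text with the corresponding IOB tags out.
    tokens = text.split()
    return {"tokens": tokens, "tags": _predict_tags(tokens)}
```

Returning tokens and tags together keeps the Model API response self-describing: callers do not need to re-tokenise the input text to line the tags up.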

This project uses two additional Python packages that are not included in the Domino standard environments - plot-keras-history and keras-contrib. You can either customise a duplicate copy of the Domino Standard Environment or create a new environment with the Dockerfile instructions given in the README.md file of the project.