15  AI Workflows and MLOps: From Development to Deployment

15.1 Instructors

Ben Galewsky, Sr. Research Software Engineer, National Center for Supercomputing Applications (NCSA), University of Illinois Urbana-Champaign

15.2 Overview

Machine learning models have become a vital tool for most branches of science. The process and tools for training these models on the lab's desktop are often fragile, slow, and not reproducible. In this workshop, we will introduce the concept of MLOps, a set of practices that aims to streamline the process of developing, training, and deploying machine learning models. We will use the popular open-source MLOps tool MLflow to demonstrate how to track experiments, package code, and deploy models. We will also introduce Garden, a tool that allows researchers to publish ML models as citable objects.

15.3 Outline

  • Introduction to MLOps
  • Introduction to MLflow
  • Tracking experiments with MLflow
  • Packaging code with MLflow
  • Deploying models with MLflow
  • Publishing models with Garden

15.4 Three Challenges for ML in Research

  • Training productivity
  • Training reproducibility
  • Model citability

15.5 Introduction to MLOps

Machine Learning Operations (MLOps) is a set of practices that combines Machine Learning, DevOps, and Data Engineering to reliably and efficiently deploy and maintain ML models in production. Just as DevOps revolutionized software development by streamlining the bridge between development and operations, MLOps brings similar principles to machine learning systems.

Think of MLOps as the infrastructure and practices that transform ML projects from experimental notebooks into robust, production-ready systems that deliver real reproducible scientific value.

15.6 Why Researchers Need MLOps

As a researcher, you’ve likely experienced the following challenges:

  • Inability to harness computing resources to robustly search hyperparameter space
  • Difficulty reproducing results from six months ago
  • Retraining is so painful that you avoid building better models when new data becomes available
  • Tracking experiments becomes unwieldy as projects grow
  • Collaboration among researchers in your group is difficult
  • Advisors have little visibility into students’ work

MLOps addresses these pain points by providing:

15.6.1 1. Reproducibility

  • Version control for data, code, and models
  • Automated documentation of experiments
  • Containerization for consistent environments

15.6.2 2. Automation

  • Automated training pipelines
  • Continuous integration and deployment (CI/CD)
  • Automated testing and validation

15.6.3 3. Production Monitoring

  • Real-time performance monitoring
  • Data drift detection
  • Automated model retraining triggers

15.6.4 4. Governance

  • Model lineage tracking

MLOps isn’t just another buzzword—it’s a crucial evolution in how we develop and deploy ML systems. As models become more complex and requirements for reliability increase, MLOps practices become essential for successful ML implementations.

16 MLflow: A Comprehensive Platform for the ML Lifecycle

16.1 What is MLflow?

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. Created by Databricks, it provides a unified set of tools that address the core challenges in developing, training, and deploying machine learning models. MLflow is language-agnostic and can work with any ML library, algorithm, or deployment tool.

16.2 The PDG MLflow Tracking Server

The Permafrost Discovery Gateway project has a shared MLflow instance that you can use for your Arctic research:

https://pdg.mflow.software.ncsa.illinois.edu
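To log to this shared server from a notebook or script, point the MLflow client at it before starting any runs. A minimal sketch, where the experiment name is just an illustrative placeholder:

import mlflow

# Point the MLflow client at the shared PDG tracking server
mlflow.set_tracking_uri("https://pdg.mflow.software.ncsa.illinois.edu")

# Group your runs under a named experiment (the name here is only an example)
mlflow.set_experiment("rts-toy-model")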

16.3 Key Components of MLflow

MLflow consists of three main components: Tracking, Projects, and Models.

The tracking server of MLFlow allows researchers to log and compare model parameters, metrics, and artifacts across multiple runs. With MLFlow’s tracking features, users can record hyperparameters, evaluation metrics, model versions, and even source code, making it easier to reproduce results and collaborate with team members. The platform provides a user-friendly interface to visualize and compare different experiments, helping practitioners identify the most promising models and configurations. Additionally, MLFlow’s tracking capabilities integrate seamlessly with popular ML frameworks, enabling users to incorporate experiment logging into their existing workflows with minimal code changes. This comprehensive approach to tracking enhances model development efficiency and facilitates better decision-making throughout the machine learning process.

16.3.1 Key Concepts in Tracking:

  • Parameters: key-value inputs to your code
  • Metrics: numeric values (can update over time)
  • Artifacts: arbitrary files, including data, models and plots
  • Source: training code that ran
  • Version: version of the training code
  • Tags and Notes: any additional information
import mlflow

# Each run groups the parameters, metrics, and artifacts from one training attempt
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # hyperparameter input
    mlflow.log_metric("accuracy", 0.85)       # evaluation result
    mlflow.log_artifact("model.pkl")          # any output file, e.g. a saved model
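For the popular frameworks mentioned above, autologging can capture parameters, metrics, and the fitted model without explicit log calls. A hedged sketch with scikit-learn (the dataset and model choice are illustrative):

import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Autologging records hyperparameters, training metrics, and the model itself
mlflow.sklearn.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)                                 # logged automatically
    mlflow.log_metric("test_r2", model.score(X_test, y_test))   # explicit test metric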

MLFlow Projects provide a standardized format for packaging and organizing machine learning code to make it reproducible and reusable across different environments. A Project in MLFlow is essentially a directory containing code, dependencies, and an MLProject file that specifies the project’s entry points and environment requirements. This structure enables data scientists to share their work with teammates who can reliably execute the same code, regardless of their local setup. The MLProject file can define multiple entry points, each specifying its parameters and command to run, making it flexible for different use cases within the same project. MLFlow supports various environments for project execution, including conda environments, Docker containers, and system environments, ensuring consistency across different platforms. This standardization not only improves collaboration but also simplifies the deployment process, as projects can be easily versioned and moved between development and production environments.

16.3.2 Key Concepts in Projects

  • Packaging format for reproducible ML runs
    • Any code folder or GitHub repository
    • Optional MLproject file with project configuration
  • Defines dependencies for reproducibility
    • Conda (+ R, Docker, …) dependencies can be specified in MLproject
    • Reproducible in (almost) any environment
  • Execution API for running projects
    • CLI / Python / R / Java
    • Supports local and remote execution
name: myproject

python_env: python_env.yaml

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
    command: "python train.py --lr {learning_rate}"
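With an MLproject file like the one above, the execution API mentioned earlier can launch the project and override its parameters. A minimal sketch using the Python API (the parameter value is illustrative); the equivalent CLI is mlflow run . -P learning_rate=0.05.

import mlflow

# Launch the project in the current directory, overriding the declared parameter
submitted = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={"learning_rate": 0.05},
)
print(submitted.run_id)   # ID of the tracking run created for this execution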

The Models component of MLFlow provides a standardized way to package and deploy machine learning models across different platforms and serving environments. MLFlow’s model format includes all the code and dependencies required to run the model, making it highly portable and easy to share. The platform supports a variety of popular ML frameworks like scikit-learn, TensorFlow, and PyTorch, allowing models to be saved in a framework-agnostic format using the MLFlow Model Registry. This registry acts as a centralized repository where teams can version their models, transition them through different stages (like staging and production), and maintain a clear lineage of model iterations. MLFlow also provides built-in deployment capabilities to various serving platforms such as Kubernetes, Amazon SageMaker, and Azure ML, streamlining the process of moving models from experimentation to production. Additionally, MLFlow’s model serving feature allows for quick local deployment of models as REST endpoints, enabling easy testing and integration with other applications.

16.3.3 Key Concepts in Models

  • Packaging format for ML Models
    • Any directory with MLmodel file
  • Defines dependencies for reproducibility
    • Conda environment can be specified in MLmodel configuration
  • Model creation utilities
    • Save models from any framework in MLflow format
  • Deployment APIs
    • CLI / Python / R / Java
  • Model versioning and lifecycle
    • Model repository
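As a sketch of how these pieces fit together (the scikit-learn model here is illustrative), a model logged during a run can be reloaded later through the framework-agnostic pyfunc interface:

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    # Save the fitted model, with its dependencies, in MLflow's model format
    mlflow.sklearn.log_model(clf, artifact_path="model")

# Reload it later through the generic pyfunc interface
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:5]))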

16.4 Scaling Up Training With MLflow-Slurm

With your project defined in an MLproject file, it's easy to scale up your workflows by launching them onto a cluster.

There is a plugin for MLflow developed by NCSA called mlflow-slurm.

To use it, create a JSON file that tells the plugin how to configure Slurm jobs:

{
  "partition": "cpu",
  "account": "bbmi-delta-cpu",
  "mem": "16g",
  "modules": ["anaconda3_cpu"]
}

With this in place, you can launch a training run on your cluster with the following command:

mlflow run --backend slurm \
          --backend-config slurm_config.json \
          examples/sklearn_elasticnet_wine

16.5 How MLflow Solves Common MLOps Challenges

16.5.1 Training productivity

  • Track impact of hyperparameter and code changes on model quality
  • Run hyperparameter sweeps and find the best run (see the sketch below)
  • Move the same workflow from your desktop to a supercomputer
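A hedged sketch of finding the best run from a sweep with the tracking API; the experiment and metric names are assumptions:

import mlflow

# Return the single run with the highest accuracy as a pandas DataFrame row
best = mlflow.search_runs(
    experiment_names=["rts-toy-model"],
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)
print(best[["run_id", "params.learning_rate", "metrics.accuracy"]])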

16.5.2 Training reproducibility

  • Enforced use of reproducible runtime environments
  • Trace models back to specific runs

16.5.3 Model citability

  • Publish models to repository
  • Versioning and lifecycle events

16.6 Hands-On Tutorial

We will be using a GPU-powered JupyterHub provided by NCSA. Connection instructions will be provided in the classroom.

The example code is in the Cyber2A-RTS-ToyModel repo. It has a notebook along with some libraries to keep the notebook focused on the MLOps aspects of the code.

16.7 Sharing Models

Now that we have trained and validated a model, the first thing we will want to do is share it with other members of our research group.

16.7.1 MLflow Model Repository

Expert users within our research group will have access to the MLflow tracking server and model repository. You can test the model by loading it as an artifact from an existing run. Once you are satisfied with its performance, you can publish it to the MLflow model repository with the Register Model button on the tracking server.
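Registration can also be done programmatically instead of through the UI. A minimal sketch, where the run ID is a placeholder for one of your own runs and "rts" is an illustrative registered model name:

import mlflow

# Register the model artifact logged under "model" in an existing run
# ("<run_id>" is a placeholder; "rts" is an illustrative registered model name)
result = mlflow.register_model("runs:/<run_id>/model", "rts")
print(result.version)   # sequential version number assigned by the repository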

Published models are given sequential version numbers so colleagues can rely on a stable model for their research. Models in the repository can also follow a lifecycle with MLflow model aliases. Members of the research group who are not actively involved in model development may just want to use the current best model. The researcher who is training the model can decide which version others should use. MLflow allows you to pull down the model symbolically, by alias rather than by version number:

mlflow.pyfunc.load_model("models:/rts@prod")
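Behind that alias, the model developer points "prod" at a specific version through the client API. A sketch, assuming the registered name "rts" and an illustrative version number:

from mlflow import MlflowClient

client = MlflowClient()

# Point the "prod" alias at a specific registered version (version number is illustrative)
client.set_registered_model_alias(name="rts", alias="prod", version="3")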

16.7.2 Models as Citable Objects

Publishing your MLproject and training code to a Git repo and making the data publicly readable through a data repository is one way for others to reproduce your models. However, to make your work truly reusable, it is better to publish the weights of your trained model in a way that is findable, citable, and usable.

At a minimum, you should publish your model on Hugging Face. You can include a README and a notebook demonstrating how to use the model. Hugging Face allows you to mint a DOI that you can cite in your publications.

Here’s an example with the RTS model:

https://huggingface.co/bengal1/RTS/tree/main
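A minimal sketch of publishing weights and a README with the huggingface_hub package; the repository ID and local folder name are illustrative, and an access token (via huggingface-cli login or the HF_TOKEN environment variable) is assumed:

from huggingface_hub import HfApi

api = HfApi()

# Create the model repository if it does not exist yet (repo ID is illustrative)
api.create_repo(repo_id="your-username/RTS", repo_type="model", exist_ok=True)

# Upload a local folder containing the weights, README, and example notebook
api.upload_folder(
    folder_path="model_export",
    repo_id="your-username/RTS",
    repo_type="model",
)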

A new facility called Garden takes this a step further. Your model is hosted as a function-as-a-service endpoint, which allows anyone to perform inference with your model without needing to install anything.

Our example model is hosted at DOI 10.26311/5fb6-f950.

You can run a remote inference:

from garden_ai import GardenClient

# Connect to Garden and fetch the published garden by its DOI
garden_client = GardenClient()
rts_garden = garden_client.get_garden('10.26311/5fb6-f950')

# Run remote inference on a sample image from the Cyber2A-RTS-ToyModel repo
image_url = "https://github.com/cyber2a/Cyber2A-RTS-ToyModel/blob/main/data/images/valtest_yg_055.jpg?raw=true"
pred = rts_garden.identify_rts(image_url)
