15  AI Workflows and MLOps: From Development to Deployment

15.1 Instructors

Ben Galewsky, Sr. Research Software Engineer, National Center for Supercomputing Applications (NCSA), University of Illinois Urbana-Champaign

15.2 Overview

Machine learning models have become a vital tool for most branches of science. The process and tools for training these models on the lab's desktop are often fragile, slow, and not reproducible. In this workshop, we will introduce the concept of MLOps, a set of practices that aims to streamline the process of developing, training, and deploying machine learning models. We will use the popular open-source MLOps tool MLflow to demonstrate how to track experiments, package code, and deploy models. We will also introduce Garden, a tool that allows researchers to publish ML models as citable objects.

15.3 Outline

  • Introduction to MLOps
  • Introduction to MLflow
  • Tracking experiments with MLflow
  • Packaging code with MLflow
  • Deploying models with MLflow
  • Publishing models with Garden

15.4 Three Challenges for ML in Research

  • Training productivity
  • Training reproducibility
  • Model citability

15.5 Introduction to MLOps

Machine Learning Operations (MLOps) is a set of practices that combines Machine Learning, DevOps, and Data Engineering to reliably and efficiently deploy and maintain ML models in production. Just as DevOps revolutionized software development by streamlining the bridge between development and operations, MLOps brings similar principles to machine learning systems.

Think of MLOps as the infrastructure and practices that transform ML projects from experimental notebooks into robust, production-ready systems that deliver real reproducible scientific value.

15.6 Why Researchers Need MLOps

As a researcher, you’ve likely experienced the following challenges:

  • Inability to harness computing resources to robustly search hyperparameter space
  • Difficulty reproducing results from six months ago
  • Retraining is so painful that you avoid building better models when new data becomes available
  • Tracking experiments becomes unwieldy as projects grow
  • Collaboration among researchers in your group is difficult
  • Advisors have little visibility into students’ work

MLOps addresses these pain points by providing:

15.6.1 1. Reproducibility

  • Version control for data, code, and models
  • Automated documentation of experiments
  • Containerization for consistent environments

15.6.2 2. Automation

  • Automated training pipelines
  • Continuous integration and deployment (CI/CD)
  • Automated testing and validation

15.6.3 3. Production Monitoring

  • Real-time performance monitoring
  • Data drift detection
  • Automated model retraining triggers

15.6.4 4. Governance

  • Model lineage tracking

MLOps isn’t just another buzzword—it’s a crucial evolution in how we develop and deploy ML systems. As models become more complex and requirements for reliability increase, MLOps practices become essential for successful ML implementations.

16 MLflow: A Comprehensive Platform for the ML Lifecycle

16.1 What is MLflow?

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. Created by Databricks, it provides a unified set of tools that address the core challenges in developing, training, and deploying machine learning models. MLflow is language-agnostic and can work with any ML library, algorithm, or deployment tool.

16.2 The PDG MLflow Tracking Server

The Permafrost Discovery Gateway project has a shared MLflow instance that you can use for your Arctic research:

https://pdg.mflow.software.ncsa.illinois.edu
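To log to this shared server from a notebook or script, point the MLflow client at it before starting any runs. A minimal sketch, where the experiment name is just an illustrative placeholder:

import mlflow

# Point the MLflow client at the shared PDG tracking server
mlflow.set_tracking_uri("https://pdg.mflow.software.ncsa.illinois.edu")

# Group your runs under a named experiment (the name here is only an example)
mlflow.set_experiment("rts-toy-model")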

16.3 Key Components of MLflow

MLflow consists of three main components: Tracking, Projects, and Models.

The tracking server of MLFlow allows researchers to log and compare model parameters, metrics, and artifacts across multiple runs. With MLFlow’s tracking features, users can record hyperparameters, evaluation metrics, model versions, and even source code, making it easier to reproduce results and collaborate with team members. The platform provides a user-friendly interface to visualize and compare different experiments, helping practitioners identify the most promising models and configurations. Additionally, MLFlow’s tracking capabilities integrate seamlessly with popular ML frameworks, enabling users to incorporate experiment logging into their existing workflows with minimal code changes. This comprehensive approach to tracking enhances model development efficiency and facilitates better decision-making throughout the machine learning process.

16.3.1 Key Concepts in Tracking:

  • Parameters: key-value inputs to your code
  • Metrics: numeric values (can update over time)
  • Artifacts: arbitrary files, including data, models and plots
  • Source: training code that ran
  • Version: version of the training code
  • Tags and Notes: any additional information
import mlflow

# Each run groups the parameters, metrics, and artifacts from one training attempt
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # hyperparameter input
    mlflow.log_metric("accuracy", 0.85)       # evaluation result
    mlflow.log_artifact("model.pkl")          # any output file, e.g. a saved model
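For the popular frameworks mentioned above, autologging can capture parameters, metrics, and the fitted model without explicit log calls. A hedged sketch with scikit-learn (the dataset and model choice are illustrative):

import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Autologging records hyperparameters, training metrics, and the model itself
mlflow.sklearn.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)                                 # logged automatically
    mlflow.log_metric("test_r2", model.score(X_test, y_test))   # explicit test metric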

MLFlow Projects provide a standardized format for packaging and organizing machine learning code to make it reproducible and reusable across different environments. A Project in MLFlow is essentially a directory containing code, dependencies, and an MLProject file that specifies the project’s entry points and environment requirements. This structure enables data scientists to share their work with teammates who can reliably execute the same code, regardless of their local setup. The MLProject file can define multiple entry points, each specifying its parameters and command to run, making it flexible for different use cases within the same project. MLFlow supports various environments for project execution, including conda environments, Docker containers, and system environments, ensuring consistency across different platforms. This standardization not only improves collaboration but also simplifies the deployment process, as projects can be easily versioned and moved between development and production environments.

16.3.2 Key Concepts in Projects

  • Packaging format for reproducible ML runs
    • Any code folder or GitHub repository
    • Optional MLproject file with project configuration
  • Defines dependencies for reproducibility
    • Conda (+ R, Docker, …) dependencies can be specified in MLproject
    • Reproducible in (almost) any environment
  • Execution API for running projects
    • CLI / Python / R / Java
    • Supports local and remote execution
name: myproject

python_env: python_env.yaml

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
    command: "python train.py --lr {learning_rate}"
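With an MLproject file like the one above, the execution API mentioned earlier can launch the project and override its parameters. A minimal sketch using the Python API (the parameter value is illustrative); the equivalent CLI is mlflow run . -P learning_rate=0.05.

import mlflow

# Launch the project in the current directory, overriding the declared parameter
submitted = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={"learning_rate": 0.05},
)
print(submitted.run_id)   # ID of the tracking run created for this execution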

The Models component of MLFlow provides a standardized way to package and deploy machine learning models across different platforms and serving environments. MLFlow’s model format includes all the code and dependencies required to run the model, making it highly portable and easy to share. The platform supports a variety of popular ML frameworks like scikit-learn, TensorFlow, and PyTorch, allowing models to be saved in a framework-agnostic format using the MLFlow Model Registry. This registry acts as a centralized repository where teams can version their models, transition them through different stages (like staging and production), and maintain a clear lineage of model iterations. MLFlow also provides built-in deployment capabilities to various serving platforms such as Kubernetes, Amazon SageMaker, and Azure ML, streamlining the process of moving models from experimentation to production. Additionally, MLFlow’s model serving feature allows for quick local deployment of models as REST endpoints, enabling easy testing and integration with other applications.

16.3.3 Key Concepts in Models

  • Packaging format for ML Models
    • Any directory with MLmodel file
  • Defines dependencies for reproducibility
    • Conda environment can be specified in MLmodel configuration
  • Model creation utilities
    • Save models from any framework in MLflow format
  • Deployment APIs
    • CLI / Python / R / Java
  • Model versioning and lifecycle
    • Model repository
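As a sketch of how these pieces fit together (the scikit-learn model here is illustrative), a model logged during a run can be reloaded later through the framework-agnostic pyfunc interface:

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    # Save the fitted model, with its dependencies, in MLflow's model format
    mlflow.sklearn.log_model(clf, artifact_path="model")

# Reload it later through the generic pyfunc interface
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:5]))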

16.4 Scaling Up Training With MLflow-Slurm

With your project defined in an MLproject file, it's easy to scale up your workflows by launching them onto a cluster.

There is a plugin for MLflow developed by NCSA called mlflow-slurm.

To use it, create a JSON file that tells the plugin how to configure Slurm jobs:

{
  "partition": "cpu",
  "account": "bbmi-delta-cpu",
  "mem": "16g",
  "modules": ["anaconda3_cpu"]
}

With this in place, you can launch a training run on your cluster with the following command:

mlflow run --backend slurm \
          --backend-config slurm_config.json \
          examples/sklearn_elasticnet_wine

16.5 How MLflow Solves Common MLOps Challenges

16.5.1 Training productivity

  • Track impact of hyperparameter and code changes on model quality
  • Run hyperparameter sweeps and find the best run (see the sketch below)
  • Move the same workflow from your desktop to a supercomputer
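A hedged sketch of finding the best run from a sweep with the tracking API; the experiment and metric names are assumptions:

import mlflow

# Return the single run with the highest accuracy as a pandas DataFrame row
best = mlflow.search_runs(
    experiment_names=["rts-toy-model"],
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)
print(best[["run_id", "params.learning_rate", "metrics.accuracy"]])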

16.5.2 Training reproducibility

  • Enforced use of reproducible runtime environments
  • Trace models back to specific runs

16.5.3 Model citability

  • Publish models to repository
  • Versioning and lifecycle events

16.6 Hands-On Tutorial

We will be using a GPU-powered JupyterHub provided by NCSA. Connection instructions will be provided in the classroom.

The example code is in the Cyber2A-RTS-ToyModel repo. It has a notebook along with some libraries to keep the notebook focused on the MLOps aspects of the code.

16.7 Sharing Models

Now that we have trained and validated a model, the first thing we will want to do is share it with other members of our research group.

16.7.1 MLflow Model Repository

Expert users within our research group will have access to the MLflow tracking server and model repository. You can test the model by loading it as an artifact from an existing run. Once you are satisfied with its performance, you can publish it to the MLflow model repository with the Register Model button on the tracking server.
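Registration can also be done programmatically instead of through the UI. A minimal sketch, where the run ID is a placeholder for one of your own runs and "rts" is an illustrative registered model name:

import mlflow

# Register the model artifact logged under "model" in an existing run
# ("<run_id>" is a placeholder; "rts" is an illustrative registered model name)
result = mlflow.register_model("runs:/<run_id>/model", "rts")
print(result.version)   # sequential version number assigned by the repository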

Published models are given sequential version numbers so colleagues can rely on a stable model for their research. Models in the repository can also follow a lifecycle with MLflow model aliases. Members of the research group who are not actively involved in model development may just want to use the current best model. The researcher who is training the model can decide which version others should use. MLflow allows you to pull down the model symbolically, by alias rather than by version number:

mlflow.pyfunc.load_model("models:/rts@prod")
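Behind that alias, the model developer points "prod" at a specific version through the client API. A sketch, assuming the registered name "rts" and an illustrative version number:

from mlflow import MlflowClient

client = MlflowClient()

# Point the "prod" alias at a specific registered version (version number is illustrative)
client.set_registered_model_alias(name="rts", alias="prod", version="3")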

16.7.2 Models as Citable Objects

Publishing your MLproject and training code to a Git repo and making the data publicly readable through a data repository is one way for others to reproduce your models. However, to make your work truly reusable, it is better to publish the weights of your trained model in a way that is findable, citable, and usable.

At a minimum, you should publish your model on Hugging Face. You can include a README and a notebook demonstrating how to use the model. Hugging Face allows you to mint a DOI that you can cite in your publications.

Here’s an example with the RTS model:

https://huggingface.co/bengal1/RTS/tree/main
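A minimal sketch of publishing weights and a README with the huggingface_hub package; the repository ID and local folder name are illustrative, and an access token (via huggingface-cli login or the HF_TOKEN environment variable) is assumed:

from huggingface_hub import HfApi

api = HfApi()

# Create the model repository if it does not exist yet (repo ID is illustrative)
api.create_repo(repo_id="your-username/RTS", repo_type="model", exist_ok=True)

# Upload a local folder containing the weights, README, and example notebook
api.upload_folder(
    folder_path="model_export",
    repo_id="your-username/RTS",
    repo_type="model",
)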

A new facility called Garden takes this a step further. Your model is hosted as a function-as-a-service endpoint, which allows anyone to perform inference with your model without needing to install anything.

Our example model is hosted at DOI 10.26311/5fb6-f950.

You can run a remote inference:

from garden_ai import GardenClient

# Connect to Garden and fetch the published garden by its DOI
garden_client = GardenClient()
rts_garden = garden_client.get_garden('10.26311/5fb6-f950')

# Run remote inference on a sample image from the Cyber2A-RTS-ToyModel repo
image_url = "https://github.com/cyber2a/Cyber2A-RTS-ToyModel/blob/main/data/images/valtest_yg_055.jpg?raw=true"
pred = rts_garden.identify_rts(image_url)
