15 AI Workflows and MLOps: From Development to Deployment
15.1 Instructors
Ben Galewsky, Sr. Research Software Engineer, National Center for Supercomputing Applications (NCSA), University of Illinois Urbana-Champaign
15.2 Overview
Machine learning models have become a vital tool for most branches of science. The process and tools for training these models on a lab desktop are often fragile, slow, and not reproducible. In this workshop, we will introduce the concept of MLOps, a set of practices that aims to streamline the process of developing, training, and deploying machine learning models. We will use the popular open-source MLOps tool MLflow to demonstrate how to track experiments, package code, and deploy models. We will also introduce Garden, a tool that allows researchers to publish ML models as citable objects.
15.3 Outline
- Introduction to MLOps
- Introduction to MLflow
- Tracking experiments with MLflow
- Packaging code with MLflow
- Deploying models with MLflow
- Publishing models with Garden
15.4 Three Challenges for ML in Research
- Training productivity
- Training reproducibility
- Model citability
15.5 Introduction to MLOps
Machine Learning Operations (MLOps) is a set of practices that combines Machine Learning, DevOps, and Data Engineering to reliably and efficiently deploy and maintain ML models in production. Just as DevOps revolutionized software development by streamlining the bridge between development and operations, MLOps brings similar principles to machine learning systems.
Think of MLOps as the infrastructure and practices that transform ML projects from experimental notebooks into robust, production-ready systems that deliver real, reproducible scientific value.
15.6 Why Researchers Need MLOps
As a researcher, you’ve likely experienced the following challenges:
- Inability to harness computing resources to robustly search hyperparameter space
- Difficulty reproducing results from six months ago
- Retraining is so painful that building better models once new data becomes available is rarely considered
- Tracking experiments becomes unwieldy as projects grow
- Collaboration among researchers in your group is difficult
- Advisors have little visibility into students’ work
MLOps addresses these pain points by providing:
15.6.1 1. Reproducibility
- Version control for data, code, and models
- Automated documentation of experiments
- Containerization for consistent environments
15.6.2 2. Automation
- Automated training pipelines
- Continuous integration and deployment (CI/CD)
- Automated testing and validation
15.6.3 3. Production Monitoring
- Real-time performance monitoring
- Data drift detection
- Automated model retraining triggers
15.6.4 4. Governance
- Model lineage tracking
MLOps isn’t just another buzzword—it’s a crucial evolution in how we develop and deploy ML systems. As models become more complex and requirements for reliability increase, MLOps practices become essential for successful ML implementations.
16 MLflow: A Comprehensive Platform for the ML Lifecycle
16.1 What is MLflow?
MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. Created by Databricks, it provides a unified set of tools that address the core challenges in developing, training, and deploying machine learning models. MLflow is language-agnostic and can work with any ML library, algorithm, or deployment tool.
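To make this concrete, here is a minimal sketch of experiment tracking with MLflow's Python API; the experiment name, parameters, and metric values are illustrative placeholders, not part of the workshop code.

import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    # Record the configuration of this training run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 10)

    # ... train your model here ...

    # Record the resulting quality metric
    mlflow.log_metric("rmse", 0.42)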
16.2 The PDG MLflow Tracking Server
The Permafrost Discovery Gateway (PDG) project has a shared MLflow instance that you can use for your Arctic research.
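Pointing your code at a shared tracking server is a one-line change. A sketch is shown below; the URL is a placeholder, so substitute the PDG server address provided in the workshop.

import mlflow

# Placeholder URL; replace with the PDG tracking server address
mlflow.set_tracking_uri("https://mlflow.example.org")
mlflow.set_experiment("my-arctic-experiment")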
16.3 Key Components of MLflow
MLflow consists of three main components:
- Tracking: log and query experiments, including code, parameters, and results
- Projects: package ML code so runs are reproducible
- Models: deploy trained models to a variety of serving environments
16.4 Scaling Up Training With MLflow-Slurm
With your project defined in an MLproject file, it is easy to scale up your workflows by launching them onto a cluster.
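For reference, an MLproject file is a small YAML file at the root of your repository that names the project, its environment, and its entry points. A minimal sketch, modeled on MLflow's sklearn_elasticnet_wine example (parameter names and defaults are illustrative):

name: sklearn_elasticnet_wine
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"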
There is a plugin for MLflow, developed by NCSA, called mlflow-slurm.
To use it, you create a JSON file that tells the plugin how to configure Slurm jobs:
{
  "partition": "cpu",
  "account": "bbmi-delta-cpu",
  "mem": "16g",
  "modules": ["anaconda3_cpu"]
}
With this in place, you can launch a training run on your cluster with the command:
mlflow run --backend slurm \
  --backend-config slurm_config.json \
  examples/sklearn_elasticnet_wine
16.5 How MLflow Solves Common MLOps Challenges
16.5.1 Training productivity
- Track the impact of hyperparameter and code changes on model quality
- Run hyperparameter sweeps and find the best run (see the sketch below)
- Switch between desktop and supercomputer
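As a sketch of the "find the best run" step, the MLflow client can query all runs of an experiment and sort them by a logged metric; the experiment, parameter, and metric names below are illustrative.

import mlflow

# Return the single run with the lowest logged RMSE as a pandas DataFrame
best = mlflow.search_runs(
    experiment_names=["demo-experiment"],
    order_by=["metrics.rmse ASC"],
    max_results=1,
)
print(best[["run_id", "params.alpha", "metrics.rmse"]])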
16.5.2 Training reproducibility
- Enforced use of reproducible runtime environments
- Trace models back to specific runs
16.5.3 Model citability
- Publish models to a model registry (see the sketch below)
- Versioning and lifecycle events
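As a sketch, registering a model from a finished run takes a single call, assuming the model was logged during the run under the artifact path "model"; the run ID and registry name below are placeholders.

import mlflow

run_id = "abc123"  # placeholder: the ID of a finished training run
result = mlflow.register_model(f"runs:/{run_id}/model", "rts-toy-model")
print(result.name, result.version)  # each registration creates a new version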
16.6 Hands-On Tutorial
We will be using a GPU-powered JupyterHub provided by NCSA. Connection instructions will be provided in the classroom.
The example code is in the Cyber2A-RTS-ToyModel repo. It contains a notebook along with supporting libraries that keep the notebook focused on the MLOps aspects of the code.