AI/ML on Kubernetes: Deploying Models with Pulumi on Google Cloud
Kubernetes has transformed cloud infrastructure by enabling scalable, containerized applications. While it initially gained traction for managing web applications and microservices, its capabilities now extend to AI/ML workloads, making it the go-to platform for data scientists and machine learning engineers.
Running AI/ML workloads on Kubernetes presents unique challenges, including:
- Specialized hardware requirements (e.g., GPUs, TPUs)
- Scalability for model training and inference
- Complex data pipelines that integrate various cloud services
- Infrastructure automation for seamless deployment
Google Kubernetes Engine (GKE) provides a robust foundation for AI/ML workloads, but managing infrastructure manually can be cumbersome. This is where Pulumi comes in, enabling Infrastructure as Code (IaC) to automate and simplify AI/ML infrastructure on Kubernetes.
Pulumi: Automating AI/ML Infrastructure on Google Cloud
Pulumi is a modern Infrastructure as Code (IaC) tool that allows teams to define and manage cloud infrastructure using general-purpose programming languages like Python, TypeScript, and Go. This approach is particularly beneficial for AI/ML teams, as Python is already the dominant language in data science and machine learning.
With Pulumi, you can:
- Provision and scale Kubernetes clusters on Google Cloud automatically.
- Define AI/ML environments as code, making deployments repeatable and version-controlled.
- Integrate infrastructure with machine learning pipelines, reducing operational overhead.
Deploying AI/ML Workloads on Kubernetes
Below, we explore two use cases for running AI/ML workloads on Google Kubernetes Engine (GKE) using Pulumi.
Use Case 1: Deploying a Large Language Model (LLM) with Retrieval Augmented Generation (RAG)
Large Language Models (LLMs) like GPT-3, along with other foundation models such as Whisper and DALL-E, require significant infrastructure for training and inference. Retrieval Augmented Generation (RAG) enhances LLMs by integrating external knowledge sources at query time, improving the accuracy and relevance of responses.
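To make the RAG pattern concrete, here is a minimal, self-contained Python sketch of the retrieval step; the hashed bag-of-words embedding and in-memory document list are toy stand-ins for a real embedding model and vector store:

import numpy as np

# Toy corpus standing in for an external knowledge source.
DOCUMENTS = [
    "Pulumi lets you define cloud infrastructure in Python.",
    "GKE node pools can attach GPUs for ML workloads.",
    "RAG retrieves relevant documents and feeds them to the LLM.",
]

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: hashed bag-of-words.
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: float(np.dot(embed(d), q)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Prepend the retrieved context so the LLM can ground its answer.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How do I run ML workloads on GKE?"))

In production, the embedding model, document store, and LLM each run as their own services on the cluster, which is exactly the kind of multi-component deployment Pulumi automates.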
Using Pulumi, you can automate the deployment of an open-source LLM with RAG on Kubernetes.
Step 1: Set Up a Kubernetes Cluster on Google Cloud
import pulumi
import pulumi_gcp as gcp

# Create a GKE cluster for AI/ML workloads
cluster = gcp.container.Cluster("ml-cluster",
    location="us-central1",
    initial_node_count=3,
    node_version="1.23",
    min_master_version="1.23")

# Create a node pool optimized for ML workloads
node_pool = gcp.container.NodePool("ml-node-pool",
    cluster=cluster.name,
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="n1-standard-4",
        oauth_scopes=[
            "https://www.googleapis.com/auth/devstorage.read_only",
            "https://www.googleapis.com/auth/logging.write",
        ],
        labels={"team": "ml"},
        shielded_instance_config=gcp.container.NodePoolNodeConfigShieldedInstanceConfigArgs(
            enable_secure_boot=True,
            enable_integrity_monitoring=True,
        ),
    ),
    initial_node_count=2,
    location="us-central1",
    version="1.23")
...
This Pulumi script:
- Provisions a Kubernetes cluster on Google Cloud.
- Creates a dedicated node pool optimized for AI/ML workloads.
- Ensures security best practices for machine learning environments.
Step 2: Deploy the LLM with RAG Model
Once the cluster is set up, we can deploy the LLM and its RAG components using Pulumi’s Kubernetes provider (a minimal sketch follows this list):
- Define Kubernetes Deployments for the LLM and RAG model.
- Package models as Docker containers and deploy them to the cluster.
- Configure Kubernetes Services to expose APIs for model inference.
- Use ConfigMaps and Secrets to manage parameters and credentials.
By defining these resources in Pulumi, deployments become fully automated, repeatable, and scalable.
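As a minimal sketch, assuming a hypothetical llm-rag container image that serves inference on port 8080, the Deployment and Service might look like this with Pulumi’s Kubernetes provider:

import pulumi
import pulumi_kubernetes as k8s

app_labels = {"app": "llm-rag"}

# Deployment running the containerized LLM + RAG inference server.
deployment = k8s.apps.v1.Deployment("llm-rag",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=2,
        selector=k8s.meta.v1.LabelSelectorArgs(match_labels=app_labels),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(labels=app_labels),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[k8s.core.v1.ContainerArgs(
                    name="llm-rag",
                    # Placeholder image; build and push your own model container.
                    image="us-docker.pkg.dev/my-project/ml/llm-rag:latest",
                    ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                )],
            ),
        ),
    ))

# Service exposing the inference API outside the cluster.
service = k8s.core.v1.Service("llm-rag-svc",
    spec=k8s.core.v1.ServiceSpecArgs(
        selector=app_labels,
        type="LoadBalancer",
        ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=8080)],
    ))

pulumi.export("inference_ip",
    service.status.apply(lambda s: s.load_balancer.ingress[0].ip))

A LoadBalancer Service is the simplest way to expose the API for a demo; in production you would typically front it with an Ingress or Gateway and add TLS.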
Use Case 2: Training and Serving Custom Machine Learning Models
Beyond pre-trained LLMs, Kubernetes is ideal for training and serving custom AI/ML models. Pulumi can help automate every stage of the ML lifecycle.
Step 1: Set Up the Model Training Environment
Using Pulumi, we define a training environment with (see the sketch after this list):
- A GPU-enabled Kubernetes Deployment or Job for training workloads.
- Persistent Volume Claims (PVCs) for storing training data and model artifacts.
- Monitoring tools for tracking performance.
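Continuing from the ml-cluster defined in the first use case, a sketch of the GPU node pool and storage might look like this; the NVIDIA T4 accelerator and 100Gi volume are assumptions to adjust for your workload:

import pulumi_gcp as gcp
import pulumi_kubernetes as k8s

# GPU-enabled node pool for training jobs.
gpu_pool = gcp.container.NodePool("training-gpu-pool",
    cluster=cluster.name,
    location="us-central1",
    initial_node_count=1,
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="n1-standard-8",
        guest_accelerators=[gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
            type="nvidia-tesla-t4",
            count=1,
        )],
        oauth_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    ))

# PVC for training data and model artifacts, reused across training runs.
training_pvc = k8s.core.v1.PersistentVolumeClaim("training-data",
    spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
        access_modes=["ReadWriteOnce"],
        resources={"requests": {"storage": "100Gi"}},
    ))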
Step 2: Deploy and Serve the Trained Model
Once the model is trained, Pulumi can be used to:
- Deploy the trained model as a Kubernetes Deployment.
- Expose the model via a Kubernetes Service (REST API or gRPC).
- Add autoscaling rules for dynamic inference scaling, as sketched below.
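For the autoscaling piece, a HorizontalPodAutoscaler targeting a hypothetical model-server Deployment could be declared like this; CPU utilization is the simplest signal, while GPU- or latency-based scaling would require a custom metrics adapter:

import pulumi_kubernetes as k8s

# Scale the inference Deployment between 1 and 10 replicas on CPU load.
hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler("model-hpa",
    spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name="model-server",  # hypothetical serving Deployment
        ),
        min_replicas=1,
        max_replicas=10,
        metrics=[k8s.autoscaling.v2.MetricSpecArgs(
            type="Resource",
            resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                name="cpu",
                target=k8s.autoscaling.v2.MetricTargetArgs(
                    type="Utilization",
                    average_utilization=70,
                ),
            ),
        )],
    ))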
Pulumi allows teams to manage the entire AI/ML pipeline in a structured and automated way.
Try Jay’s demo code on Creating an AI Training Platform on GKE with Pulumi.
Why Use Pulumi for AI/ML on Kubernetes?
Pulumi provides several advantages for AI/ML teams running workloads on Kubernetes:
1. Use General-Purpose Languages for Infrastructure as Code
- Most AI/ML engineers already work with Python or Go, and Pulumi lets them manage infrastructure using the same language.
- Less hand-written YAML and fewer raw Kubernetes manifests; resources are defined programmatically in a familiar language.
2. Automate AI/ML Workflows
- Define infrastructure, training jobs, and model serving in one unified IaC framework.
- Ensure consistency across development, staging, and production environments.
3. Improve Scalability and Cost Efficiency
- Pulumi can provision Google Cloud AI infrastructure, such as GPU and TPU node pools, alongside the cluster, matching compute to each workload.
- Automate autoscaling and resource allocation for AI/ML workloads.
4. Increase Security and Compliance
- Manage credentials and secrets securely with Pulumi ESC (Environments, Secrets, and Configuration), as sketched below.
- Apply policy-as-code to enforce security best practices.
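As a small sketch of the secrets flow: a value set with pulumi config set --secret (or supplied through a Pulumi ESC environment) stays encrypted in Pulumi state and can be projected into a Kubernetes Secret; apiKey here is a hypothetical config key:

import pulumi
import pulumi_kubernetes as k8s

config = pulumi.Config()
# Encrypted at rest in Pulumi state; never stored in plaintext.
api_key = config.require_secret("apiKey")

# Project the secret into the cluster for workloads to consume.
model_secret = k8s.core.v1.Secret("model-credentials",
    string_data={"API_KEY": api_key})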
Get Started with AI/ML on Kubernetes with Pulumi
Pulumi makes it easy to deploy, scale, and manage AI/ML workloads on Kubernetes, leveraging Google Cloud’s AI infrastructure. Whether you’re serving LLMs, training custom models, or automating ML pipelines, Pulumi provides a developer-friendly, scalable, and secure solution.
- Explore AI/ML Projects using Pulumi
- Discover Essential Kubernetes Best Practices
- Get Started with Pulumi on Google Cloud
- Sign up for Pulumi ➡️
By combining Kubernetes, Google Cloud, and Pulumi, you can accelerate AI/ML innovation while reducing infrastructure complexity.