Auto-Scaling AI Workloads with Kubernetes Horizontal Pod Autoscaler
The Kubernetes Horizontal Pod Autoscaler (HPA) allows you to automatically scale the number of pods in a replication controller, deployment, replica set, or stateful set based on observed CPU utilization or other selected metrics. This is particularly useful for workloads such as AI tasks, which are computationally expensive and may have resource requirements that vary over time.
In Pulumi, you declare resources as objects in a Python program. For managing Kubernetes resources, Pulumi provides an SDK that maps directly to Kubernetes API objects. To use an HPA, you typically define a Kubernetes `Deployment` for your AI workload and then a `HorizontalPodAutoscaler` resource that manages scaling for that deployment.

Below is a Pulumi program written in Python that demonstrates how to deploy an AI workload and configure auto-scaling using the HPA. It uses the `pulumi_kubernetes` Python package.

```python
import pulumi
import pulumi_kubernetes as k8s

# Define the Kubernetes Deployment for the AI workload
ai_workload = k8s.apps.v1.Deployment(
    "ai-workload",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        selector=k8s.meta.v1.LabelSelectorArgs(match_labels={"app": "ai-workload"}),
        replicas=1,
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(labels={"app": "ai-workload"}),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="ai-container",
                        image="gcr.io/my-ai-project/ai-service:latest",
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={"cpu": "500m", "memory": "500Mi"},
                            limits={"cpu": "1000m", "memory": "1000Mi"},
                        ),
                    )
                ]
            ),
        ),
    ),
)

# Define the Horizontal Pod Autoscaler that targets the AI workload Deployment
ai_workload_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "ai-workload-hpa",
    spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ai_workload.metadata["name"],
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=80,
    ),
)

# Export the name of the workload and the HPA
pulumi.export("workload_name", ai_workload.metadata["name"])
pulumi.export("hpa_name", ai_workload_hpa.metadata["name"])
```
What this program does:

- It defines a Kubernetes `Deployment` called `ai_workload`. The deployment specifies the desired state of your AI service: the number of replicas, the container image to use, and the resource requests and limits.
- It then defines a `HorizontalPodAutoscaler` called `ai_workload_hpa`. The HPA automatically adjusts the number of pods as needed to maintain an average CPU utilization of 80% across all pods, scaling the deployment's replicas between a minimum of 1 and a maximum of 10.
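Under the hood, the HPA controller follows a simple proportional rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured range. A minimal sketch of that calculation, using the limits from the program above (the real controller also applies a tolerance band and stabilization windows, omitted here):

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float = 80,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Sketch of the HPA proportional scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# With 4 pods averaging 160% CPU against an 80% target, the HPA doubles the pods.
print(desired_replicas(4, 160))  # -> 8
# At 40% average utilization, it scales down to half.
print(desired_replicas(4, 40))   # -> 2
# The result never leaves the [min_replicas, max_replicas] range.
print(desired_replicas(10, 400)) # -> 10
```

This is why setting realistic CPU `requests` matters: utilization is measured relative to the requested CPU, so an inflated request makes the workload look underutilized and delays scale-up.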
The `min_replicas` and `max_replicas` fields set the scaling range, and `target_cpu_utilization_percentage` sets the average CPU utilization that triggers scaling.

The exported values at the end are useful for confirming the names of the deployed resources; they can also serve as identifiers in other Pulumi stacks or when querying the Kubernetes cluster about these resources.
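If your cluster serves the `autoscaling/v2` API, the same autoscaler can be expressed with the metrics-based spec, which also opens the door to memory and custom metrics. A sketch under that assumption, reusing the `ai_workload` deployment from the main program (class and field names follow the v2 API as exposed by `pulumi_kubernetes`):

```python
import pulumi_kubernetes as k8s

# autoscaling/v2 equivalent of the v1 HPA above; targets 80% average CPU.
# Assumes `ai_workload` is the Deployment defined in the main program.
ai_workload_hpa_v2 = k8s.autoscaling.v2.HorizontalPodAutoscaler(
    "ai-workload-hpa-v2",
    spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ai_workload.metadata["name"],
        ),
        min_replicas=1,
        max_replicas=10,
        metrics=[
            k8s.autoscaling.v2.MetricSpecArgs(
                type="Resource",
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name="cpu",
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type="Utilization",
                        average_utilization=80,
                    ),
                ),
            )
        ],
    ),
)
```

The v1 form shown in the main program remains the simplest option when CPU utilization is the only signal you care about.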
Remember to adjust the container `image`, `requests`, and `limits` to fit your actual AI workload requirements, and to tune `min_replicas`, `max_replicas`, and `target_cpu_utilization_percentage` to match your scaling preferences and expected load pattern.

When you run the program with Pulumi, these resources are deployed to your Kubernetes cluster. Pulumi also prints the outputs after the run, showing the `workload_name` and `hpa_name`, which are useful for tracking and managing your resources.

For further reading on the HorizontalPodAutoscaler and how to configure it, please refer to the HorizontalPodAutoscaler v1 documentation.
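Once the stack is up, you can observe the autoscaler from both sides. A sketch of the typical workflow (commands only; output depends on your cluster, and Pulumi appends a random suffix to the resource names):

```shell
# Preview and deploy the stack
pulumi up

# Read back the exported names from the stack
pulumi stack output workload_name
pulumi stack output hpa_name

# Watch the HPA's observed vs. target CPU and current replica count
kubectl get hpa --watch
```

Note that the HPA needs a source of pod metrics (typically the metrics-server add-on) to report CPU utilization; without it, `kubectl get hpa` shows the current metric as unknown.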