Kubernetes Batch for Large Scale Simulation Workloads
When dealing with large-scale simulation workloads in Kubernetes, you'll typically use the `Job` resource to manage parallel processing of a set of independent but similar work items. Kubernetes Jobs are well-suited for batch processing workloads: you create multiple Pods that run your simulation until a specified number of them terminate successfully.

Here's how one might set this up in Pulumi using Python:
- Define a `Job` resource, which runs a given number of Pods in parallel (defined by `parallelism`) until a target number complete successfully (`completions`).
- Each Pod executes the simulation task (this could be, for instance, simulation software or a custom application).
- Once the specified number of Pods have successfully completed, the Job is considered complete.
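Conceptually, the Job controller keeps Pods running until either enough of them succeed or too many fail. The standalone Python sketch below (not part of the Pulumi program) models that success/failure rule in a simplified way; the real controller also applies exponential back-off between retries:

```python
def job_outcome(pod_results, completions=5, backoff_limit=2):
    """Toy model of Kubernetes Job semantics with restartPolicy=Never.

    pod_results is the ordered list of Pod outcomes (True = succeeded).
    The Job completes once `completions` Pods succeed, and fails once the
    number of failed Pods exceeds `backoff_limit`.
    """
    succeeded = failed = 0
    for ok in pod_results:
        if ok:
            succeeded += 1
        else:
            failed += 1
        if succeeded >= completions:
            return "Complete"
        if failed > backoff_limit:
            return "Failed"
    return "Running"  # not enough outcomes yet to decide

print(job_outcome([True] * 5))             # Complete
print(job_outcome([False, False, False]))  # Failed
```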
Here is a Python program using Pulumi that creates a Kubernetes `Job` resource. This program assumes that you have a Docker image `simulation-worker:latest` which contains your simulation application.

```python
import pulumi
import pulumi_kubernetes as kubernetes

# Define a Kubernetes Job that spins up "parallelism" Pods to process a
# simulation workload. Each Pod processes a workload until "completions"
# Pods have run successfully to completion.
simulation_job = kubernetes.batch.v1.Job(
    "simulation-job",
    spec=kubernetes.batch.v1.JobSpecArgs(
        # Parallelism determines how many Pods the Job should run in parallel.
        parallelism=5,
        # Completions specifies the number of Pods that should complete successfully.
        completions=5,
        template=kubernetes.core.v1.PodTemplateSpecArgs(
            spec=kubernetes.core.v1.PodSpecArgs(
                containers=[kubernetes.core.v1.ContainerArgs(
                    # The name of the container within the Pod.
                    name="simulation-container",
                    # The Docker image to run. Replace with your simulation application image.
                    image="simulation-worker:latest",
                    # Command to run within the container; modify accordingly.
                    command=["/app/simulate"],
                    # Arguments for the command; modify accordingly.
                    args=["--mode=batch"],
                )],
                # "Never" ensures a Pod does not restart once it completes or fails.
                restart_policy="Never",
            ),
        ),
        # Optionally define a backoff limit for how many times to retry the Job
        # before considering it failed.
        backoff_limit=2,
    ),
    # Metadata for the Job, such as labels or annotations.
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="simulation-job",
        labels={"app": "simulation"},
    ),
)

# Export the Job name
pulumi.export("job_name", simulation_job.metadata["name"])
```
Explanation:
- We've defined a `Job` called `simulation-job` that manages the execution of our simulation Pods.
- `parallelism` is set to 5, which means up to 5 Pods will be running simulations in parallel at any given time.
- We have also specified 5 `completions`, meaning Kubernetes will ensure that exactly 5 Pods have processed their simulations successfully before the Job is marked as complete.
- The `template` defines the Pod that will be created as part of this Job; it uses the `simulation-worker:latest` Docker image stored in a container registry and executes the command `/app/simulate --mode=batch`.
- The `restart_policy` is set to `Never`, ensuring that completed simulations don't restart.
- We set a `backoff_limit` to specify the number of retries if a Pod fails.
- Labels have been added to the metadata to help identify the resources associated with this Job.
To deploy this Pulumi program:
- Save this Python code in a file named `__main__.py`.
- Ensure you have `pulumi_kubernetes` installed in your Python environment (usually installed via `pip install pulumi_kubernetes`).
- Run `pulumi up` to preview and deploy the changes.
- Once complete, you should see the resources created, including the name of the Job, which is exported at the end.