Kubernetes Batch for Large Scale Simulation Workloads
When dealing with large-scale simulation workloads in Kubernetes, you'll typically use the `Job` resource to manage parallel processing of a set of independent but similar work items. Kubernetes Jobs are well-suited for batch processing workloads: you create multiple Pods that run your simulation until a specified number of them terminate successfully.

Here's how one might set this up in Pulumi using Python:
- Define a `Job` resource, which runs a given number of Pods in parallel (defined by `parallelism`) until a target number complete successfully (`completions`).
- Each Pod executes the simulation task (this could be, for instance, simulation software or a custom application).
- Once the specified number of Pods have successfully completed, the Job is considered complete.
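Conceptually, the Job controller keeps Pods running until either enough of them succeed or too many fail. The standalone Python sketch below (not part of the Pulumi program) models that success/failure rule in a simplified way; the real controller also applies exponential back-off between retries:

```python
def job_outcome(pod_results, completions=5, backoff_limit=2):
    """Toy model of Kubernetes Job semantics with restartPolicy=Never.

    pod_results is the ordered list of Pod outcomes (True = succeeded).
    The Job completes once `completions` Pods succeed, and fails once the
    number of failed Pods exceeds `backoff_limit`.
    """
    succeeded = failed = 0
    for ok in pod_results:
        if ok:
            succeeded += 1
        else:
            failed += 1
        if succeeded >= completions:
            return "Complete"
        if failed > backoff_limit:
            return "Failed"
    return "Running"  # not enough outcomes yet to decide

print(job_outcome([True] * 5))             # Complete
print(job_outcome([False, False, False]))  # Failed
```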
Here is a Python program using Pulumi that creates a Kubernetes `Job` resource. This program assumes that you have a Docker image `simulation-worker:latest` which contains your simulation application.

```python
import pulumi
import pulumi_kubernetes as kubernetes

# Define a Kubernetes Job that spins up "parallelism" Pods to process a
# simulation workload. Each Pod processes a workload until "completions"
# Pods have run successfully to completion.
simulation_job = kubernetes.batch.v1.Job(
    "simulation-job",
    spec=kubernetes.batch.v1.JobSpecArgs(
        # Parallelism determines how many Pods the Job should run in parallel.
        parallelism=5,
        # Completions specifies the number of Pods that should complete successfully.
        completions=5,
        template=kubernetes.core.v1.PodTemplateSpecArgs(
            spec=kubernetes.core.v1.PodSpecArgs(
                containers=[kubernetes.core.v1.ContainerArgs(
                    # The name of the container within the Pod.
                    name="simulation-container",
                    # The Docker image to run. Replace with your simulation application image.
                    image="simulation-worker:latest",
                    # Command to run within the container; modify accordingly.
                    command=["/app/simulate"],
                    # Arguments for the command; modify accordingly.
                    args=["--mode=batch"],
                )],
                # "Never" ensures a Pod does not restart once it completes or fails.
                restart_policy="Never",
            ),
        ),
        # Optionally define a backoff limit for how many times to retry the Job
        # before considering it failed.
        backoff_limit=2,
    ),
    # Metadata for the Job, such as labels or annotations.
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="simulation-job",
        labels={"app": "simulation"},
    ),
)

# Export the Job name
pulumi.export("job_name", simulation_job.metadata["name"])
```
Explanation:
- We've defined a `Job` called `simulation-job` that manages the execution of our simulation Pods.
- `parallelism` is set to 5, which means up to 5 Pods will be running simulations in parallel at any given time.
- We have also specified 5 `completions`, meaning Kubernetes will ensure that exactly 5 Pods have processed their simulations successfully before the Job is marked as complete.
- The `template` defines the Pod that will be created as part of this Job; it uses the `simulation-worker:latest` Docker image stored in a container registry and executes the command `/app/simulate --mode=batch`.
- The `restart_policy` is set to `Never`, ensuring that completed simulations don't restart.
- We set a `backoff_limit` to specify the number of retries if a Pod fails.
- Labels have been added to the metadata to help identify the resources associated with this Job.
To deploy this Pulumi program:
- Save this Python code in a file named `__main__.py`.
- Ensure you have `pulumi_kubernetes` installed in your Python environment (usually installed via `pip install pulumi_kubernetes`).
- Run `pulumi up` to preview and deploy the changes.
- Once complete, you should see the resources created, including the name of the Job, which is exported at the end.