Orchestrating BigQuery ML Workflows with GCP Dataform
Orchestrating BigQuery ML workflows can be achieved with GCP's Dataform, which lets you write SQL workflows for BigQuery and manage them as code. Machine learning teams that work with large datasets commonly use Dataform to create and manage their data transformations in BigQuery.
With Pulumi's infrastructure as code, you can define and deploy your BigQuery ML workflows along with other cloud infrastructure. Below is a Pulumi program in Python that demonstrates how to create a Dataform repository for orchestrating BigQuery ML workflows.
In the following program, we use these resources:

- The `pulumi_gcp` package, which contains the GCP resources we want to use.
- `gcp.dataform.Repository`, a resource for creating a Dataform repository that will host our SQL workflows for BigQuery.
- `gcp.dataform.RepositoryWorkflowConfig`, a resource for configuring the workflow within our Dataform repository, allowing us to specify schedules and settings for our ML workflows.
Let's look at the code:
```python
import pulumi
import pulumi_gcp as gcp

# Initialize your GCP project and region; replace these with your own identifiers.
gcp_project = 'my-gcp-project'
gcp_region = 'us-central1'  # Choose a region that makes sense for your scenario

# Define a Dataform repository linked to the Git repository holding your SQL workflows.
dataform_repository = gcp.dataform.Repository("my-dataform-repository",
    project=gcp_project,
    region=gcp_region,
    git_remote_settings=gcp.dataform.RepositoryGitRemoteSettingsArgs(
        url="https://github.com/my-org/my-dataform-repo.git",  # Your Git repository URL
        authentication_token_secret_version="projects/my-gcp-project/secrets/my-secret/versions/latest",
        default_branch="master",  # Default branch of your Git repository
    ))

# Define a release config, which compiles the repository's code from a Git commitish.
dataform_release_config = gcp.dataform.RepositoryReleaseConfig("my-release-config",
    project=gcp_project,
    region=gcp_region,
    repository=dataform_repository.name,
    name="my-release-config",
    git_commitish="master")  # Branch, tag, or commit to compile

# Define a Dataform workflow config to orchestrate BigQuery ML workflows.
# This includes settings like the schedule and which tagged actions to run.
dataform_workflow_config = gcp.dataform.RepositoryWorkflowConfig("my-dataform-workflow-config",
    project=gcp_project,
    region=gcp_region,
    repository=dataform_repository.name,
    name="my-workflow-config",
    release_config=dataform_release_config.id,  # Release config whose compiled results to invoke
    invocation_config=gcp.dataform.RepositoryWorkflowConfigInvocationConfigArgs(
        included_tags=["my_ml_workflow"],  # Tags identifying the actions to run within this workflow
    ),
    cron_schedule="0 9 * * *",  # Schedule using cron syntax, here every day at 9 AM UTC
    time_zone="UTC")            # Time zone for the schedule

# Export the Dataform repository's Git URL to access it later.
pulumi.export('dataform_repository_url', dataform_repository.git_remote_settings.url)
```
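The `authentication_token_secret_version` reference assumes a Secret Manager secret already exists. As a sketch, you could provision it in the same program; the secret id, token placeholder, and project number below are hypothetical, and the replication syntax assumes a recent pulumi_gcp version (older versions use `automatic=True` instead of the `auto` block):

```python
import pulumi_gcp as gcp

# Sketch: provision the Git token secret that the Dataform repository references.
git_token_secret = gcp.secretmanager.Secret("my-secret",
    secret_id="my-secret",
    replication=gcp.secretmanager.SecretReplicationArgs(
        auto=gcp.secretmanager.SecretReplicationAutoArgs(),
    ))

git_token_version = gcp.secretmanager.SecretVersion("my-secret-version",
    secret=git_token_secret.id,
    secret_data="REPLACE_WITH_GIT_TOKEN")  # Prefer a Pulumi secret config value in practice

# Dataform reads the secret via its service agent, which needs the accessor role.
# The agent has the form service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com.
secret_access = gcp.secretmanager.SecretIamMember("dataform-secret-access",
    secret_id=git_token_secret.secret_id,
    role="roles/secretmanager.secretAccessor",
    member="serviceAccount:service-1234567890@gcp-sa-dataform.iam.gserviceaccount.com")
```

This fragment only declares cloud resources, so it takes effect during a `pulumi up` deployment rather than running standalone.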
Here's what's going on in the code:
- We create a Dataform repository by defining `gcp.dataform.Repository`. This repository is linked to a Git repository which contains your Dataform SQL scripts.
- In `git_remote_settings`, you must set `url` to the Git repository you wish to connect with Dataform. The `authentication_token_secret_version` is a reference to a secret in GCP's Secret Manager that holds the Git authentication token. The `default_branch` is the name of the branch to use by default when running Dataform operations.
- Next, we create a `gcp.dataform.RepositoryWorkflowConfig` resource that defines how the Dataform project should behave. This includes the schedule in `cron_schedule`, the time zone for the schedule, and the tags for the Dataform actions you want to run as part of the workflow.
- The workflow config's `name` assigns a unique identifier to the configuration, while `repository` connects it to the previously defined Dataform repository.
- The `release_config` field points the workflow at a `gcp.dataform.RepositoryReleaseConfig`, which determines the Git branch, tag, or commit whose compiled results the scheduled workflow invokes.
- Finally, we export the Dataform repository URL so that you can access the Git repository directly from your Pulumi stack's outputs.
This Pulumi program will set up a Dataform repository and workflow configuration which can be used to orchestrate your BigQuery ML workflows. All the SQL scripts for defining and transforming your data for machine learning can be version controlled in your Git repository, and scheduled through Dataform to run in BigQuery.