Data Lakes on Databricks with External Storage Mounting
To build a data lake on Databricks with external storage mounting, you'll need to create a Databricks workspace and then configure mounts that link Databricks to external storage systems such as an S3 bucket or Azure Data Lake Storage (ADLS). Mounts allow the Databricks file system (dbfs) to access data stored in external storage seamlessly.

First, we'll set up a Databricks workspace where data processing and analytics will take place. Then we'll create a mount point to an S3 bucket, which serves as our example of external storage. This enables the data lake to read and write data in S3 as if it were a local filesystem.
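To make that "seamless" access concrete, here is a minimal sketch of how mounted data is used from a Databricks notebook. It assumes the mount point is named data-lake (matching the mount created in the program below); the raw/ and curated/ paths and the events dataset are purely illustrative.

```python
# Runs inside a Databricks notebook, where `spark`, `dbutils`, and `display`
# are provided automatically. The mount point and paths are assumptions.

# List files in the mounted S3 bucket as if they were local DBFS files.
display(dbutils.fs.ls("/mnt/data-lake/raw/"))

# Read a Parquet dataset directly from the mount with Spark...
events = spark.read.parquet("/mnt/data-lake/raw/events/")

# ...and write curated results back to the same bucket through the mount.
(events
    .filter("event_type = 'purchase'")
    .write.mode("overwrite")
    .parquet("/mnt/data-lake/curated/purchases/"))
```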
Here is how you can do this using the Pulumi Python SDK:
- Create a Databricks workspace: A workspace is your environment for accessing all of Databricks' features.
- Mount S3 bucket: This gives the workspace access to your S3 data lake storage.
- Cluster & notebook: Create a Databricks cluster to process data (the mount operation also runs against a cluster) and, optionally, a notebook to write your analytics code.
Let's write a Pulumi program to achieve this:
```python
import json

import pulumi
import pulumi_aws as aws
import pulumi_databricks as databricks

# Note: ensure your AWS provider and your Databricks providers (account-level
# for workspace creation, workspace-level for clusters and mounts) are
# configured before running this program.

cfg = pulumi.Config()

# Provision a new Databricks workspace on AWS. Workspace creation goes through
# the account-level "MWS" resources; the credentials and storage configuration
# referenced here are assumed to already exist (see databricks.MwsCredentials
# and databricks.MwsStorageConfigurations).
databricks_workspace = databricks.MwsWorkspaces("my-databricks-workspace",
    account_id=cfg.require("databricksAccountId"),
    workspace_name="my-databricks-workspace",
    aws_region=aws.config.region,
    credentials_id=cfg.require("databricksCredentialsId"),
    storage_configuration_id=cfg.require("databricksStorageConfigurationId"),
    pricing_tier="PREMIUM",  # e.g. "STANDARD", "PREMIUM", or "ENTERPRISE"
)

# Create an S3 bucket to be used as the data lake storage.
data_lake_bucket = aws.s3.Bucket("data-lake-bucket",
    acl="private",
    tags={
        "Purpose": "Databricks Data Lake Storage",
    },
)

# IAM role that the Databricks cluster's EC2 instances assume (via an instance
# profile) in order to reach the bucket.
s3_access_role = aws.iam.Role("s3-access-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Policy granting the role read/write access to the data lake bucket only.
s3_access_policy = aws.iam.RolePolicy("s3-access-policy",
    role=s3_access_role.name,
    policy=data_lake_bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
            ],
            "Resource": [arn, f"{arn}/*"],
        }],
    })),
)

# Wrap the role in an instance profile and register it with Databricks so that
# clusters can attach it.
instance_profile = aws.iam.InstanceProfile("s3-access-instance-profile",
    role=s3_access_role.name,
)
databricks_instance_profile = databricks.InstanceProfile("s3-access-instance-profile",
    instance_profile_arn=instance_profile.arn,
)

# A small cluster that carries the instance profile; the mount operation runs
# on this cluster. Adjust the runtime and node type to your workspace/region.
databricks_cluster = databricks.Cluster("data-lake-cluster",
    cluster_name="data-lake-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=1,
    autotermination_minutes=20,
    aws_attributes=databricks.ClusterAwsAttributesArgs(
        instance_profile_arn=databricks_instance_profile.id,
    ),
)

# Mount the S3 bucket into DBFS at /mnt/data-lake.
s3_mount = databricks.Mount("s3-mount",
    name="data-lake",
    cluster_id=databricks_cluster.id,
    s3=databricks.MountS3Args(
        bucket_name=data_lake_bucket.bucket,
    ),
)

# Outputs
pulumi.export("databricksWorkspaceUrl", databricks_workspace.workspace_url)
pulumi.export("dataLakeBucket", data_lake_bucket.bucket)
```
In the above program, we create a Databricks workspace and an AWS S3 bucket. We then set up an IAM role and instance profile that the cluster's EC2 instances use to reach the bucket, attach a role policy with the necessary S3 permissions, register the instance profile with Databricks, and mount the S3 bucket to the workspace using the databricks.Mount resource. The program is structured so that its outputs provide the URL for accessing the Databricks workspace and the name of the S3 bucket, which is now part of your data lake and mounted to Databricks.

Remember that each pulumi.export surfaces its value after the Pulumi program runs, so you can easily retrieve the workspace URL or the S3 bucket name. Replace the placeholder configuration values with ones appropriate to your AWS account and Databricks setup, and make sure your cloud providers are configured before running the program. For more detailed documentation on each resource, refer to the Pulumi registry pages for the aws and databricks providers.
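If you would rather configure the workspace-level Databricks provider explicitly in code instead of relying on stack configuration alone, a minimal sketch looks like the following. The databricksHost and databricksToken config keys are assumptions for illustration; in practice the token should be stored as a Pulumi secret.

```python
import pulumi
import pulumi_databricks as databricks

# Explicit workspace-level Databricks provider. The config keys below are
# illustrative; set them with, e.g.:
#   pulumi config set databricksHost https://dbc-xxxxxxxx-xxxx.cloud.databricks.com
#   pulumi config set --secret databricksToken <personal-access-token>
cfg = pulumi.Config()

databricks_provider = databricks.Provider("workspace-provider",
    host=cfg.require("databricksHost"),
    token=cfg.require_secret("databricksToken"),
)

# Workspace-level resources (clusters, mounts, instance profile registrations)
# can then opt in to this provider explicitly:
cluster = databricks.Cluster("example-cluster",
    cluster_name="example-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=1,
    autotermination_minutes=20,
    opts=pulumi.ResourceOptions(provider=databricks_provider),
)
```

Passing the provider through pulumi.ResourceOptions keeps workspace-level resources cleanly separated from any account-level Databricks provider you use for workspace creation.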