Low-Latency LLM Inference with Edge Locations in US-East-1
To achieve low-latency large language model (LLM) inference with edge locations in the AWS US-East-1 region, you would typically set up an inference endpoint that can serve the model's predictions efficiently. AWS provides services like AWS Lambda and Amazon SageMaker to host the inference logic, which can be brought closer to end users with the help of Amazon CloudFront's edge locations. This usually involves setting up an API Gateway endpoint that triggers a Lambda function, which in turn interacts with the SageMaker endpoint hosting the deployed model.
Let's implement infrastructure that supports this using Pulumi with AWS services.
Below is a Pulumi program written in Python that demonstrates how to deploy a SageMaker endpoint for model inference, a Lambda function for preprocessing and postprocessing, and an API Gateway to expose the Lambda function. The request flow would be: Client → API Gateway → Lambda (preprocessing) → SageMaker (inference) → Lambda (postprocessing) → Client.
This program does the following:
- Creates an Amazon SageMaker model, which points to the pre-trained model data.
- Deploys the model to a SageMaker endpoint configuration with an instance type chosen for inference.
- Creates the SageMaker endpoint where inference requests can be sent.
- Sets up an AWS Lambda function that handles incoming requests and postprocesses the responses. To reduce latency further, similar pre/post-processing logic can be replicated to edge locations with Lambda@Edge behind a CloudFront distribution.
- Initializes an API Gateway to trigger the Lambda function.
Pulumi Program for Low-Latency Inference
```python
import pulumi
import pulumi_aws as aws

# Create a SageMaker model that points to the pre-trained model data.
sagemaker_model = aws.sagemaker.Model(
    "llmModel",
    execution_role_arn="arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001",  # Replace with your SageMaker execution role ARN
    primary_container={
        "image": "174872318107.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",  # Replace with the inference image for your LLM
        "model_data_url": "s3://my-bucket/pretrained-llm-model-data",      # Path to your pre-trained model data
    },
)

# Deploy the model to a SageMaker endpoint configuration with an instance type chosen for inference.
endpoint_config = aws.sagemaker.EndpointConfiguration(
    "llmEndpointConfig",
    production_variants=[{
        "instance_type": "ml.m5.large",
        "model_name": sagemaker_model.name,
        "variant_name": "VariantOne",
        "initial_instance_count": 1,
    }],
)

# Create the SageMaker endpoint where inference requests can be sent.
sagemaker_endpoint = aws.sagemaker.Endpoint(
    "llmEndpoint",
    endpoint_config_name=endpoint_config.name,
)

# Create a Lambda function that preprocesses requests, invokes the SageMaker
# endpoint, and postprocesses the responses.
lambda_function = aws.lambda_.Function(
    "inferenceLambdaFunction",
    code=pulumi.AssetArchive({
        "inference_lambda_handler.py": pulumi.FileAsset("inference_lambda_handler.py"),
    }),
    role="arn:aws:iam::123456789012:role/lambda_execution_role",  # Replace with your Lambda execution role ARN
    handler="inference_lambda_handler.handler",
    runtime="python3.11",
    environment={
        "variables": {
            "SAGEMAKER_ENDPOINT_NAME": sagemaker_endpoint.name,
        },
    },
    timeout=10,  # Provide adequate timeout for processing and SageMaker communication
)

# Use API Gateway (HTTP API) to expose the Lambda function as an HTTP endpoint.
api_gateway = aws.apigatewayv2.Api(
    "inferenceApi",
    protocol_type="HTTP",
    route_key="POST /invoke-llm",
    target=lambda_function.invoke_arn,
)

# Allow API Gateway to invoke the Lambda function.
api_permission = aws.lambda_.Permission(
    "apiGatewayInvokePermission",
    action="lambda:InvokeFunction",
    function=lambda_function.name,
    principal="apigateway.amazonaws.com",
    source_arn=api_gateway.execution_arn.apply(lambda arn: f"{arn}/*/*"),
)

# Export the API Gateway endpoint URL.
pulumi.export("api_url", api_gateway.api_endpoint)
```
In this program, replace the placeholders such as execution role ARNs and model data URLs with valid AWS resources that correspond to your deployment.
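If you would rather not hard-code those values, one option (purely illustrative; the config key names below are arbitrary) is to read them from Pulumi configuration:

```python
import pulumi

config = pulumi.Config()

# Supplied per stack, e.g. `pulumi config set sagemakerRoleArn arn:aws:iam::123456789012:role/...`
sagemaker_role_arn = config.require("sagemakerRoleArn")
model_data_url = config.require("modelDataUrl")
inference_image = config.require("inferenceImage")
```

These variables can then replace the literal strings passed to `aws.sagemaker.Model` above.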
Before running this Pulumi program, you would first need to create an AWS Lambda execution role that grants your Lambda function the necessary permissions to invoke SageMaker endpoints and log to CloudWatch Logs. Likewise, you should set up the SageMaker execution role to have access to the specified S3 bucket and necessary policies to run SageMaker jobs.
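If you prefer to manage those roles in the same program, a minimal sketch of the Lambda execution role might look like the following; the resource names and the decision to allow `sagemaker:InvokeEndpoint` on all resources are illustrative assumptions and should be tightened for production:

```python
import json
import pulumi_aws as aws

# Execution role that the Lambda function assumes at runtime.
lambda_role = aws.iam.Role(
    "inferenceLambdaRole",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Basic CloudWatch Logs permissions for the function.
aws.iam.RolePolicyAttachment(
    "inferenceLambdaLogs",
    role=lambda_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)

# Inline policy allowing the function to invoke the SageMaker endpoint.
aws.iam.RolePolicy(
    "inferenceLambdaInvokeSageMaker",
    role=lambda_role.id,
    policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "*",  # Scope to your endpoint ARN in production
        }],
    }),
)
```

You could then pass `lambda_role.arn` as the `role` argument of the Lambda function instead of the hard-coded ARN.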
Place the logic that handles the incoming request and invokes the SageMaker endpoint in an `inference_lambda_handler.py` file located alongside your Pulumi program; a minimal handler sketch appears after the steps below.

To execute this Pulumi program:
- Install Pulumi and configure your AWS credentials.
- Create a new directory, save the above program as `__main__.py`, and place your Lambda handler code in `inference_lambda_handler.py`.
- Run `pulumi up` in the directory to create the resources.
- Once the deployment completes, test the API Gateway endpoint by sending inference requests.
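As a rough illustration of what `inference_lambda_handler.py` might contain, here is a minimal sketch that forwards the request body to the SageMaker endpoint through the SageMaker Runtime API and returns the model output. The JSON content type and payload shape are assumptions; they must match whatever your model container actually expects.

```python
import os

import boto3

# SageMaker Runtime client used to invoke the deployed endpoint.
sagemaker_runtime = boto3.client("sagemaker-runtime")


def handler(event, context):
    # API Gateway (HTTP API) delivers the request body as a string.
    payload = event.get("body") or "{}"

    # Forward the payload to the SageMaker endpoint named in the environment.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT_NAME"],
        ContentType="application/json",  # Assumed; match your model container
        Body=payload,
    )

    # Postprocess: return the model output to the caller as JSON.
    result = response["Body"].read().decode("utf-8")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": result,
    }
```

With the stack deployed, a test request could look like `curl -X POST "$API_URL/invoke-llm" -d '{"inputs": "Hello"}'`, where `$API_URL` is the exported `api_url` value and the request body format depends on your model.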