SageMaker Serverless Inference using BYOC

TABLE OF CONTENTS

1. Introduction
2. Docker Image
3. SageMaker
3.1. Boto3
3.2. CDK
4. Final Words
5. Resources

Introduction

As we already know, SageMaker covers most of the ML lifecycle: creating, training, deploying, and optimizing models. You can use built-in algorithms and models, browse AWS Marketplace to find specific model packages, or simply create your own model, train it using SageMaker, and deploy it. Everything is streamlined and organized from start to finish.

However, in some circumstances we want a completely custom solution. The idea is to bring our own packages and models, i.e. BYOC (Bring Your Own Container). To achieve this, we could:

Extend a prebuilt SageMaker container image - SageMaker provides containers for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, and PyTorch
Adapt an existing container image - Modify an existing Docker image to enable training and inference using SageMaker

In this article we will focus on deploying our own inference code by adapting a Docker image that contains our production-ready model. Additionally, we will deploy it as a serverless inference endpoint, which means that we don't have to configure or manage the underlying infrastructure and we only pay for the compute capacity used to process inference requests.¹

To do this we will:

Create a Docker image and configure it for SageMaker inference
Push the image to ECR
Create a SageMaker model based on the Docker image
Configure a SageMaker endpoint
Deploy the SageMaker endpoint

There are two ways to do this through code: boto3 and CDK - we will cover both.

We will go through setting up the server and invocation endpoints for one model. If you are interested in a premade solution for hosting multi-model servers, please see Amazon SageMaker Multi-Model Endpoints using your own algorithm container.

Docker Image

Behind the scenes, SageMaker makes extensive use of Docker containers. All the built-in algorithms and the supported deep learning frameworks used for training and inference are stored in containers. The benefit of this approach is that it allows us to scale quickly and reliably. Consequently, there are certain rules that we have to respect when we implement our own containers:

For model inference, SageMaker runs the container as

docker run <image> serve

This overrides default CMD statements in a container.

Containers need to implement a web server that responds to /invocations and /ping on port 8080
To get a result from the model, the client sends a POST request to the SageMaker endpoint. SageMaker forwards the request to the container's /invocations route and returns the result to the client
A customer's model containers must respond to requests within 60 seconds
SageMaker sends periodic GET requests to the /ping endpoint. The response can be just HTTP 200 status with an empty body

See the details at Use Your Own Inference Code with Hosting Services
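
Before pushing the image, the requirements above can be checked against a container running locally. The following is a minimal sketch that uses only the Python standard library. The function name check_container, the default URL, and the payload shape are assumptions made for illustration, and the container is assumed to have been started with something like docker run -p 8080:8080 <image> serve.

```python
import json
import urllib.request


def check_container(base_url="http://localhost:8080", payload=None, timeout=60):
    """Call /ping and /invocations and return (ping_status, parsed_result)."""
    # Health check: SageMaker expects GET /ping to answer with HTTP 200.
    with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
        ping_status = resp.status

    # Inference: POST /invocations with a JSON body (the shape is an assumption).
    body = json.dumps(payload if payload is not None else {"input": []}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return ping_status, result
```

A non-2xx answer from either route raises urllib.error.HTTPError, so a failing health check is hard to miss.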

To implement our container and satisfy these requirements, we will use Nginx and gunicorn. The idea is to create a simple Flask application, set up a WSGI server using gunicorn, and then use Nginx as a reverse proxy.

The structure looks like this:

root/
├─ model
│   ├─ nginx.conf       Contains the configuration for reverse-proxy
│   ├─ predictor.py     Contains the Flask application
│   ├─ serve            Starts the Nginx and WSGI
│   └─ wsgi.py          Defines the WSGI application
├─ Dockerfile           Defines the Docker image configuration

This structure can also be found in the amazon-sagemaker-examples GitHub repository provided by AWS.

The reverse proxy to gunicorn is configured in nginx.conf. The serve script starts both gunicorn and the Nginx reverse-proxy server.
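
To make the role of serve more concrete, here is a rough sketch of how such a script could be written in Python. It is not the original script. The socket path /tmp/gunicorn.sock, the MODEL_SERVER_WORKERS environment variable, the gevent worker class, and the function names are assumptions. The nginx.conf is assumed to proxy to that same socket and to contain daemon off; so that Nginx stays in the foreground.

```python
import multiprocessing
import os
import subprocess
import time


def build_commands(workers=None, timeout=60):
    """Return the (nginx, gunicorn) command lines as argument lists."""
    # Assumed convention: the worker count can be overridden through the environment.
    workers = workers or int(
        os.environ.get("MODEL_SERVER_WORKERS", multiprocessing.cpu_count())
    )
    nginx_cmd = ["nginx", "-c", "/opt/program/nginx.conf"]
    gunicorn_cmd = [
        "gunicorn",
        "--timeout", str(timeout),
        "--worker-class", "gevent",
        "--bind", "unix:/tmp/gunicorn.sock",  # must match the upstream in nginx.conf
        "--workers", str(workers),
        "wsgi:app",  # the "app" object defined in wsgi.py
    ]
    return nginx_cmd, gunicorn_cmd


def start_server():
    """Start both processes and stop the remaining one when either exits."""
    nginx_cmd, gunicorn_cmd = build_commands()
    processes = [subprocess.Popen(nginx_cmd), subprocess.Popen(gunicorn_cmd)]
    try:
        # Poll until one of the two children has exited.
        while all(p.poll() is None for p in processes):
            time.sleep(1)
    finally:
        for p in processes:
            if p.poll() is None:
                p.terminate()

# The real serve file would start with a Python shebang and end by calling start_server().
```

Keeping the command construction in its own function makes it easy to inspect what would be launched without starting any process.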

The predictor.py contains the endpoint logic. The GET should check if the model is loaded and configured properly:

import flask

app = flask.Flask(__name__)


@app.route('/ping', methods=['GET'])
def ping():
    # Check if the model was loaded correctly; is_model_ready() is your own helper
    health = is_model_ready()
    status = 200 if health else 404
    return flask.Response(response='\n', status=status, mimetype='application/json')

Next, we define the POST request for /invocations. This part of the code should implement your custom model predictions:

import json  # belongs with the other imports at the top of predictor.py


@app.route('/invocations', methods=['POST'])
def transformation():
    # Process input
    input_json = flask.request.get_json()
    data = input_json['input']

    # Custom model
    result = custom_model.predict(data)

    # Return value
    resultjson = json.dumps(result)
    return flask.Response(response=resultjson, status=200, mimetype='application/json')
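
The handlers above rely on is_model_ready() and custom_model, which are left to the reader. One possible way to provide them is sketched below, assuming the model is a pickled object baked into the image. The path /opt/program/model.pkl and the ModelHolder class are hypothetical; adapt the loading code to whatever format your model uses.

```python
import pickle
from pathlib import Path

# Hypothetical location of a pickled model copied into the image.
MODEL_PATH = Path("/opt/program/model.pkl")


class ModelHolder:
    """Loads the model once per worker process and caches it."""

    _model = None

    @classmethod
    def get(cls, path=MODEL_PATH):
        # Load lazily so that a missing or corrupt file is reported by /ping
        # instead of crashing the worker at import time.
        if cls._model is None:
            try:
                with open(path, "rb") as handle:
                    cls._model = pickle.load(handle)
            except (OSError, EOFError, pickle.UnpicklingError):
                cls._model = None
        return cls._model


def is_model_ready(path=MODEL_PATH):
    """Return True when the model could be loaded; used by ping()."""
    return ModelHolder.get(path) is not None
```

With this in place, the /invocations handler could obtain the model through ModelHolder.get() instead of a module-level custom_model.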

In order to build a Docker image, we define the Dockerfile:

FROM python:3.8

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python3 \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
    pip install flask gevent gunicorn && \
        rm -rf /root/.cache


# Install all dependencies for your custom model

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY model /opt/program
WORKDIR /opt/program

Note that the Dockerfile should contain the commands that install all the dependencies needed for the custom model.

Finally, we have to build the image and push it to ECR. To do this, we can use a simple bash script:

model_name=<model-name>

account=$(aws sts get-caller-identity --query Account --output text)
region=<region>
registry="${account}.dkr.ecr.${region}.amazonaws.com"
fullname="${registry}/${model_name}:latest"
chmod +x model/serve

# Log in to the ECR registry host
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${registry}

docker build -t ${model_name} .
docker tag ${model_name} ${fullname}
docker push ${fullname}

This script is a simple version of build_and_push.sh that is provided in the official AWS GitHub repository.

SageMaker

Boto3

Once we have the image in ECR, we can create a SageMaker model. We will do this using the SageMaker Boto3 client. One of the parameters of the create_model() method is ExecutionRoleArn, which means that we have to create an IAM role beforehand or obtain one with get_execution_role() from the SageMaker Python SDK; please see SageMaker Roles.

import boto3

sm_client = boto3.client(service_name='sagemaker')

def create_model():
    role_arn = "<role-arn>"
    # ECR image URI: <account-id>.dkr.ecr.<region>.amazonaws.com/<image-name>:<tag>
    image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(
        "<account-id>", "<region>", "<image-name>"
    )
    create_model_response = sm_client.create_model(
        ModelName="<model-name>",
        ExecutionRoleArn=role_arn,
        Containers=[{"Image": image}],
    )
    print(create_model_response)

If everything went well, you should see the model in the SageMaker/Models console.
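
Since the image URI is easy to mistype, it can help to build it in one place. The helper below is a small sketch; the function name ecr_image_uri is an assumption, and the format shown applies to the standard AWS partition (other partitions, such as the China regions, use a different domain suffix).

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the URI of an image stored in a private ECR repository."""
    # AWS account IDs are always 12 digits; fail early on obvious mistakes.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

The result can be passed directly as the "Image" value in the Containers argument of create_model().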

The next step is to define an endpoint configuration. This step is crucial because it specifies which model we want to host and which resources are used to host it. In other words, we are configuring a ProductionVariant, which accepts many arguments for defining instance types, how to distribute traffic among multiple models, etc. However, we are only interested in ServerlessConfig.

def create_endpoint_configuration():
    create_endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName="<endpoint-config-name>",
        ProductionVariants=[
            {
                "ModelName": "<model-name>",
                "VariantName": "<variant-name>",
                "ServerlessConfig": {
                    "MemorySizeInMB": 2048,
                    "MaxConcurrency": 1,
                },
            }
        ],
    )
    print(create_endpoint_config_response)

The SageMaker console has an Endpoint Configurations section where we can confirm the configuration.
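
Invalid ServerlessConfig values are only rejected when the API call is made, so it can be convenient to validate them locally first. The sketch below builds the production variant dictionary used above. The helper name and the limits (memory in 1 GB steps from 1024 to 6144 MB, concurrency between 1 and 200) are assumptions based on the documentation at the time of writing, so please verify them against the current API reference.

```python
# Assumed limits; check the current SageMaker API reference before relying on them.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def serverless_variant(model_name, memory_mb=2048, max_concurrency=1, variant_name="AllTraffic"):
    """Return a ProductionVariants entry for a serverless endpoint configuration."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}, got {memory_mb}")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(
            f"max_concurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}, got {max_concurrency}"
        )
    return {
        "ModelName": model_name,
        "VariantName": variant_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }
```

The returned dictionary can be used as ProductionVariants=[serverless_variant("<model-name>")] in create_endpoint_config().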

After configuring the endpoint, we can deploy it. This can take a few minutes. SageMaker Client offers the get_waiter() method that returns an object that can wait for some condition, in this case for an endpoint to be in service.

def create_endpoint():
    create_endpoint_response = sm_client.create_endpoint(
        EndpointName="<endpoint-name>",
        EndpointConfigName="<endpoint-config-name>",
    )
    print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
    resp = sm_client.describe_endpoint(EndpointName="<endpoint-name>")
    print("Endpoint Status: " + resp["EndpointStatus"])
    print("Waiting for {} endpoint to be in service".format("<endpoint-name>"))
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName="<endpoint-name>")

Finally, we can use the SageMaker Runtime Client for inference and invoking the endpoint.

import json

runtime_sm_client = boto3.client(service_name="sagemaker-runtime")

def invoke_endpoint():
    content_type = "application/json"
    # The body must match what /invocations expects, here a JSON object with an "input" key
    request_body = {"input": "<input-data>"}
    payload = json.dumps(request_body)
    response = runtime_sm_client.invoke_endpoint(
        EndpointName="<endpoint-name>",
        ContentType=content_type,
        Body=payload,
    )
    result = json.loads(response["Body"].read().decode())
    print(result)

CDK

The same process of creating a model, an endpoint configuration, and deployment of an endpoint can be achieved through a CDK application. This is usually a better option since we can manage the infrastructure from the code and deploy services as stacks.

In order to use the SageMaker constructs, we need to install the aws-cdk.aws-sagemaker module (the Python package for @aws-cdk/aws-sagemaker). There are L1 Cfn constructs for each resource that we need to configure: CfnModel, CfnEndpointConfig, and CfnEndpoint.

We will approach this by setting up a CDK application and a separate stack construct for SageMaker services.

Note

If you are not sure how to start with a CDK application, please see Your first AWS CDK app

The stack will have three methods:

create_model() - Creates an IAM role and a SageMaker model based on the Docker image name
create_endpoint_configuration() - Creates an endpoint configuration for a specific model
create_endpoint() - Deploys the endpoint based on the provided endpoint configuration

The code below implements a simple example of this stack:

from aws_cdk import core
from aws_cdk.aws_iam import Role, ManagedPolicy, ServicePrincipal
from aws_cdk.aws_sagemaker import CfnModel, CfnEndpointConfig, CfnEndpoint


class SageMakerStack(core.Stack):
    def __init__(
        self,
        scope: core.Construct,
        id_: str,
        env: core.Environment,
    ) -> None:
        super().__init__(scope=scope, id=id_, env=env)
        self.env = env

    def create_model(
        self,
        id_: str,
        model_name: str,
        image_name: str,
    ) -> CfnModel:
        role = Role(
            self,
            id=f"{id_}-SageMakerRole",
            role_name=f"{id_}-SageMakerRole",
            assumed_by=ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[
                ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")
            ],
        )
        container = CfnModel.ContainerDefinitionProperty(
            container_hostname="<container-hostname>",
            # Use the region from the stack environment instead of hardcoding it
            image="{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(
                self.env.account, self.env.region, image_name
            ),
        )
        return CfnModel(
            self,
            id=f"{id_}-SageMakerModel",
            model_name=model_name,
            execution_role_arn=role.role_arn,
            containers=[container],
        )

    def create_endpoint_configuration(
        self,
        id_: str,
        model_name: str,
        endpoint_configuration_name: str,
    ) -> CfnEndpointConfig:
        return CfnEndpointConfig(
            self,
            id=f"{id_}-SageMakerEndpointConfiguration",
            endpoint_config_name=endpoint_configuration_name,
            production_variants=[
                CfnEndpointConfig.ProductionVariantProperty(
                    model_name=model_name,
                    initial_variant_weight=1.0,
                    variant_name="AllTraffic",
                    serverless_config=CfnEndpointConfig.ServerlessConfigProperty(
                        max_concurrency=1,
                        memory_size_in_mb=2048,
                    ),
                )
            ],
        )

    def create_endpoint(
        self,
        id_: str,
        endpoint_configuration_name: str,
        endpoint_name: str,
    ) -> CfnEndpoint:
        return CfnEndpoint(
            self,
            id=f"{id_}-SageMakerEndpoint",
            endpoint_config_name=endpoint_configuration_name,
            endpoint_name=endpoint_name,
        )

Now we can use this stack class to deploy multiple models in one or more stacks.

from aws_cdk import core
from stacks.sagemaker import SageMakerStack


class SimpleExampleApp(core.App):

    def __init__(self) -> None:
        super().__init__()
        env = core.Environment(
            account="<account>",
            region="<region>",
        )

        sagemaker = SageMakerStack(
            scope=self,
            id_="app-sagemaker-stack",
            env=env,
        )
        model = sagemaker.create_model(
            id_="AppModel",
            model_name="<model-name>",
            image_name="<image-name>",
        )
        endpoint_config = sagemaker.create_endpoint_configuration(
            id_="AppEndpointConfiguration",
            model_name="<model-name>",
            endpoint_configuration_name="app-endpoint-configuration",
        )
        endpoint_config.add_depends_on(model)
        endpoint = sagemaker.create_endpoint(
            id_="AppEndpoint",
            endpoint_configuration_name="app-endpoint-configuration",
            endpoint_name="app-endpoint",
        )
        endpoint.add_depends_on(endpoint_config)

simple_app = SimpleExampleApp()
simple_app.synth()

Sometimes CDK cannot infer the right order in which to provision our resources. For example, the creation of the endpoint configuration may start before the model is defined, which would fail in this example. That's why we call A.add_depends_on(B) on each CfnResource: it tells the CDK that resource A should be created after resource B.
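
To illustrate what these explicit dependencies express, the snippet below orders the three resources with a topological sort from the Python standard library (graphlib, Python 3.9+). This is only a conceptual illustration of dependency ordering, not how CDK or CloudFormation is implemented, and the resource names are placeholders.

```python
from graphlib import TopologicalSorter

# Each key depends on the resources in its value set,
# mirroring A.add_depends_on(B) as "A": {"B"}.
dependencies = {
    "endpoint": {"endpoint_config"},
    "endpoint_config": {"model"},
    "model": set(),
}

# static_order() yields every resource after all of its dependencies.
provisioning_order = list(TopologicalSorter(dependencies).static_order())
print(provisioning_order)  # ['model', 'endpoint_config', 'endpoint']
```

Without the two dependency entries, any order would be valid, which is exactly the situation the add_depends_on() calls prevent.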

Now we can generate CloudFormation templates and deploy custom models for serverless inference as stacks that can be easily managed.

Note

If you also want to manage Docker images through AWS CDK, please take a look at AWS CDK Docker Image Assets. However, this approach will publish image assets to the CDK-controlled ECR repository. To publish Docker images to an ECR repository in your control, please see cdk-ecr-deployment.

Final Words

I hope that this article gave you a better understanding of how to implement a custom model using SageMaker and deploy it for serverless inference. The key concepts here are the configuration of a custom Docker image and the connection between a model, an endpoint configuration, and an endpoint.

The code examples are deliberately simplified and serve only to introduce the key concepts and ideas. For more information and examples please check out the official AWS repository for Advanced SageMaker Functionality Examples.

If you have any questions or suggestions, please reach out. I'm always available.

Resources

Footnotes

¹ SageMaker pricing page