Brando Koch

ML Engineer & CEO, privatesynapse.ai

How we deploy AI privately on AWS

Privacy without compromise is what privatesynapse.ai stands for. This blog post explores how we approach deploying private AI solutions on AWS seamlessly.

Visualization symbolizing private AI

Developing an AI model

Before deployment, an AI model needs to be created and trained. We argue that efficient model deployment starts with model development, because serving requirements need to be taken into account when designing the model architecture and choosing the model size.

When processing sequential data, for example, one can choose an architecture that applies the costly attention mechanism only over local regions of the sequence or over the whole sequence. This is a tradeoff: full attention scales quadratically with sequence length while local attention scales with the window size, so the decision carries assumptions about both the data and the expected serving performance (a rough cost sketch follows below). Other factors that influence model use at scale include the model's batching capacity, decoding strategy, and the separation of preprocessing/postprocessing.
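
To make the tradeoff concrete, here is a rough back-of-the-envelope sketch comparing the cost of full attention against a windowed (local) variant. The sequence length, window size, and hidden dimension are illustrative assumptions, not numbers from any particular model.

```python
from typing import Optional


def attention_cost(seq_len: int, dim: int, window: Optional[int] = None) -> int:
    """Rough FLOP estimate for the two matrix products in dot-product attention.

    Without a window every token attends to every token: O(seq_len^2 * dim).
    With a local window each token attends to at most `window` tokens:
    O(seq_len * window * dim).
    """
    context = seq_len if window is None else min(window, seq_len)
    return 2 * seq_len * context * dim  # the two matmuls dominate the cost


# Illustrative numbers only: a 4096-token sequence with hidden size 512.
full = attention_cost(seq_len=4096, dim=512)
local = attention_cost(seq_len=4096, dim=512, window=256)
print(f"full attention ~{full:,} FLOPs, local attention ~{local:,} FLOPs")
```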

One can quickly see how these decisions influence the model's performance and cost in production. Because of this, we emphasize keeping the use case in mind at this stage.

Optimizing and exporting

After an AI model is trained and achieves satisfactory performance on evaluation benchmarks, it needs to be optimized for production. This step is about producing a deployable artifact with better serving characteristics: inference speed, model size, platform compatibility, and GPU memory requirements. Below we list a couple of common optimizations.

  1. Serving format export. Exporting an AI model to a common format such as ONNX or TorchScript serializes it and improves its interoperability and portability. This is much needed, as neural networks are developed with tools from various frameworks (e.g. TensorFlow or PyTorch). Serving solutions such as TorchServe and Triton Inference Server often rely on exactly these formats.
  2. Quantization. Quantization is one of the most widely used optimizations: it reduces model artifact size and GPU memory requirements with minimal impact on model accuracy. This is achieved by projecting model parameters from 32-bit precision down to 16-bit, 8-bit, or even 4-bit precision. A short sketch illustrating both steps follows this list.
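
As a hedged illustration of both optimizations, the sketch below exports a placeholder PyTorch model to ONNX and then applies dynamic INT8 quantization with ONNX Runtime. The model definition, file names, and input shape are assumptions made for the example, not part of any specific product.

```python
import torch
from onnxruntime.quantization import QuantType, quantize_dynamic

# A tiny placeholder standing in for any trained torch.nn.Module.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

# 1. Serving format export: serialize the model to ONNX for portability.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

# 2. Quantization: project 32-bit weights down to 8-bit integers.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)
```

A TorchScript export via torch.jit.trace, or a 16-bit cast via model.half(), follows the same pattern of producing a self-contained serving artifact.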

Serving Server

Once the AI model is optimized it is ready to be served; now a serving solution needs to be selected. Choosing the best serving solution depends highly on the requirements. A subset of the factors that may influence this decision is listed below:

  • Serving pattern (API, Worker ...)
  • Model export format (TorchScript, ONNX, TensorRT...)
  • Model complexity (RNN, encoder-decoder, beam search requirement...)
  • Model memory requirements (single GPU, multi-GPU, GPU cluster)
  • Inference complexity (One model pass or multiple)
  • Deployment platform (on-prem, Kubernetes, SageMaker...)
  • Advanced features (GPU sharing, dynamic batching...)
  • Support for ensembles
  • Hot reload functionality
  • Monitoring requirements
  • Degree of customizability

For custom usage patterns it is not unusual to build a bespoke containerized serving solution in which the model is packaged behind an API or as a queue worker (a minimal sketch follows below). Advanced model serving solutions such as the Triton Inference Server offer additional functionality such as hot reload, dynamic batching, GPU sharing, and ensemble serving.
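
As one possible shape of such a custom API, the sketch below wraps an exported ONNX model in a small FastAPI service. The model path, endpoint name, and input schema are assumptions made for the example and would differ from project to project.

```python
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Load the exported artifact once at startup; the path is an assumption.
session = ort.InferenceSession("model.int8.onnx")


class PredictRequest(BaseModel):
    features: list[list[float]]  # a batch of feature vectors


@app.post("/predict")
def predict(request: PredictRequest):
    batch = np.asarray(request.features, dtype=np.float32)
    # Single forward pass; input/output names match the export sketched earlier.
    (logits,) = session.run(["logits"], {"input": batch})
    return {"predictions": logits.argmax(axis=1).tolist()}
```

The same exported model file could just as well be consumed by a queue worker pulling jobs from a message broker.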

For packaging all of this into a versionable and highly reproducible environment, Docker containers are widely used. In our experience over the years, this step is where most AI products stop.

When companies deal with a technology as complex as artificial intelligence, we have noticed a growing need to deliver a solution that solves a business problem rather than a model that solves an AI problem. MLOps and DevOps skills are needed more than ever to achieve this.

Complete AI Solution

At privatesynapse.ai we don't just develop AI models; we create and deploy complete AI solutions.

This stands in sharp contrast to traditional third-party API-based AI offerings or standalone deployable models. At privatesynapse.ai we optimize and package the model along with the complete adjacent cloud infrastructure. Examples of this adjacent infrastructure include:

  • Encryption (at rest and in transit)
  • Autoscaling infrastructure
  • Adjacent NoSQL or SQL storage
  • Jobs on a trigger or on schedule
  • Networking
  • Logging
  • Monitoring

Significant performance benefits and savings can be achieved when the model is developed with the end environment in mind. And best of all, our AI models are completely private to the client's environment.

Packaging the solution as IaC

For deploying the adjacent AWS infrastructure we use Terraform, a cloud-agnostic infrastructure-as-code tool with resource state tracking. This enables us to centrally describe and essentially package all the infrastructure and AI models that are to be deployed. Before deploying with Terraform we establish access to the target AWS environment.

Establishing access to the target environment

We handle the deployment of our AI solution. To deploy the solution privately we request access to the target environment with the access rights required for the deployment operations. On AWS this involves gaining IAM user or role access to the one or more AWS accounts across which the solution is deployed. Once access is established the deployment can begin.
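
As an illustration of what establishing access can look like on AWS, the sketch below assumes a cross-account IAM deployment role with boto3. The role ARN and session name are placeholders; the exact access pattern is agreed with each client.

```python
import boto3

# Placeholder ARN for a deployment role granted in the client's account.
DEPLOY_ROLE_ARN = "arn:aws:iam::123456789012:role/private-ai-deployer"

sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn=DEPLOY_ROLE_ARN,
    RoleSessionName="private-ai-deployment",
)["Credentials"]

# A session scoped to the client's account, used by the deployment tooling.
deploy_session = boto3.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print(deploy_session.client("sts").get_caller_identity()["Arn"])
```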

Deploying infrastructure

Our AWS Terraform resources are packaged into customizable and interoperable modules. These modules accept input parameters that specify target environment preferences and dependencies, such as networking (e.g. in which subnet instances will be deployed). Before deployment a Terraform plan is generated, which validates the deployment configuration and outlines which resources will be created. Executing terraform apply then deploys the resources in the right order and the deployment is complete. A minimal sketch of this plan-then-apply flow is shown below.
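
The following is a minimal sketch of driving that flow from Python through the Terraform CLI. The module directory and variable file are hypothetical; the real deployment is parameterized per client environment.

```python
import subprocess

MODULE_DIR = "infrastructure/"          # hypothetical Terraform module root
VAR_FILE = "client-environment.tfvars"  # hypothetical per-client variables


def run_terraform(*args: str) -> None:
    """Run a Terraform CLI command inside the module directory."""
    subprocess.run(["terraform", *args], cwd=MODULE_DIR, check=True)


run_terraform("init")                                                 # fetch providers and modules
run_terraform("plan", f"-var-file={VAR_FILE}", "-out=deploy.tfplan")  # validate and preview
run_terraform("apply", "deploy.tfplan")                               # create resources in order
```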

This completes a high-level overview of how we create and deploy our AI solutions. Stay tuned for more detailed insights ahead of the upcoming release of our new AI product.