Creating cost-effective ML training infrastructure

Examine and understand why your training costs are so high and what it takes to build a cost-effective ML training infrastructure. …


This article was contributed by Bar Fingerman, head of development at Bria.

Our ML training costs were unsustainable

This all started with a challenge I faced at my company. Our product is driven by AI technology in the field of GANs, and our team is composed of ML researchers and engineers. As we worked to establish the core technology, multiple researchers began running many ML training experiments in parallel. Soon, we saw a huge spike in our cloud costs. It wasn’t sustainable, and we had to do something fast. But before I get to the cost-effective ML training solution, let’s understand why the training costs were so high.

We started using a very popular GAN called StyleGAN.

The above table shows the amount of time it takes to train this network depending on the number of GPUs and the desired output resolution. Let’s assume we have eight GPUs and want a 512×512 output resolution; we need an EC2 instance of type “p3.16xlarge” that costs $24.48 per hour, so we pay $6,120 for this experiment. But there’s more: before running the full experiment, researchers have to repeat several cycles of running shorter experiments, evaluating the results, changing the parameters, and starting all over again.
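The two figures quoted above also imply how long the full run takes. The short calculation below uses only the hourly price and the total cost from the paragraph; the variable names are illustrative.

```python
# Figures taken from the paragraph above.
HOURLY_RATE_USD = 24.48      # quoted p3.16xlarge price per hour
EXPERIMENT_COST_USD = 6120   # quoted cost of one full training run

# Training time implied by the two figures.
implied_hours = EXPERIMENT_COST_USD / HOURLY_RATE_USD
implied_days = implied_hours / 24

print(f"{implied_hours:.0f} hours (~{implied_days:.1f} days) on eight GPUs")
```

In other words, one full-resolution run keeps an eight-GPU machine busy for roughly 250 hours, which is why every shortened or interrupted run matters for the bill.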

So now the training cost for a single researcher can be anywhere from $8k to $12k per month. Multiply this by N researchers, and our monthly burn rate is off the charts.

We had to do something

Burning these sums every month is not sustainable for a small startup, so we had to find a solution that would dramatically reduce costs — but also improve developer velocity and scale fast.

Here is an outline of our solution:

Researcher: triggers a training job via a Python script (the script contains declarative instructions for building an ML experiment).

Training Job: is scheduled on AWS on top of Spot instances and is fully managed by our infrastructure.

Traceability: during training, metrics such as GPU stats and progress are sent to the researcher via Slack, and model checkpoints are uploaded automatically so they can be viewed in TensorBoard.
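The article does not show how the Slack updates are produced. One minimal way to do it, assuming the training machine has `nvidia-smi` available and the team has a Slack incoming webhook, is sketched below. All function names here are hypothetical and are not part of the infrastructure described in this article.

```python
import json
import subprocess
import urllib.request


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Query index, utilization, and memory use for every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


def format_progress(job_name, step, stats):
    """Build the plain-text body of a Slack progress message."""
    lines = [f"{job_name}: step {step}"]
    for s in stats:
        lines.append(f"GPU {s['gpu']}: {s['util_pct']}% util, {s['mem_mib']} MiB used")
    return "\n".join(lines)


def post_to_slack(webhook_url, text):
    """POST a message to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A training loop could call `post_to_slack(url, format_progress(name, step, read_gpu_stats()))` every few hundred steps. Parsing is kept separate from the subprocess call so it can be checked on a machine without a GPU.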

Developing the infrastructure

First, let’s review the smallest unit of the infrastructure, the Docker image.

The image is built around three steps that repeat in every training session and are exposed through a Python interface for abstraction. To integrate, an algorithm researcher adds a call to their training code inside the “train” function. Then, when the container starts, it fetches the training data from an S3 bucket and saves it on the local machine → calls the training function → saves the results back to S3.

The logic above is actually a class that is called when the Docker container starts. All the user needs to do is override the train function. For that, we provided a simple abstraction:

from resarch.some_algo import train_my_alg
from import TrainingSession  # module path omitted in the source


class Session(TrainingSession):
    def __init__(self):
        super().__init__(path_to_training_res_folder="/...")

    def train(self):
        super().train()
        train_my_alg(restore=self.training_env_var.resume_needed)
  • Inherit from TrainingSession, which does all the heavy lifting for the user.
  • Import the training function (line 1).
  • Set the path where the checkpoints are saved (line 7). The infrastructure backs up this path to S3 during training.
  • Override the “train” function and call the algorithm’s training code (lines 9–11).
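The internals of TrainingSession are not shown in this article. As a rough illustration of the three steps (fetch data, train, save results), a base class along these lines would be enough. The class name `TrainingSessionSketch`, the `run` method, and the injected `download`/`upload` callables are assumptions made for this sketch; in practice the callables would wrap S3 transfers.

```python
class TrainingSessionSketch:
    """Hypothetical stand-in for the TrainingSession base class."""

    def __init__(self, path_to_training_res_folder, download, upload):
        # download() and upload(path) are injected so the sketch runs
        # without AWS access; real ones would talk to S3.
        self.results_dir = path_to_training_res_folder
        self._download = download
        self._upload = upload

    def train(self):
        """Hook for subclasses; the base version does nothing."""

    def run(self):
        self._download()                    # step 1: fetch training data
        try:
            self.train()                    # step 2: researcher's training code
        finally:
            self._upload(self.results_dir)  # step 3: save results, even on failure


# Usage: record the order in which the three steps run.
calls = []


class DemoSession(TrainingSessionSketch):
    def train(self):
        super().train()
        calls.append("train")


DemoSession(
    "/tmp/results",
    download=lambda: calls.append("download"),
    upload=lambda path: calls.append(f"upload:{path}"),
).run()
print(calls)
```

Uploading in a `finally` block is a design choice of this sketch: partial results still reach storage if the training code raises.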

Starting a more cost-effective ML training job

To start a training job, we provided a simple declarative script via a Python SDK:

from import run  # module path omitted in the source
from import TrainingEnvVar, DataSourceEnvVar  # module path omitted in the source

env_vars = TrainingEnvVar(...)
run(env_vars=env_vars)
  • TrainingEnvVar – Declarative instructions for the experiment.
  • run – Publishes a message to an SNS topic, which starts a flow that runs a training job on AWS.
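The body of `run` is not shown here. A minimal sketch of the publishing step, assuming the experiment description is serialized to JSON and sent with boto3’s SNS `publish` call, could look like this. The names `TrainingEnvVarSketch` and `run_sketch`, the `topic_arn` parameter, and every field except `resume_needed` (which appears in the Session code above) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingEnvVarSketch:
    # Illustrative fields; the article does not list the real ones.
    experiment_name: str
    docker_image: str
    resume_needed: bool = False


def run_sketch(env_vars, topic_arn, sns_client=None):
    """Publish the experiment description to SNS and return the message ID."""
    if sns_client is None:
        import boto3  # imported lazily so the sketch loads without boto3
        sns_client = boto3.client("sns")
    message = json.dumps(asdict(env_vars))
    response = sns_client.publish(TopicArn=topic_arn, Message=message)
    return response["MessageId"]
```

Passing the client as an argument keeps the function easy to exercise with a stub before pointing it at a real topic.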

Triggering an experiment job

  • An SNS message with all the training metadata is sent (3). This is the same message the infra uses if we need to resume the job on another Spot instance.
  • The message is consumed by an SQS queue, which persists the state, and by a Lambda that fires a Spot request.
  • Spot requests are asynchronous, meaning that fulfillment can take time. When a Spot instance is up and running, a CloudWatch event is sent.
  • The Spot fulfillment event triggers a Lambda (4) that pulls the message with all the training job instructions from SQS (5).
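To make the second step concrete, here is a hedged sketch of a Lambda handler that reacts to the SNS message and asks EC2 for a Spot instance using boto3’s `request_spot_instances`. The handler name, the `image_id` and `instance_type` parameters, and the assumption that the SNS message body is JSON with an `experiment_name` field are all illustrative.

```python
import json


def handle_training_request(event, image_id, instance_type, ec2_client=None):
    """Fire one Spot request per SNS record; return the Spot request IDs."""
    if ec2_client is None:
        import boto3  # imported lazily so the sketch loads without boto3
        ec2_client = boto3.client("ec2")

    request_ids = []
    for record in event["Records"]:
        # SNS-triggered Lambdas receive the payload under Records[].Sns.Message.
        job = json.loads(record["Sns"]["Message"])
        print(f"Requesting Spot capacity for {job.get('experiment_name')}")
        response = ec2_client.request_spot_instances(
            InstanceCount=1,
            Type="one-time",
            LaunchSpecification={"ImageId": image_id, "InstanceType": instance_type},
        )
        request_ids.extend(
            r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]
        )
    return request_ids
```

The call returns as soon as the request is registered, not when capacity is granted, which is why the flow continues from a separate fulfillment event.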

Responding to interruptions in cost-effective ML training jobs

Before AWS reclaims a Spot instance, we get a CloudWatch notification. For this case, we added a Lambda trigger that connects to the instance and runs a recovery function inside the Docker image (1), which starts the above flow again from the top.
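The article does not say how the Lambda connects to the instance. One possible approach, assuming the instance runs the AWS Systems Manager agent, is to match the Spot interruption warning and send a shell command through SSM. The handler name and the `recovery_command` parameter are hypothetical; the event pattern and the `instance-id` field follow the EC2 Spot interruption event format.

```python
# Event pattern that matches the warning AWS emits shortly before reclaiming
# a Spot instance.
INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def handle_interruption(event, recovery_command, ssm_client=None):
    """Run a recovery command on the instance that is about to be reclaimed."""
    if ssm_client is None:
        import boto3  # imported lazily so the sketch loads without boto3
        ssm_client = boto3.client("ssm")

    instance_id = event["detail"]["instance-id"]
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [recovery_command]},
    )
    return response["Command"]["CommandId"]
```

Because the warning arrives only about two minutes before the instance is taken, the recovery command should do little more than flush the latest checkpoint and re-publish the job message.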

Starting cost-effective ML training

Lambda (6) is triggered by a CloudWatch event:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Request Fulfillment"]
}

It then connects to the new Spot instance and either resumes a training job from the last point where it stopped or starts a new job if the SNS (3) message was sent by the researcher.
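A sketch of how the fulfillment handler might read the job and choose between the two modes is shown below. It assumes the job was queued in SQS through an SNS subscription and that the message carries the `resume_needed` flag seen in the Session code earlier; the function names and the `queue_url` parameter are hypothetical.

```python
import json


def next_job_from_queue(queue_url, sqs_client=None):
    """Return the next job description from SQS, or None if the queue is empty."""
    if sqs_client is None:
        import boto3  # imported lazily so the sketch loads without boto3
        sqs_client = boto3.client("sqs")

    response = sqs_client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5
    )
    messages = response.get("Messages", [])
    if not messages:
        return None

    body = json.loads(messages[0]["Body"])
    # Without raw message delivery, SNS wraps the payload in a notification
    # envelope and the original text sits under "Message".
    if body.get("Type") == "Notification":
        return json.loads(body["Message"])
    return body


def start_mode(job):
    """Decide whether the new instance resumes a job or starts a fresh one."""
    return "resume" if job.get("resume_needed") else "new"
```

The sketch does not delete the message after reading it; whether and when to delete depends on how the queue is used to persist state across interruptions.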

After six months in production, the results were dramatic

The above metrics cover the development phase, when we spent two weeks building the cost-effective ML training infrastructure described here, followed by its day-to-day use by our team.

Let’s zoom in on one researcher using our platform. In July and August, they didn’t use the infra and ran K small experiments that cost ~$650. In September, they ran the same K experiments plus a few more, yet we cut the cost in half. In October, they more than doubled their experiments and the cost was only around $600.

Today, all Bria researchers are using our internal infra while benefiting from dramatically reduced costs and a vastly improved research velocity.


This story originally appeared on Copyright 2021

