We are excited to announce that Bacalhau 1.2 now supports running jobs with multiple GPUs, allowing for parallel execution of data-intensive algorithms for machine learning, natural language processing and model inference. Bacalhau users can now leverage the parallel processing capability of multiple GPUs per job to divide the workload and process it simultaneously.
But that's not all! Bacalhau now also supports AMD and Intel GPUs for the first time, in addition to Nvidia GPUs, and allows power users to combine GPUs from different vendors in the same job. With this update, users can harness AMD and Intel hardware for their machine learning and artificial intelligence tasks.
Faster computation with multi-GPU jobs
Using multiple GPUs in parallel takes advantage of more of the hardware available on Bacalhau compute nodes to process compute-intensive jobs more quickly.
Bacalhau is now able to use GPU parallelism to achieve substantial speed-ups on tasks like:
Training deep neural networks: each GPU can be assigned a portion of the training data or a subset of the neural network layers. Each GPU independently performs computations on its assigned data, and the results are combined to update the model.
Computationally intensive scientific simulations: when doing weather modeling or molecular dynamics simulations, different GPUs can simulate parts of the system or perform different iterations of the simulation simultaneously, reducing the overall simulation time.
Real-time analytics on big data sets: multiple GPUs can be used to simultaneously process different portions of a large volume of data in real-time, speeding up the analytics process.
Complex image and video processing algorithms: parallelized by assigning different GPUs to process different frames or sections of the image or video. The results are then combined to produce the final processed output.
Generating high-quality and realistic graphics: performance increased by assigning different GPUs to render different parts of the scene simultaneously. The outputs from each GPU are combined to create the final graphics output.
In all these cases, the parallel processing capability of multiple GPUs enables faster computation and more efficient utilization of computational resources, leading to improved performance and productivity in machine learning and artificial intelligence tasks.
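As a rough CPU-side analogy (not Bacalhau-specific, and using ordinary shell processes as stand-ins for GPUs), the data-parallel pattern described above amounts to splitting a batch of work items across independent workers and collecting the results:

```shell
# Eight work items (think: data shards) fanned out across four parallel
# workers, the way a data-parallel job assigns shards to GPUs.
# Output order is nondeterministic because the workers run concurrently.
seq 1 8 | xargs -P 4 -I{} echo "processing chunk {}"
```

In a real GPU job the "chunks" would be training batches, simulation regions, or video frames, and the per-worker results would be reduced back into a single model update or output.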
Running multi-GPU jobs is as simple as requesting the number of required GPUs when a job is submitted:
bacalhau docker run --gpu=4 ...
Bacalhau will find an appropriate compute node with the available number of GPUs and assign exclusive access to those GPUs to the submitted job. GPU jobs running in a Docker container will only see the GPUs that have been assigned, allowing a server with many GPUs to safely run many small jobs in parallel or process bigger jobs that use more hardware.
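For example, a complete submission might look like the following sketch (the container image and command are illustrative, not from the release notes):

```shell
# Request 2 GPUs and list the devices visible inside the container.
# Only the GPUs assigned to this job will appear in the output.
bacalhau docker run --gpu=2 nvidia/cuda:12.2.0-base-ubuntu22.04 -- nvidia-smi -L
```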
Node operators who want to limit the number of GPUs available per job (to allow more jobs to run in parallel, for example) can do so by setting a job resource limit config option, using our new config API:
bacalhau config set Node.Compute.Capacity.JobResourceLimits.GPU 1
More hardware choices with AMD and Intel support
With the support for using GPUs from AMD and Intel as well as Nvidia, Bacalhau empowers users to leverage the benefits of diverse GPU architectures. This allows for customized hardware configurations, improved performance, and greater flexibility in tackling complex machine learning and artificial intelligence workloads.
When it comes to accelerating computation in machine learning and artificial intelligence tasks, having the flexibility to use GPUs from different hardware vendors in Bacalhau jobs offers a range of benefits.
Increased scalability: With the ability to use GPUs from different vendors, users can scale their computational power more effectively. They can select the GPUs that best suit their specific needs, whether it's based on performance, cost, or other factors, allowing for greater flexibility in scaling up or down depending on the workload.
Enhanced performance optimization: Different GPU architectures have unique strengths and weaknesses. Users can optimize their workloads by assigning specific tasks to GPUs that excel in those areas. This results in improved performance and efficiency in executing compute-intensive algorithms.
Reduced vendor lock-in: Supporting GPUs from different hardware vendors reduces the risk of vendor lock-in. Users are not limited to a single vendor's ecosystem or hardware offerings, giving them the freedom to explore and utilize GPUs from different manufacturers.
Future-proofing: Technology evolves rapidly, and hardware advancements are constantly being made. By embracing GPUs from different vendors, users can future-proof their infrastructure. They can adapt to new GPU technologies and take advantage of the latest advancements in hardware without having to rewrite their whole computational stack.
Bacalhau’s support for AMD and Intel GPUs is as close to plug-and-play as you can get. Simply start a Bacalhau compute node on a host with AMD or Intel GPUs attached to have them automatically detected and made available to the rest of the cluster.
The only requirements are that the host has the standard rocm-smi tool installed, which Bacalhau uses to query GPU characteristics, and that the Bacalhau user has read and write access to the /dev/dri and /dev/kfd device trees.
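A quick pre-flight sketch that mirrors the requirements above (assuming a POSIX shell; it only reports status, so it is safe to run on any host):

```shell
# Check the prerequisites for an AMD-capable compute node:
# 1) rocm-smi is on PATH, 2) the GPU device trees are readable and writable.
if command -v rocm-smi >/dev/null 2>&1; then
  echo "rocm-smi: found"
else
  echo "rocm-smi: missing"
fi
for dev in /dev/dri /dev/kfd; do
  if [ -r "$dev" ] && [ -w "$dev" ]; then
    echo "$dev: accessible"
  else
    echo "$dev: not accessible"
  fi
done
```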
What’s more, Bacalhau now exposes the rich GPU information it collects via its Nodes API, allowing cluster users to understand the capabilities of the GPUs on their network:
$ curl -sL http://bootstrap.staging.bacalhau.org:1234/api/v1/orchestrator/nodes | jq '.Nodes[2].ComputeNodeInfo.MaxCapacity.GPUs'
[
  {
    "Index": 0,
    "Name": "Tesla T4",
    "Vendor": "NVIDIA",
    "Memory": 15360
  }
]
Install Bacalhau 1.2 today
With support for multiple GPUs and the inclusion of AMD and Intel hardware, Bacalhau continues to push the boundaries of distributed computation and provide an efficient platform for running machine learning algorithms. Users can now benefit from faster computation, improved performance, and enhanced productivity in their AI workloads.
Try out the new multi-GPU and AMD/Intel GPU support in Bacalhau Docker jobs today and experience the power of accelerated computation in your performance-intensive projects!
5 Days of Bacalhau 1.2 Blog Series
If you’re interested in exploring our other 1.2 features in more detail, check back tomorrow for our next 5 Days of Bacalhau blog post.
Day 1 - Job Templates
Day 2 - Streamlined Node Bootstrap
Day 5 - Instrumenting WebAssembly: Enhanced Telemetry with Dylibso Observe SDK
How to Get Involved
We're looking for help in several areas. If you're interested in contributing, please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!