Getting Started with Machine Learning on Bacalhau
Distributed machine learning needn't be complex with the help of Bacalhau.
Machine learning requires vast amounts of resources, and distributing those resources across multiple devices and regions helps with cost, speed, and data sovereignty. Bacalhau is an open-source distributed orchestration framework designed to bring compute to your data, where and when you need it.
Instead of moving large datasets across networks, Bacalhau makes it easy to execute jobs close to the data's location, reducing latency and resource overhead.
How does Bacalhau work?
Bacalhau is a single self-contained binary that you can run on bare metal, in containers, or as WebAssembly. A Bacalhau node can function as a client, an orchestrator, or a compute node, or all three at once. Bacalhau integrates with S3, local file systems, and other sources via HTTP endpoints, letting you pull data from a variety of locations.
You can install Bacalhau on any UNIX-like operating system with a one-line command:
curl -sL https://get.bacalhau.org/install.sh | bash
Or, for more flexibility, you can install with Docker, depending on whether you want to run jobs in containers or not.
To start an orchestrator node that schedules and manages jobs, run:
bacalhau serve --orchestrator
To start a compute node that executes workloads, run:
bacalhau serve --compute
A node can take on both roles if you specify both flags:
bacalhau serve --orchestrator --compute
Bacalhau’s architecture enables you to create compute networks that bridge traditional infrastructure boundaries. When you submit a job, Bacalhau determines which compute nodes are best positioned to process the data based on locality, availability, and defined constraints, without requiring manual data movement or constant connectivity.
Distributed machine learning with Bacalhau
This design allows for simple, flexible, and extensible execution, which is well-suited to distributing machine learning workloads.
For example, say you have a product recommendation model to train, and for regional and regulatory reasons, you want to train versions of it in the USA, Europe, and China.
To do this, use labels, which are key-value pairs that describe a node’s characteristics, capabilities, and properties. You can define these labels in a YAML file or as you start a node.
For example, to start a new orchestrator node that runs in the US, first create a config file:
# config.yaml
labels:
  region: us
Then, pass it to the node:
bacalhau serve --orchestrator --config config.yaml
Or to pass the config as you start the node:
bacalhau serve --orchestrator -c Labels="region=us"
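If you run nodes in several regions, you can generate one config file per region with a short script. A minimal sketch: the region list (us, eu, cn) and the config-&lt;region&gt;.yaml file naming are illustrative assumptions, not Bacalhau conventions.

```python
# Generate a labels config file per region (sketch; the region list and
# file naming below are illustrative assumptions).
from pathlib import Path

REGIONS = ["us", "eu", "cn"]

def render_config(region: str) -> str:
    """Render the YAML labels block for one region."""
    return f"labels:\n  region: {region}\n"

def write_configs(out_dir: str = ".") -> list[str]:
    """Write config-<region>.yaml for each region and return the paths."""
    paths = []
    for region in REGIONS:
        path = Path(out_dir) / f"config-{region}.yaml"
        path.write_text(render_config(region))
        paths.append(str(path))
    return paths
```

Each generated file can then be passed to a node with `--config`, exactly as above.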
Bacalhau nodes run jobs either in Docker containers or as WASM payloads. The rest of this post uses Docker.
To submit a job to a node that matches that label, use the --constraints argument:
bacalhau docker run --constraints "region=us" data-processor
Or, more conveniently, you can declare jobs in a job definition file ml-job-us.yaml:
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
You can use Bacalhau to submit jobs to multiple machines in each region and distribute them among multiple servers based on the anticipated load for each region, for example around Singles' Day in China, Christmas in Europe, or Black Friday in the USA.
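Submitting one job per region is easy to script. Here is a minimal sketch, assuming the regions us, eu, and cn and an image named data-processor (both illustrative); the helper only constructs the CLI invocation, and the commented-out lines show how you might actually run it against a live network.

```python
# Build one `bacalhau docker run` invocation per region (sketch; the
# region list and image name are illustrative assumptions).
import subprocess

REGIONS = ["us", "eu", "cn"]

def build_submit_command(region: str, image: str = "data-processor") -> list[str]:
    """Construct the CLI call that targets nodes labelled with this region."""
    return ["bacalhau", "docker", "run",
            "--constraints", f"region={region}", image]

# To actually submit (requires a running Bacalhau network):
# for region in REGIONS:
#     subprocess.run(build_submit_command(region), check=True)
```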
Retrieving processed data
Despite this global and regional distribution, you can still aggregate the results and apply federated learning or analysis across subsets of the data. For example, you might want an impression of trends at a regional or global level while choosing whether or not to run the learning itself in a specific region.
For basic job retrieval based on a region, you first find the details of jobs based on constraints. For example:
bacalhau job list --labels "region=us"
Then, download the results of a particular job:
bacalhau job get <jobID> --output-dir /destination/path
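The list-then-get pattern above can be composed in a script. A sketch under assumptions: the example job ID and output directory are illustrative, and the commented-out lines require a running network.

```python
# Compose the `bacalhau job list` and `bacalhau job get` invocations
# (sketch; the job ID and output directory below are illustrative).
import subprocess

def list_command(region: str) -> list[str]:
    """List jobs whose labels match the given region."""
    return ["bacalhau", "job", "list", "--labels", f"region={region}"]

def get_command(job_id: str, output_dir: str) -> list[str]:
    """Download one job's results to a local directory."""
    return ["bacalhau", "job", "get", job_id, "--output-dir", output_dir]

# Example (requires a running Bacalhau network):
# subprocess.run(list_command("us"), check=True)
# subprocess.run(get_command("j-1234", "/tmp/us-results"), check=True)
```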
However, doing this manually for every job isn’t particularly productive, so instead, you could expand on the job definition file mentioned earlier to define what to do with the results upon completion.
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
    Publisher:
      Type: s3
      Params:
        Bucket: us-results
        Key: us-results-folder
    ResultPaths:
      - Name: us-results
        Path: /outputs
This YAML configuration introduces a couple of other possibilities for handling job results. Instead of downloading the results to a user's computer, the job saves them to an /outputs directory on the node where it runs and then publishes them to an S3 bucket. You can still fetch the results manually from the outputs directory with the job get command.
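Since each region's job differs only in its constraint and its destination bucket, you can also generate the per-region job specs programmatically instead of hand-writing one YAML file per region. A sketch: the &lt;region&gt;-results bucket and key naming are illustrative assumptions.

```python
# Generate a per-region job spec as a Python dict (sketch; the bucket
# and key naming are illustrative assumptions, not Bacalhau defaults).
def make_job_spec(region: str) -> dict:
    """Build a batch job spec constrained to one region, publishing to S3."""
    return {
        "Type": "batch",
        "Count": 1,
        "Constraints": [
            {"Key": "region", "Operator": "=", "Values": [region]},
        ],
        "Tasks": [{
            "Name": "data-processor",
            # rest of the task definition goes here
            "Publisher": {
                "Type": "s3",
                "Params": {"Bucket": f"{region}-results",
                           "Key": f"{region}-results-folder"},
            },
            "ResultPaths": [{"Name": f"{region}-results", "Path": "/outputs"}],
        }],
    }
```

Serialized to YAML, each dict becomes a job definition file you can submit as shown earlier.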
Supplying data to process
Machine learning needs data to process, which, again, you often want to keep separate for practical or regulatory reasons.
With Bacalhau, you can mount input data from the local file system, an S3 bucket, IPFS, or an HTTP endpoint.
Either as an --input command line argument:
bacalhau docker run --input <SOURCE-URI>:<TARGET-PATH> data-processor
The argument consists of the URI of the storage location and the path where Bacalhau mounts it inside the container.
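The source:target pairing can be composed the same way for each region. A minimal sketch: the s3://&lt;region&gt;-sales-data bucket URI and the /inputs mount path are illustrative assumptions, not fixed Bacalhau names.

```python
# Compose `--input <source>:<target>` arguments (sketch; the bucket URI
# and mount path below are illustrative assumptions).
def input_flag(source_uri: str, target_path: str) -> str:
    """Join a storage URI and a container mount path into an --input value."""
    return f"{source_uri}:{target_path}"

def submit_with_input(region: str, image: str = "data-processor") -> list[str]:
    """Build a submit command that mounts the region's sales data."""
    uri = f"s3://{region}-sales-data"   # assumed bucket naming
    return ["bacalhau", "docker", "run",
            "--input", input_flag(uri, "/inputs"), image]
```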
Or, add the input details to the job definition file:
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
    InputSources:
      - Alias: input
        Target: /inputs              # where the data is mounted in the container
        Source:
          Type: localDirectory       # or s3, ipfs, urlDownload
          Params:
            SourcePath: /us-sales-data
    Publisher:
      Type: s3
      Params:
        Bucket: us-results
        Key: us-results-folder
    ResultPaths:
      - Name: us-results
        Path: /outputs
Summary
This post covered getting started with machine learning using Bacalhau, including the basic concepts for moving data into and out of Bacalhau securely and privately. To find out more, we recommend the more detailed installation guide, onboarding nodes to your network, and using jobs.
What's Next?
To start using Bacalhau, install Bacalhau and give it a shot.
If you don’t have a node network available and would still like to try Bacalhau, you can use Expanso Cloud. You can also set up a cluster on your own (with setup guides for AWS, GCP, Azure, and more 🙂).
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Reach out at any of the following locations:
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. Read more about the difference between open-source Bacalhau and commercially supported Bacalhau in the FAQ. If you want to use the pre-built binaries and receive commercial support, contact us or get your license on Expanso Cloud!