Getting Started with Machine Learning on Bacalhau
Distributed machine learning needn't be complex with the help of Bacalhau.
Machine learning requires vast amounts of resources, and distributing those resources across multiple devices and regions helps with cost, speed, and data sovereignty. Bacalhau is an open-source distributed orchestration framework designed to bring compute to your data, where and when you need it.
Instead of moving large datasets across networks, Bacalhau makes it easy to execute jobs close to the data's location, reducing latency and resource overhead.
How does Bacalhau work?
Bacalhau is a single self-contained binary that you can run on bare metal, in containers, or as WebAssembly. A Bacalhau node can function as a client, an orchestrator, or a compute node, or all three at once. Bacalhau integrates with S3, local file systems, and other sources via HTTP endpoints, letting you pull data from a variety of locations.
You can install Bacalhau on any UNIX-like operating system with a one-line command:
curl -sL https://get.bacalhau.org/install.sh | bash
Or, for more flexibility, you can install with Docker, depending on whether you want to run jobs in containers or not.
To start an orchestrator node that schedules and manages jobs, run:
bacalhau serve --orchestrator
To start a compute node that executes workloads, run:
bacalhau serve --compute
A node can take on both roles if you specify both flags:
bacalhau serve --orchestrator --compute
Bacalhau’s architecture enables you to create compute networks that bridge traditional infrastructure boundaries. When you submit a job, Bacalhau determines which compute nodes are best positioned to process the data based on locality, availability, and defined constraints, without requiring manual data movement or constant connectivity.
Distributed machine learning with Bacalhau
This design allows for simple, flexible, and extensible execution, which is well-suited to distributing machine learning workloads.
For example, say you have a product recommendation model to train, and for regional and regulatory reasons, you want to train versions of it in the USA, Europe, and China.
To do this, use labels, which are key-value pairs that describe a node’s characteristics, capabilities, and properties. You can define these labels in a YAML file or as you start a node.
For example, to start a new orchestrator node that runs in the US, first create a config file:
# config.yaml
labels:
  region: us
Then, pass it to the node:
bacalhau serve --orchestrator --config config.yaml
Or to pass the config as you start the node:
bacalhau serve --orchestrator -c Labels="region=us"
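If you run nodes in several regions, you can generate one config file per region with a short script. A minimal sketch: the region list (us, eu, cn) and the config-&lt;region&gt;.yaml file naming are illustrative assumptions, not Bacalhau conventions.

```python
# Generate a labels config file per region (sketch; the region list and
# file naming below are illustrative assumptions).
from pathlib import Path

REGIONS = ["us", "eu", "cn"]

def render_config(region: str) -> str:
    """Render the YAML labels block for one region."""
    return f"labels:\n  region: {region}\n"

def write_configs(out_dir: str = ".") -> list[str]:
    """Write config-<region>.yaml for each region and return the paths."""
    paths = []
    for region in REGIONS:
        path = Path(out_dir) / f"config-{region}.yaml"
        path.write_text(render_config(region))
        paths.append(str(path))
    return paths
```

Each generated file can then be passed to a node with `--config`, exactly as above.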
Bacalhau nodes run jobs either in Docker containers or as WASM payloads. The rest of this post uses Docker.
To submit a job to a node that matches that label, use the --constraints argument:
bacalhau docker run --constraints "region=us" data-processor
Or, more conveniently, you can declare jobs in a job definition file ml-job-us.yaml:
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
You can use Bacalhau to submit jobs to multiple machines in each region and distribute them among multiple servers based on the anticipated load for each region, for example around Singles' Day in China, Christmas in Europe, or Black Friday in the USA.
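Submitting one job per region is easy to script. Here is a minimal sketch, assuming the regions us, eu, and cn and an image named data-processor (both illustrative); the helper only constructs the CLI invocation, and the commented-out lines show how you might actually run it against a live network.

```python
# Build one `bacalhau docker run` invocation per region (sketch; the
# region list and image name are illustrative assumptions).
import subprocess

REGIONS = ["us", "eu", "cn"]

def build_submit_command(region: str, image: str = "data-processor") -> list[str]:
    """Construct the CLI call that targets nodes labelled with this region."""
    return ["bacalhau", "docker", "run",
            "--constraints", f"region={region}", image]

# To actually submit (requires a running Bacalhau network):
# for region in REGIONS:
#     subprocess.run(build_submit_command(region), check=True)
```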
Retrieving processed data
Despite this global and regional distribution, you can still aggregate the results and apply federated learning or analysis across subsets of the data. For example, you might want an impression of trends at a regional or global level while choosing whether or not to run the learning itself in a specific region.
For basic job retrieval based on a region, you first find the details of jobs based on constraints. For example:
bacalhau job list --labels "region=us"
Then, download the results of a particular job:
bacalhau job get <jobID> --output-dir /destination/path
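The list-then-get pattern above can be composed in a script. A sketch under assumptions: the example job ID and output directory are illustrative, and the commented-out lines require a running network.

```python
# Compose the `bacalhau job list` and `bacalhau job get` invocations
# (sketch; the job ID and output directory below are illustrative).
import subprocess

def list_command(region: str) -> list[str]:
    """List jobs whose labels match the given region."""
    return ["bacalhau", "job", "list", "--labels", f"region={region}"]

def get_command(job_id: str, output_dir: str) -> list[str]:
    """Download one job's results to a local directory."""
    return ["bacalhau", "job", "get", job_id, "--output-dir", output_dir]

# Example (requires a running Bacalhau network):
# subprocess.run(list_command("us"), check=True)
# subprocess.run(get_command("j-1234", "/tmp/us-results"), check=True)
```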
However, doing this manually for every job isn’t particularly productive, so instead, you could expand on the job definition file mentioned earlier to define what to do with the results upon completion.
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
    Publisher:
      Type: s3
      Params:
        Bucket: us-results
        Key: us-results-folder
    ResultPaths:
      - Name: us-results
        Path: /outputs
This YAML configuration introduces a couple of other possibilities for handling job results. Instead of downloading the results to a user's computer, the job saves them to an /outputs directory on the node where it runs and then publishes them to an S3 bucket. You can still fetch the results manually from the outputs directory with the job get command.
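Since each region's job differs only in its constraint and its destination bucket, you can also generate the per-region job specs programmatically instead of hand-writing one YAML file per region. A sketch: the &lt;region&gt;-results bucket and key naming are illustrative assumptions.

```python
# Generate a per-region job spec as a Python dict (sketch; the bucket
# and key naming are illustrative assumptions, not Bacalhau defaults).
def make_job_spec(region: str) -> dict:
    """Build a batch job spec constrained to one region, publishing to S3."""
    return {
        "Type": "batch",
        "Count": 1,
        "Constraints": [
            {"Key": "region", "Operator": "=", "Values": [region]},
        ],
        "Tasks": [{
            "Name": "data-processor",
            # rest of the task definition goes here
            "Publisher": {
                "Type": "s3",
                "Params": {"Bucket": f"{region}-results",
                           "Key": f"{region}-results-folder"},
            },
            "ResultPaths": [{"Name": f"{region}-results", "Path": "/outputs"}],
        }],
    }
```

Serialized to YAML, each dict becomes a job definition file you can submit as shown earlier.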
Supplying data to process
Machine learning needs data to process, which, again, you often want to keep separate for practical or regulatory reasons.
With Bacalhau, you can mount input data from the local file system, an S3 bucket, IPFS, or an HTTP endpoint.
Either as an --input command line argument:
bacalhau docker run --input <SOURCE-URI>:<TARGET-PATH> data-processor
The argument consists of the URI of the storage location and the path where Bacalhau mounts it inside the container.
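The source:target pairing can be composed the same way for each region. A minimal sketch: the s3://&lt;region&gt;-sales-data bucket URI and the /inputs mount path are illustrative assumptions, not fixed Bacalhau names.

```python
# Compose `--input <source>:<target>` arguments (sketch; the bucket URI
# and mount path below are illustrative assumptions).
def input_flag(source_uri: str, target_path: str) -> str:
    """Join a storage URI and a container mount path into an --input value."""
    return f"{source_uri}:{target_path}"

def submit_with_input(region: str, image: str = "data-processor") -> list[str]:
    """Build a submit command that mounts the region's sales data."""
    uri = f"s3://{region}-sales-data"   # assumed bucket naming
    return ["bacalhau", "docker", "run",
            "--input", input_flag(uri, "/inputs"), image]
```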
Or, add the input details to the job definition file:
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values:
      - us
Tasks:
  - Name: "data-processor"
    # rest of job definition
    InputSources:
      - Alias: input
        Target: /inputs              # where the data is mounted in the container
        Source:
          Type: localDirectory       # or s3, ipfs, urlDownload
          Params:
            SourcePath: /us-sales-data
    Publisher:
      Type: s3
      Params:
        Bucket: us-results
        Key: us-results-folder
    ResultPaths:
      - Name: us-results
        Path: /outputs
Summary
This post covered getting started with machine learning using Bacalhau, including the basic concepts for moving data into and out of Bacalhau securely and privately. To find out more, we recommend the more detailed installation guide, onboarding nodes to your network, and using jobs.
What's Next?
To start using Bacalhau, install Bacalhau and give it a shot.
If you don’t have a node network available and would still like to try Bacalhau, you can use Expanso Cloud. You can also set up a cluster on your own (with setup guides for AWS, GCP, Azure, and more 🙂).
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Reach out at any of the following locations:
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. Read more about the difference between open-source Bacalhau and commercially supported Bacalhau in the FAQ. If you want to use the pre-built binaries and receive commercial support, contact us or get your license on Expanso Cloud!