Distributed Compute Platform Bacalhau Launches Release 1.1
5 Days of Bacalhau - Day 1
We’re excited to announce that MotherDuck/DuckDB and Bacalhau/Expanso have partnered to deliver an enterprise-grade logging solution: Unified Data Log Insights, leveraging Bacalhau and MotherDuck for advanced file querying across distributed networks.
The newly released Bacalhau 1.1 adds several new features designed to meet our customers’ needs. It has never been easier to deploy Bacalhau in high-performance scenarios like the one described above.
What's New in 1.1?
Full Fleet Targeting
We learned that our users want to execute single jobs and run simultaneous operations across their entire node fleet. With the new --target=all option, you can execute jobs and queries in parallel on all matching nodes in your network with a single command. This makes it easy to get a comprehensive view of your entire infrastructure, roll out updates immediately, and manage edge device fleets with less effort.
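As a quick sketch (the image and command here are placeholders, not from the release notes), running one diagnostic job across every node might look like:

```shell
# Run a single job in parallel on all matching nodes in the network.
# `ubuntu` and `uname -a` are placeholder image/command choices.
bacalhau docker run --target=all ubuntu -- uname -a
```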
New Node CLI and APIs
Fleet management has improved, giving enterprises a fuller view of their deployment footprint and more detailed node information. In Bacalhau 1.1, we are adding two new commands to make this much easier:
bacalhau node list, which outputs a table of all the nodes in a network
bacalhau node describe, which outputs the entire configuration for a node
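A minimal usage sketch of the two commands (the node ID is a placeholder):

```shell
# Print a table of all nodes in the network
bacalhau node list

# Print the entire configuration of one node; QmNodeID is a placeholder
bacalhau node describe QmNodeID
```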
Longer Job Timeouts
Bacalhau now supports running jobs for extended periods without timing out, enabling long-running intensive computations! Users and node operators can configure custom timeouts if needed; by default, there is no execution timeout limit. The two flags that control this are:
--timeout, to set the requested timeout for a job.
--max-timeout, to set the maximum timeout allowed on a node.
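For illustration, assuming both flags take a duration in seconds (as the error messages elsewhere in this release suggest) and using placeholder image and workload:

```shell
# Client side: request a 30-minute timeout for this job
bacalhau docker run --timeout 1800 ubuntu -- sleep 600

# Node side: cap jobs accepted by this node at one hour
bacalhau serve --max-timeout 3600
```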
Richer Node Configuration
Bacalhau 1.1.0 offers a wider range of customizable options for your setup, including persistent config files, command flags, and environment variables.
This improved flexibility allows you to tailor Bacalhau to your preferences. For insights into new options like config.yaml and updates from v1.0.3, refer to the latest configuration guide.
Here is a sample configuration file:
Node:
  ClientAPI:
    Host: bootstrap.production.bacalhau.org
    Port: 1234
  User:
    KeyPath: /home/user/.bacalhau/user_id.pem
This method replaces the old way of using many different command line flags, making it easier to deploy Bacalhau to nodes.
⚠️ NOTE: Existing Bacalhau users may need to follow migration steps to retain their previous configurations.
Support for TLS on Public APIs
Bacalhau now supports secure client-server communication using TLS certificates. These certificates help prevent eavesdropping and ensure data remains secure while moving between the client and the Bacalhau network. Setting up TLS is simple, requiring only a few extra lines in your setup, and it offers essential encryption for your jobs and data.
You can make use of free certificates from Let's Encrypt or supply your own certificate and private key.
To enable TLS, specify the certificate and key paths either in the Bacalhau config file or via CLI flags. In a sample configuration file, it looks like this:
Node:
  ServerAPI:
    TLS:
      ServerCertificate: /root/hostname.crt
      ServerKey: /root/hostname.prv
We also support auto-provisioning of TLS certificates from Let’s Encrypt using the following setting in your configuration file:
Node:
  ServerAPI:
    TLS:
      AutoCert: example.com
Optional External Storage of Jobs and Executions
To date, all job information has been stored in the memory of the running server. This works well for many, but some users wanted this information stored externally to preserve job information across server restarts.
Bacalhau 1.1 adds support for external job storage. Job histories can now outlive the nodes that ran them. The benefits include improved recordkeeping for auditing, the ability to restart interrupted jobs, and better insights from long-term job analytics. Node operators can also configure Bacalhau to save this information on storage solutions such as IPFS, S3, etc. to securely archive job data. Once enabled, all job information will be versioned and stored in the external system and protected from loss even if nodes go offline. This persistence unlocks new use cases and visibility for Bacalhau users.
Learn how to configure persistence.
Improved Error Messages
We’ve heard your feedback asking for clearer error reporting, so Bacalhau now explains why jobs fail. For example, many errors that previously reported only "not enough nodes to run job" will now provide details like:
Could not inspect image - could be due to repo/image not existing, or registry needing authorization
Job timeout 1800s exceeds maximum possible value 300s
Let us know if you have other issues where the errors aren’t clear!
Fine-Grained Control Over Image Entrypoint and Parameters
Users now have finer control over the entrypoint and parameters passed to a Docker image. Previously, Bacalhau would ignore the image's default entrypoint and replace it with the first argument after bacalhau docker run <image>. Now, the image's default entrypoint is used, and all of the positional arguments are passed as the command to that entrypoint.
The entrypoint can still be explicitly overridden by using the --entrypoint flag or by setting the Entrypoint field in a Docker job spec.
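A sketch of the new behavior (image names and arguments are placeholders):

```shell
# New default: the image's own ENTRYPOINT runs, and the positional
# arguments become the command passed to it
bacalhau docker run myrepo/mytool:latest -- --input /inputs/data.csv

# Explicit override: replace the image's entrypoint entirely
bacalhau docker run --entrypoint /bin/sh myrepo/mytool:latest -- -c "echo hello"
```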
GPU Support Inside Docker Containers
When running ML models, nothing beats custom hardware like GPUs. Bacalhau 1.1 can now automatically utilize GPUs when the Bacalhau node is itself running inside a Docker container. Ensure that the container hosting the Bacalhau node is started with GPU access (for example, by passing --gpus to docker run).
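An illustrative sketch (the image name is an assumption; --gpus is a standard Docker flag, not a Bacalhau one):

```shell
# Expose the host's GPUs to the container running the Bacalhau node
docker run --gpus all ghcr.io/bacalhau-project/bacalhau:latest serve
```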
Support for Private IPFS Clusters
Most enterprise workloads need the privacy enabled by running clusters disconnected from the external world. While IPFS is a terrific protocol for moving data around, the default mechanism for doing so requires moving through public gateways.
Bacalhau 1.1 now enables connecting to existing private IPFS clusters. To connect to a private swarm, pass the path to a swarm key via --ipfs-swarm-key, set the BACALHAU_IPFS_SWARM_KEY environment variable, or configure the Node.IPFS.SwarmKeyPath configuration property.
When connecting to a private swarm, Bacalhau will no longer bootstrap using or connect to public peers and will rely on the swarm for all data retrieval.
⚠️ NOTE: The environment variable must also be set when using a client that runs bacalhau get to download from a private IPFS swarm. Setting the environment variable is NOT necessary if using the --ipfs-connect flag, which can already connect to IPFS nodes running in a private swarm.
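Putting the first two options together in a sketch (paths and the job ID are placeholders):

```shell
# Option 1: CLI flag on the node
bacalhau serve --ipfs-swarm-key /etc/bacalhau/swarm.key

# Option 2: environment variable (also needed for `bacalhau get` clients)
export BACALHAU_IPFS_SWARM_KEY=/etc/bacalhau/swarm.key
bacalhau get QmJobID
```

The third option, Node.IPFS.SwarmKeyPath, goes in the node's configuration file.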
We are hard at work developing long-running jobs and pluggable executors for future releases.
Long-Running Jobs (Experimental)
To date, the idea of a Bacalhau job was finite: it started, it did some work, and then it finished.
However, in many cases, the cost of starting the job was significant (such as a database, loading a model into memory, etc.), or the response time needed for the job was very short (faster than the time it would take even a fast container to start). That’s why we developed long running jobs.
In Bacalhau 1.1, jobs can now run indefinitely and will automatically restart when nodes come back online, allowing for continuous and uninterrupted processing. Long-running jobs let compute workloads process data that arrives continuously, making them perfect for tasks such as pre-filtering logs, processing real-time analytics, or working with edge sensors.
With the introduction of long-running jobs, ML inference tasks can now operate in a "warm-boot" environment. This means that the necessary resources and dependencies are already loaded, significantly reducing the time taken to run an inference job.
With this experimental feature, you can now unleash the power of Bacalhau to handle dynamic and ever-changing data streams, ensuring continuous and uninterrupted processing of your computational workloads.
Pluggable Executors (Experimental)
While Bacalhau supports Docker and WASM natively today, in many cases this is an unnecessary abstraction. If people just want to execute a curl command or a simple Python script, it would be far more convenient to specify that directly, instead of first going through container packaging. Or perhaps people want to use Bacalhau to configure and operate host nodes directly and don't want their job to be entirely contained within Docker. That's where pluggable executors come in.
Bacalhau 1.1 lays the groundwork for a simpler way to specify what to execute by relying on executor plugins. We're still putting the finishing touches on this feature and it will be ready to try in an upcoming release, but once it's ready you'll simply detail it all in the job spec:
Engine:
  Type: python
  Params:
    Repository: github.com/me/my-project
    Script: main.py
Or, you can use the familiar command-line interface and just run bacalhau run python main.py directly.
This will still run the job in an appropriate security and isolation context, but the details are now left to the Bacalhau runtime, which could choose to execute the script inside Docker, in a Python virtual environment on the host machine, or even as a WebAssembly binary if appropriate! It is like a remote executor, but better!
Planned Upcoming Release Items
We have LOTS more features in our roadmap:
Moving long-running jobs and pluggable executors to general availability
Hosted clusters and “burstable” clusters
Our new WebUI dashboard
OpenTelemetry tracing
And lots more! If you’d like something in particular, come tell us!
5 Days of Bacalhau Blog Series
If you’re interested in exploring these features more in depth, check back tomorrow for our 5 Days of Bacalhau.
Day 1 - Bacalhau 1.1 Release
Day 2 - Improved Queuing For Jobs
Day 3 - New Job Types
Day 5 - GPU Support for Docker Nodes
How to Get Involved
We're looking for help in several areas, and there are many ways to contribute. If you're interested, please reach out to us at any of the following locations.
As always, thank you for reading, and onward!
Your humble Bacalhau team.