Enhancing Bacalhau’s Resiliency
We've rebuilt key Bacalhau functionality to provide an even more reliable platform for your workloads.
We're excited to announce that with Bacalhau v1.6.0, we’re releasing a complete redesign of our communication architecture that brings significant improvements to network reliability and resilience. At the heart of this release is the new Bacalhau Messaging Protocol (BMP), which elevates distributed communication with built-in resilience, automatic recovery, and guaranteed message delivery.
The Power of Reliable Distributed Computing
Distributed computing networks inherently face challenges in maintaining consistent communication across nodes. With Bacalhau v1.6, we've evolved our architecture to provide enhanced reliability and resilience for your computational workloads. By shifting from traditional request/response patterns to an event-driven model, we've created a more robust foundation for distributed computing.
Key Improvements
The new communication architecture brings fundamental advancements in how Bacalhau handles distributed operations:
Self-Healing Networks
Compute nodes and orchestrators automatically reconnect and sync after network interruptions
Nodes can operate offline and reconcile state when connectivity returns
All execution data and results are preserved during network disruptions
Reliable Message Delivery
Ordered, at-least-once message delivery between nodes
Built-in failure detection and recovery mechanisms
Event-based architecture decouples processing from delivery
Efficient handling of network partitions and reconnections
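To make "ordered, at-least-once delivery" concrete, here is a minimal sketch of the general technique. It is not Bacalhau's implementation: the `Message`, `Sender`, and `Receiver` names are hypothetical and chosen only for illustration. The sender keeps every message until it is acknowledged and redelivers anything unacknowledged, while the receiver uses sequence numbers to process messages in order and skip duplicates.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    seq: int      # monotonically increasing sequence number (assumed to start at 1)
    payload: str


@dataclass
class Receiver:
    """Processes messages in order and ignores redelivered duplicates."""
    last_seq: int = 0
    processed: list = field(default_factory=list)

    def handle(self, msg: Message) -> int:
        # Only the next expected message is processed. Duplicates and
        # messages that arrive after a gap leave the state unchanged.
        if msg.seq == self.last_seq + 1:
            self.processed.append(msg.payload)
            self.last_seq = msg.seq
        return self.last_seq  # acknowledgement: highest sequence processed


@dataclass
class Sender:
    """Keeps every message until the receiver has acknowledged it."""
    next_seq: int = 1
    pending: dict = field(default_factory=dict)

    def publish(self, payload: str) -> Message:
        msg = Message(self.next_seq, payload)
        self.pending[msg.seq] = msg
        self.next_seq += 1
        return msg

    def ack(self, acked_seq: int) -> None:
        # Drop everything up to and including the acknowledged sequence.
        for seq in [s for s in self.pending if s <= acked_seq]:
            del self.pending[seq]

    def unacked(self) -> list:
        return [self.pending[s] for s in sorted(self.pending)]


def demo():
    sender, receiver = Sender(), Receiver()
    m1, m2, m3 = (sender.publish(p) for p in ("a", "b", "c"))
    sender.ack(receiver.handle(m1))  # delivered and acknowledged
    receiver.handle(m2)              # delivered, but the ack is lost
    # m3 is never delivered. The sender redelivers everything unacknowledged:
    for msg in sender.unacked():
        sender.ack(receiver.handle(msg))
    return receiver.processed, sender.unacked()
```

In this sketch the second message is delivered twice but processed once, which is why at-least-once delivery is usually paired with duplicate detection on the receiving side.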
How Does It Work?
The new architecture introduces several key components that work together to ensure reliability:
Event Tracking: Each compute node and orchestrator tracks message sequence numbers, ensuring no updates are lost even during network partitions.
Asynchronous Communication: All interactions between orchestrators and compute nodes use a publish/subscribe model, reducing coupling and improving resilience.
Efficient Synchronization: Nodes periodically exchange their last processed sequence numbers, enabling efficient recovery from any missed messages during network disruptions.
Health Monitoring: Proactive health checks and connection management ensure rapid detection and recovery from failures.
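The event tracking and synchronization steps above can be sketched as follows. This is an illustrative model under stated assumptions, not the protocol's actual code: `EventLog`, `Peer`, and `resync` are hypothetical names, and sequence numbers are assumed to start at 1. After a disruption, the peer reports the last sequence number it processed and receives only the events it missed.

```python
class EventLog:
    """Append-only log that assigns sequence numbers starting at 1."""

    def __init__(self):
        self._events = []

    def append(self, payload):
        self._events.append(payload)
        return len(self._events)  # sequence number of the new event

    def since(self, last_seq):
        # Return (seq, payload) pairs for every event after last_seq.
        return [(i + 1, p) for i, p in enumerate(self._events) if i + 1 > last_seq]


class Peer:
    """Tracks the last sequence number it has processed."""

    def __init__(self):
        self.last_processed = 0
        self.state = []

    def apply(self, seq, payload):
        # Accept only the next expected event so state is built in order.
        if seq != self.last_processed + 1:
            return False
        self.state.append(payload)
        self.last_processed = seq
        return True


def resync(log, peer):
    """Replay the events the peer missed; return how many were replayed."""
    missed = log.since(peer.last_processed)
    for seq, payload in missed:
        peer.apply(seq, payload)
    return len(missed)
```

For example, if a peer has processed event 1 and two more events are appended while it is disconnected, `resync` replays exactly those two events, and a second call replays nothing.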
Real-World Scenarios and Use Cases
These improvements make a significant difference in common distributed computing scenarios:
Edge Computing and IoT
Compute nodes at edge locations can continue processing data even during intermittent connectivity
Results are automatically synchronized when connections are restored
No data loss during network fluctuations
Large-Scale Data Processing
Better handling of long-running jobs across distributed nodes
Automatic recovery from partial network failures
Reliable job state tracking across large clusters
Research and Scientific Computing
More reliable execution of complex computational workflows
Better handling of resource-intensive jobs
Automatic recovery from infrastructure interruptions
Cloud-to-Edge Deployments
Seamless operation across different network conditions
Reliable job execution spanning multiple regions
Graceful handling of network latency and partitions
Seamless Upgrade Path
We've ensured a smooth transition path for existing deployments. Bacalhau v1.6 maintains full backward compatibility by supporting both the existing request/response protocol and the new messaging protocol:
v1.5 compute nodes can join networks managed by v1.6 orchestrators
v1.6 compute nodes can work with v1.5 orchestrators
Mixed-version deployments are fully supported during gradual upgrades
v1.6 nodes automatically use the appropriate protocol based on the version of the node they're communicating with
This means you can upgrade your infrastructure at your own pace without disrupting ongoing operations.
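As a rough illustration of version-based protocol selection, the sketch below picks a protocol from a peer's reported version. How Bacalhau actually negotiates this is not described here, so the function names, the protocol labels, and the plain `major.minor.patch` version format are all assumptions made for the example.

```python
# Assumption: the new messaging protocol is available from v1.6 onward.
BMP_MIN_VERSION = (1, 6)


def parse_version(version: str) -> tuple:
    # "v1.6.0" -> (1, 6, 0). Pre-release suffixes are not handled in this sketch.
    return tuple(int(part) for part in version.lstrip("v").split("."))


def choose_protocol(peer_version: str) -> str:
    # Compare only major and minor numbers against the minimum version.
    if parse_version(peer_version)[:2] >= BMP_MIN_VERSION:
        return "messaging"
    return "request-response"
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating 1.10 as older than 1.6.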
Looking Forward
This new communication architecture lays the groundwork for future improvements, including:
Region-aware scheduling and execution
Enhanced job monitoring capabilities
More sophisticated failure handling mechanisms
Improved network resilience features
Getting Started
Ready to try the new features? Upgrade your Bacalhau installation to version 1.6.0!
You can:
Update your entire cluster at once
Perform a rolling upgrade starting with either orchestrators or compute nodes
Maintain mixed-version deployments during the transition
For more information:
Read our technical documentation
Check out the Bacalhau Messaging Protocol readme for implementation details
Review the v1.6.0 release notes
Join our community discussions
We're excited to see how these improvements help you build more reliable distributed computing applications. Your feedback and contributions help us continue improving and evolving the platform.
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!