Improved Queuing For Jobs
5 Days of Bacalhau - Day 2
In distributed systems, a significant amount of data is used to track the status of the system's components; this data is often referred to as 'state'. Bacalhau has traditionally stored this state in memory for performance, but this approach also has drawbacks.
Durability: In-memory data is inherently volatile. In the event of a system crash or failure, memory-stored data can be lost, potentially causing data inconsistency. For systems like Bacalhau, reconstructing the state from whichever components are still running can be challenging.
Internal Stores: Even assuming the nodes lasted forever, there are instances where job queues need permanent storage or archiving. When jobs exist only in memory and are accessible solely via Bacalhau APIs, that data is hard to access and archive.
Diminished Scalability and Performance: Memory is a costly resource in terms of both hardware and operations, and memory spent holding system state is not available to compute jobs. It's more efficient to allocate memory to compute than to system state storage.
To date, many systems rely on a centralized service, such as a database, to maintain their state. But in highly distributed systems, this can be a challenge. Nodes that are scattered globally may not be able to check in regularly to understand what to do next. In a distributed world, we need a new solution!
Improved Queuing in Bacalhau 1.1
With this release, Bacalhau components now optionally store their state in a persistent BoltDB data store. This means that every node will maintain its own list of jobs in a more permanent and distributed way. Additionally, as each node communicates with the network, it will automatically sync (folks in the distributed world call this “achieving eventual consistency”) without any user intervention. In many ways, this is the best of both worlds!
There are also further advantages over in-memory-only queuing:
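To make the idea concrete, here is a minimal sketch of a node-local job store that survives a restart and converges with a peer. It is not Bacalhau's implementation: it uses Python's built-in `sqlite3` as a stand-in for BoltDB, and the `JobStore` class, the `revision` counter, and the "highest revision wins" sync rule are all assumptions made for illustration.

```python
import json
import os
import sqlite3
import tempfile


class JobStore:
    """Hypothetical node-local job store; sqlite3 stands in for BoltDB."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id TEXT PRIMARY KEY, revision INTEGER NOT NULL, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, job_id):
        """Return (revision, state) for a job, or None if it is unknown."""
        row = self.conn.execute(
            "SELECT revision, state FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

    def put(self, job_id, revision, state):
        """Store a record unless we already hold the same or a newer revision."""
        current = self.get(job_id)
        if current is not None and current[0] >= revision:
            return False
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs (id, revision, state) VALUES (?, ?, ?)",
            (job_id, revision, json.dumps(state)),
        )
        self.conn.commit()
        return True

    def items(self):
        """Return all records as (id, revision, state), ordered by id."""
        rows = self.conn.execute(
            "SELECT id, revision, state FROM jobs ORDER BY id"
        ).fetchall()
        return [(r[0], r[1], json.loads(r[2])) for r in rows]

    def sync_from(self, other):
        """Assumed sync rule: pull every record, keeping the highest revision."""
        for job_id, revision, state in other.items():
            self.put(job_id, revision, state)

    def close(self):
        self.conn.close()


def demo():
    workdir = tempfile.mkdtemp()
    path_a = os.path.join(workdir, "node_a.db")
    path_b = os.path.join(workdir, "node_b.db")

    node_a = JobStore(path_a)
    node_a.put("job-1", 1, {"status": "queued"})
    node_a.put("job-1", 2, {"status": "running"})
    node_a.close()  # simulate the node process stopping

    node_a = JobStore(path_a)  # "restart": state is read back from disk
    after_restart = node_a.get("job-1")

    node_b = JobStore(path_b)
    node_b.put("job-1", 3, {"status": "completed"})
    node_b.put("job-2", 1, {"status": "queued"})

    # Exchange records in both directions; both stores end up identical.
    node_a.sync_from(node_b)
    node_b.sync_from(node_a)
    result = (after_restart, node_a.items(), node_b.items())
    node_a.close()
    node_b.close()
    return result
```

Calling `demo()` returns the state read back after the simulated restart, followed by the contents of both stores after the exchange. The design choice worth noting is that each node only ever writes to its own file, so no central database is needed for the nodes to agree eventually.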
Reduced Memory Requirements: By using a persistent data store, Bacalhau can efficiently manage and store larger volumes of data without the constraints of limited node memory. For embedded and IoT solutions, this is particularly valuable.
Resilient, Distributed State: Persistent data stores are designed to ensure data durability and resilience. Even in the face of system failures or crashes, data remains intact, reducing the risk of data loss and enabling recovery processes on long-running jobs.
In addition to requester nodes storing the job state, each compute node also maintains a persistent record of historical executions. This means that state is far more resilient to failure; if Something Bad™️ happens to the network, collaborating nodes can reconstruct the distributed state, even if node connections are severed for a while.
Cost Savings: By storing data on a durable, non-volatile medium, organizations can better utilize their hardware resources, potentially reducing the demand for costly high-capacity memory solutions. For nodes that process a large number of low-memory jobs, it is no longer necessary to “overstuff” them with memory simply to hold job state.
Support for Much Longer Queues and Historical Analysis: Previously, queues were best kept short to prevent jobs from getting stuck or lost during server restarts. Now, state preservation extends through longer periods and system reboots. Additionally, with external data storage, performance of job queues can be analyzed, reported on, and audited, providing valuable insights into overall system performance.
Dynamically Taking Advantage of Changing Resources: In the past, job placement was determined solely at the time of submission. Now, if jobs are incomplete, the orchestrator can re-evaluate job placement at the point of scheduling.
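The last point, re-evaluating placement at scheduling time rather than at submission, can be sketched as follows. This is an illustrative toy, not Bacalhau's orchestrator: the `Node` and `Job` types, the single `free_cpu` resource, and the "most free CPU" selection rule are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float  # hypothetical single resource dimension


@dataclass
class Job:
    job_id: str
    cpu: float


def pick_node(job, nodes):
    """Return the fitting node with the most free CPU, or None if none fits."""
    candidates = [n for n in nodes if n.free_cpu >= job.cpu]
    return max(candidates, key=lambda n: n.free_cpu) if candidates else None


def schedule_pass(queue, nodes):
    """One scheduling pass: place what fits now, leave the rest queued."""
    placements = {}
    still_waiting = deque()
    while queue:
        job = queue.popleft()
        node = pick_node(job, nodes)  # decided against *current* capacity
        if node is None:
            still_waiting.append(job)
        else:
            node.free_cpu -= job.cpu
            placements[job.job_id] = node.name
    queue.extend(still_waiting)
    return placements


def demo():
    nodes = [Node("small", 1.0)]
    queue = deque([Job("job-a", 0.5), Job("job-b", 4.0)])

    first = schedule_pass(queue, nodes)   # job-b does not fit anywhere yet
    waiting = [job.job_id for job in queue]

    nodes.append(Node("large", 8.0))      # new capacity joins the network
    second = schedule_pass(queue, nodes)  # job-b is re-evaluated and placed
    return first, waiting, second, len(queue)
```

In `demo()`, the job that could not be placed at submission stays in the queue and is placed on a later pass once a larger node appears. With a persistent queue, that waiting job would also survive a restart between the two passes.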
Altogether, this change improves Bacalhau's scalability, durability, and cost-efficiency, delivering tangible benefits to users and organizations eager to tap into the power of distributed computation.
While in-memory storage has its place in certain use cases demanding ultra-low latency, which Bacalhau continues to support, the shift to a persistent data store opens the door to new possibilities and paves the way for a stronger and more adaptable system.
Contributing and More Information
We're looking for help in several areas. If you're interested in helping out, there are several ways to contribute, and you can always reach out to us via Slack or email.
For more information, see: