In today’s fast-paced, interconnected world, managing computational resources across distributed networks presents unique challenges. As organizations expand globally and rely more on decentralized systems, they face increasing complexity in maintaining efficiency and performance.
If you’re dealing with:
Geographically Dispersed Resources: From edge devices in remote locations to data centers across continents.
Network Challenges: Varying levels of connectivity, from high-speed links to intermittent satellite connections.
Heterogeneous Hardware: A mix of cloud instances, on-premise servers, and specialized hardware.
Dynamic Workloads: Periods of low activity followed by intense bursts of computation.
You’re not alone. These factors create a perfect storm of complexity for job management and resource utilization. Traditional queuing systems often fall short, leading to inefficiencies and missed deadlines.
The Need for Advanced Queueing
That’s why we developed Bacalhau’s advanced job queueing system. It’s designed to tackle these challenges head-on, giving you unprecedented control and efficiency in managing large-scale, distributed workloads.
But That’s Not Me!
You might be thinking, “This is for advanced workloads; I’m fine as is.” But are you experiencing any of the following challenges?
Running your network in multiple zones and regions according to best practices?
Dealing with unreliable or intermittent network connections?
(hint: all networks are unreliable)Handling highly bursty batch jobs that overwhelm your system?
Struggling to efficiently utilize resources across different time zones?
Balancing workloads on a mix of cloud and on-premise hardware?
Needing to prioritize critical jobs amidst a sea of routine tasks?
If you nodded to any of these, Bacalhau’s queueing system could be the solution you’ve been looking for. Today, we’re excited to introduce a feature that addresses these common yet challenging scenarios in distributed computing environments.
Key Features of Bacalhau's Queueing System
Dynamic Job Prioritization
Intelligent Load Balancing
Fault Tolerance and Job Recovery
Scalable Queue Management
Real-time Monitoring and Reporting
Let's walk through a demonstration of these features in action!
Configuring Job Queuing in your Network
The job queuing feature is automatically enabled. To configure it, simply set a QueueTimeout
parameter. NOTE: If this parameter isn’t set, the system assumes the job needs to be executed immediately and will not queue it.
Node availability in your network is determined by capacity as well as job constraints such as label selectors, engines, or publishers. For example, jobs will be queued if all nodes are currently busy or if idle nodes do not meet the parameters specified in your job.
Note: Your Bacalhau Compute Nodes update their node, resource, and health information every 30 seconds to the Requester Nodes in the network. During this update period, multiple jobs may be allocated to a node. Each node will take as many jobs as allowed by its scheduling limits and store them in a local job queue created on your Compute Node.
How does it work?
At the requester node level, you can set default queuing behavior for all jobs by defining the QueueTimeout
parameter in the node's configuration file. Alternatively, within the job specification, you can include the QueueTimeout
parameter directly in the configuration YAML. This flexibility allows you to tailor the queuing behavior to meet the specific needs of your distributed computing environment, ensuring that jobs are efficiently managed and resources are optimally utilized.
Requester Node
Here’s a set of commands that configure your requester node, setting the default job queuing time for thirty minutes, the default total timeout covering execution time as well to one hour, and the interval between execution retry attempts to one minute
bacalhau config set node.requester.jobdefaults.queuetimeout 30m
bacalhau config set node.requester.jobdefaults.totaltimeout 1h
bacalhau config set node.requester.scheduler.queuebackoff 1m
To start this requester node, run the following command:
bacalhau serve --node-type requester
Submitting Your First Queued Job
Here’s a sample job specification setting the QueueTimeout
for this specific job.
Type: batch
Count: 1
Tasks:
- Name: main
Engine:
Type: docker
Params:
Image: ubuntu:latest
Entrypoint:
- /bin/bash
Parameters:
- -c
- sleep 90
Timeouts:
TotalTimeout: 3600
QueueTimeout: 1800
This will be submitted to the network and stay in “Pending” until the job is scheduled, or it times out. To get the status of any job, type:
bacalhau job describe <Job ID>
This will give you a full overview of the job and its executions on your network.
ID = j-489422d0-08bb-473f-a320-4c7126e5f111
Name = j-489422d0-08bb-473f-a320-4c7126e5f111
Namespace = default
Type = batch
State = Completed
Count = 1
Created Time = 2024-07-03 11:24:01
Modified Time = 2024-07-03 11:25:35
Version = 0
Summary
Completed = 1
Job History
TIME REV. STATE TOPIC EVENT
2024-07-03 11:24:01 1 Pending Submission Job submitted
2024-07-03 11:24:02 2 Running
2024-07-03 11:25:35 3 Completed
Executions
ID NODE ID STATE DESIRED REV. CREATED MODIFIED COMMENT
e-eb35b837 QmbxGSsM Completed Stopped 6 2m23s ago 49s ago Accepted job
Execution e-eb35b837 History
TIME REV. STATE TOPIC EVENT
2024-07-03 11:24:02 1 New
2024-07-03 11:24:02 2 AskForBid
2024-07-03 11:24:02 3 AskForBidAccepted Requesting Node Accepted job
2024-07-03 11:24:02 4 AskForBidAccepted
2024-07-03 11:24:02 5 BidAccepted
2024-07-03 11:25:35 6 Completed
CLI Command
You can also define timeouts for your jobs directly through the CLI using the --queue-timeout
flag. This method provides a convenient way to specify queuing behavior on a per-job basis, allowing you to manage job execution dynamically without modifying configuration files.
For example, here is how you can submit a job with a specified queue timeout using the CLI:
bacalhau docker run ubuntu sleep 90 –timeout 3600 --queue-timeout 1800
Note: Timeouts in Bacalhau are generally governed by the TotalTimeout value for your YAML specifications and the --timeout
flag for your CLI commands. The default total timeout value is 30 minutes. Declaring any queue timeout that is larger than that without changing the total timeout value will result in a validation error.
Executing Job Queueing in Bacalhau
Jobs will be queued when all available nodes are busy and when there is no node that matches your job specifications. Let’s have a look at how queuing will be executed within your network.
Queued Jobs will initially display the Queued
status. Using the bacalhau job describe
command will show both the state of the job and the reason behind queuing.
For busy nodes:
ID = j-d740ba46-b135-4161-bd79-795c94d215b0
Name = j-d740ba46-b135-4161-bd79-795c94d215b0
Namespace = default
Type = batch
State = Queued
Message = Job queued. not enough nodes to run job. requested: 1, available: 3, suitable: 0.
• Node n-b75224b7: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 79 GB, GPU: 0} and queue capacity {CPU: 2, Memory: 4.0 GB, Disk: 0 B, GPU: 0}
• Node n-d42422fd: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 83 GB, GPU: 0} and queue capacity {CPU: 3, Memory: 1.0 GB, Disk: 0 B, GPU: 0}
• Node n-f50db1f9: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 83 GB, GPU: 0}
For no matching nodes in the network:
ID = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Name = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Namespace = default
Type = batch
State = Queued
Message = Job queued. not enough nodes to run job. requested: 1, available: 4, suitable: 0.
• 3 of 4 nodes: labels map[Architecture:amd64 Operating-System:linux owner:bacalhau] don't match required selectors [name = walid]
• Node Qma5yQAk: labels map[Architecture:amd64 GPU-0:Tesla-T4 GPU-0-Memory:15360-MiB Operating-System:linux owner:bacalhau] don't match required selectors [name = walid]
Once appropriate node resources become available, these jobs will transition to either a `Running
` or `Completed
` status allowing more jobs to be assigned to matching nodes.
ID = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Name = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Namespace = default
Type = batch
State = Completed
Count = 1
Created Time = 2024-06-24 13:36:40
Modified Time = 2024-06-24 13:41:40
Version = 0
Summary
Completed = 1
Job History
TIME REV. STATE TOPIC EVENT
2024-06-24 13:36:40 1 Pending Submission Job submitted
2024-06-24 13:36:40 2 Queued Queueing Job queued. not
enough nodes to run
job.
2024-06-24 13:39:40 3 Running
2024-06-24 13:41:40 4 Completed
Executions
ID NODE ID STATE DESIRED REV. CREATED MODIFIED COMMENT
e-88cb1c72 n-73426e31 Completed Stopped 6 6m5s ago 4m4s ago
Accepted job
Execution e-88cb1c72 History
TIME REV. STATE TOPIC EVENT
2024-06-24 13:39:40 1 New
2024-06-24 13:39:40 2 AskForBid
2024-06-24 15:39:40 3 AskForBidAccepted Requesting Node Accepted
job
2024-06-24 13:39:40 4 AskForBidAccepted
2024-06-24 13:39:40 5 BidAccepted
2024-06-24 13:41:40 6 Completed
But Wait There’s More!
Job queuing is not just about more reliable and resilient scheduling. You can:
Vary the Job Priorities
Just set the Priority field in your Job file and you can schedule jobs in priority order:
Type: batch
Count: 1
Priority: 1 // Default is 0, which is the lowest priority
Tasks:
- Name: main
Engine:
Type: docker
Params:
Image: ubuntu:latest
Entrypoint:
- /bin/bash
Parameters:
- -c
- sleep 90
Timeouts:
TotalTimeout: 3600
QueueTimeout: 1800
Queued jobs will be distributed in priority order by the requester nodes, and if they are subsequent queue for execution on the compute node, they will again be executed in priority order. The value of the Priority
field is one of a descending order - Priority 5 will be executed before Priority 4, 4 before 3 ad infinitum.
Dynamic Job Distribution
Watch as Bacalhau intelligently and automatically distributes jobs to nodes as they become available:
Handling Increased Load
If you notice a lot of jobs in your queue, you can add more nodes—manually or automatically, including spot nodes — to handle the extra workload.
Deal with Fault Tolerance
If the node goes down, the system will automatically allow the job to finish. If the job finishes while disconnected, and you need to get an answer, the system will automatically reschedule!
Conclusion
Bacalhau’s new queueing system is a major advancement in distributed job management. With intelligent job distribution, fault tolerance, and easy scalability, it empowers you to handle complex computational challenges more efficiently and reliably.
We invite you to explore this new feature and see how it can transform your workflow. Whether you’re managing batch processing, real-time analytics, or any compute-intensive tasks, Bacalhau’s queueing system is designed to meet your needs.
How to Get Involved
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!