Que? Queue! Advanced Queueing in Bacalhau

5 Days of Bacalhau (8 min)

David Aronchick, Sean M. Tracey, Laura Hohmann, and 4 others

Jul 03, 2024

In today’s fast-paced, interconnected world, managing computational resources across distributed networks presents unique challenges. As organizations expand globally and rely more on decentralized systems, they face increasing complexity in maintaining efficiency and performance.

If you’re dealing with:

Geographically Dispersed Resources: From edge devices in remote locations to data centers across continents.
Network Challenges: Varying levels of connectivity, from high-speed links to intermittent satellite connections.
Heterogeneous Hardware: A mix of cloud instances, on-premise servers, and specialized hardware.
Dynamic Workloads: Periods of low activity followed by intense bursts of computation.

You’re not alone. These factors create a perfect storm of complexity for job management and resource utilization. Traditional queuing systems often fall short, leading to inefficiencies and missed deadlines.

The Need for Advanced Queueing

That’s why we developed Bacalhau’s advanced job queueing system. It’s designed to tackle these challenges head-on, giving you unprecedented control and efficiency in managing large-scale, distributed workloads.

But That’s Not Me!

You might be thinking, “This is for advanced workloads; I’m fine as is.” But are you experiencing any of the following challenges?

Running your network in multiple zones and regions according to best practices?
Dealing with unreliable or intermittent network connections? (Hint: all networks are unreliable.)
Handling highly bursty batch jobs that overwhelm your system?
Struggling to efficiently utilize resources across different time zones?
Balancing workloads on a mix of cloud and on-premise hardware?
Needing to prioritize critical jobs amidst a sea of routine tasks?

If you nodded to any of these, Bacalhau’s queueing system could be the solution you’ve been looking for. Today, we’re excited to introduce a feature that addresses these common yet challenging scenarios in distributed computing environments.

Key Features of Bacalhau's Queueing System

Dynamic Job Prioritization
Intelligent Load Balancing
Fault Tolerance and Job Recovery
Scalable Queue Management
Real-time Monitoring and Reporting

Let's walk through a demonstration of these features in action!

Configuring Job Queuing in your Network

The job queuing feature is automatically enabled. To configure it, simply set a QueueTimeout parameter. NOTE: If this parameter isn’t set, the system assumes the job needs to be executed immediately and will not queue it.

Node availability in your network is determined by capacity as well as job constraints such as label selectors, engines, or publishers. For example, jobs will be queued if all nodes are currently busy or if idle nodes do not meet the parameters specified in your job.
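To make the queue-or-run decision concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not Bacalhau's scheduler: the `Node` class, the `suitable_nodes` and `placement` functions, and the `"not_queued"` outcome are hypothetical names, and capacity is reduced to CPU alone.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node model: free CPU cores plus a label map.
    name: str
    free_cpu: float
    labels: dict = field(default_factory=dict)


def suitable_nodes(nodes, cpu_needed, selectors):
    """Return nodes with enough free CPU whose labels match every selector."""
    matches = []
    for node in nodes:
        has_capacity = node.free_cpu >= cpu_needed
        labels_match = all(node.labels.get(k) == v for k, v in selectors.items())
        if has_capacity and labels_match:
            matches.append(node)
    return matches


def placement(nodes, cpu_needed, selectors, queue_timeout_s):
    """Decide what happens to a job: schedule it, queue it, or do not queue it."""
    if suitable_nodes(nodes, cpu_needed, selectors):
        return "schedule"
    # No suitable node: the job is queued only if a queue timeout was set.
    return "queue" if queue_timeout_s > 0 else "not_queued"


nodes = [
    Node("n-1", free_cpu=0.2, labels={"owner": "bacalhau"}),
    Node("n-2", free_cpu=4.0, labels={"owner": "bacalhau"}),
]
print(placement(nodes, 1.0, {"owner": "bacalhau"}, 1800))  # a node fits
print(placement(nodes, 1.0, {"name": "walid"}, 1800))      # no label match
print(placement(nodes, 1.0, {"name": "walid"}, 0))         # no queue timeout
```

The third call mirrors the note above: with no queue timeout, a job that cannot be placed immediately is not queued.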

Note: Your Bacalhau Compute Nodes update their node, resource, and health information every 30 seconds to the Requester Nodes in the network. During this update period, multiple jobs may be allocated to a node. Each node will take as many jobs as allowed by its scheduling limits and store them in a local job queue created on your Compute Node.
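The local queue on a compute node can be pictured with a small model. This is a sketch under stated assumptions rather than Bacalhau's implementation: the `ComputeNode` class, its `offer` and `finish` methods, and the idea of a separate CPU budget for queued jobs are hypothetical, and only CPU is tracked.

```python
from collections import deque


class ComputeNode:
    """Hypothetical compute node with a run capacity and a local queue capacity."""

    def __init__(self, cpu_capacity, queue_cpu_capacity):
        self.free_cpu = cpu_capacity
        self.free_queue_cpu = queue_cpu_capacity
        self.running = []           # (job_id, cpu) pairs currently executing
        self.local_queue = deque()  # (job_id, cpu) pairs waiting on this node

    def offer(self, job_id, cpu):
        """Run the job if it fits, else queue it locally, else reject it."""
        if cpu <= self.free_cpu:
            self.free_cpu -= cpu
            self.running.append((job_id, cpu))
            return "running"
        if cpu <= self.free_queue_cpu:
            self.free_queue_cpu -= cpu
            self.local_queue.append((job_id, cpu))
            return "queued locally"
        return "rejected"

    def finish(self, job_id):
        """Release a finished job's CPU and start queued jobs that now fit."""
        for entry in self.running:
            if entry[0] == job_id:
                self.running.remove(entry)
                self.free_cpu += entry[1]
                break
        while self.local_queue and self.local_queue[0][1] <= self.free_cpu:
            queued_id, cpu = self.local_queue.popleft()
            self.free_queue_cpu += cpu
            self.free_cpu -= cpu
            self.running.append((queued_id, cpu))


node = ComputeNode(cpu_capacity=2, queue_cpu_capacity=2)
for job_id, cpu in [("a", 2), ("b", 1), ("c", 1), ("d", 1)]:
    print(job_id, node.offer(job_id, cpu))
node.finish("a")
print([job_id for job_id, _ in node.running])
```

Once job `a` finishes, the two locally queued jobs start in the order they arrived, while `d` was turned away because both budgets were full.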

How does it work?

At the requester node level, you can set default queuing behavior for all jobs by defining the QueueTimeout parameter in the node's configuration file. Alternatively, within the job specification, you can include the QueueTimeout parameter directly in the configuration YAML. This flexibility allows you to tailor the queuing behavior to meet the specific needs of your distributed computing environment, ensuring that jobs are efficiently managed and resources are optimally utilized.
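The relationship between the two settings can be sketched as a simple fallback. This assumes that a value in the job specification takes precedence over the requester default, which is how defaults usually behave but is not spelled out here. The function name `effective_queue_timeout` is hypothetical.

```python
from typing import Optional


def effective_queue_timeout(job_value: Optional[int],
                            requester_default: Optional[int]) -> int:
    """Return the queue timeout in seconds; 0 means the job is not queued."""
    if job_value is not None:
        # Assumption: a per-job QueueTimeout overrides the requester default.
        return job_value
    if requester_default is not None:
        return requester_default
    # Neither is set: the job is expected to run immediately.
    return 0


print(effective_queue_timeout(600, 1800))    # job spec wins
print(effective_queue_timeout(None, 1800))   # requester default applies
print(effective_queue_timeout(None, None))   # no queueing
```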

Requester Node

The following commands configure your requester node. They set the default queue timeout to thirty minutes, the default total timeout (which also covers execution time) to one hour, and the interval between execution retry attempts to one minute:

bacalhau config set node.requester.jobdefaults.queuetimeout 30m
bacalhau config set node.requester.jobdefaults.totaltimeout 1h
bacalhau config set node.requester.scheduler.queuebackoff 1m

To start this requester node, run the following command:

bacalhau serve --node-type requester

Submitting Your First Queued Job

Here’s a sample job specification setting the QueueTimeout for this specific job.

Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - sleep 90
    Timeouts:
      TotalTimeout: 3600
      QueueTimeout: 1800

Once submitted to the network, the job stays in “Pending” until it is scheduled or it times out. To get the status of any job, type:

bacalhau job describe <Job ID>

This will give you a full overview of the job and its executions on your network.

ID            = j-489422d0-08bb-473f-a320-4c7126e5f111
Name          = j-489422d0-08bb-473f-a320-4c7126e5f111
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-07-03 11:24:01
Modified Time = 2024-07-03 11:25:35
Version       = 0

Summary
Completed = 1

Job History

 TIME                 REV.  STATE      TOPIC       EVENT
 2024-07-03 11:24:01  1     Pending    Submission  Job submitted
 2024-07-03 11:24:02  2     Running
 2024-07-03 11:25:35  3     Completed

Executions

 ID          NODE ID   STATE      DESIRED  REV.  CREATED    MODIFIED  COMMENT
 e-eb35b837  QmbxGSsM  Completed  Stopped  6     2m23s ago  49s ago   Accepted job

Execution e-eb35b837 History

 TIME                 REV.  STATE              TOPIC            EVENT
 2024-07-03 11:24:02  1     New
 2024-07-03 11:24:02  2     AskForBid
 2024-07-03 11:24:02  3     AskForBidAccepted  Requesting Node  Accepted job
 2024-07-03 11:24:02  4     AskForBidAccepted
 2024-07-03 11:24:02  5     BidAccepted
 2024-07-03 11:25:35  6     Completed

CLI Command

You can also define timeouts for your jobs directly through the CLI using the --queue-timeout flag. This method provides a convenient way to specify queuing behavior on a per-job basis, allowing you to manage job execution dynamically without modifying configuration files.

For example, here is how you can submit a job with a specified queue timeout using the CLI:

bacalhau docker run --timeout 3600 --queue-timeout 1800 ubuntu sleep 90

Note: Timeouts in Bacalhau are generally governed by the TotalTimeout value for your YAML specifications and the --timeout flag for your CLI commands. The default total timeout value is 30 minutes. Declaring any queue timeout that is larger than that without changing the total timeout value will result in a validation error.
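The rule in this note can be expressed as a small check. The sketch below is not Bacalhau's validation code: `validate_timeouts` and `DEFAULT_TOTAL_TIMEOUT_S` are hypothetical names, and the error type and message are placeholders. It only encodes what the note states, namely a 30-minute default total timeout and an error when the queue timeout is larger than the total timeout.

```python
DEFAULT_TOTAL_TIMEOUT_S = 30 * 60  # default total timeout from the note: 30 minutes


def validate_timeouts(queue_timeout_s, total_timeout_s=None):
    """Return the total timeout in effect, or raise if the queue timeout exceeds it."""
    total = DEFAULT_TOTAL_TIMEOUT_S if total_timeout_s is None else total_timeout_s
    if queue_timeout_s > total:
        raise ValueError(
            f"queue timeout {queue_timeout_s}s is larger than total timeout {total}s"
        )
    return total


# Matches the job above: 1800s queue timeout with a 3600s total timeout.
print(validate_timeouts(1800, 3600))

# A one-hour queue timeout with the default total timeout is rejected.
try:
    validate_timeouts(3600)
except ValueError as err:
    print("validation error:", err)
```

In practice this means raising the total timeout whenever you ask for a queue timeout longer than 30 minutes.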

Executing Job Queueing in Bacalhau

Jobs are queued when all available nodes are busy or when no node matches your job specification. Let’s look at how queuing plays out within your network.

Queued jobs initially display the Queued status. The bacalhau job describe command shows both the state of the job and the reason it was queued.

For busy nodes:

ID            = j-d740ba46-b135-4161-bd79-795c94d215b0
Name          = j-d740ba46-b135-4161-bd79-795c94d215b0
Namespace     = default
Type          = batch
State         = Queued

Message       = Job queued. not enough nodes to run job. requested: 1, available: 3, suitable: 0.

• Node n-b75224b7: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 79 GB, GPU: 0} and queue capacity {CPU: 2, Memory: 4.0 GB, Disk: 0 B, GPU: 0}

• Node n-d42422fd: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 83 GB, GPU: 0} and queue capacity {CPU: 3, Memory: 1.0 GB, Disk: 0 B, GPU: 0}

• Node n-f50db1f9: node busy with available capacity {CPU: 0.20000000000000018, Memory: 12 GB, Disk: 83 GB, GPU: 0}

For no matching nodes in the network:

ID            = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Name          = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Namespace     = default
Type          = batch
State         = Queued

Message       = Job queued. not enough nodes to run job. requested: 1, available: 4, suitable: 0.

• 3 of 4 nodes: labels map[Architecture:amd64 Operating-System:linux owner:bacalhau] don't match required selectors [name = walid]

• Node Qma5yQAk: labels map[Architecture:amd64 GPU-0:Tesla-T4 GPU-0-Memory:15360-MiB Operating-System:linux owner:bacalhau] don't match required selectors [name = walid]

Once suitable node resources become available, these jobs move to the Running and then the Completed status, which frees capacity for more jobs to be assigned to matching nodes.

ID            = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Name          = j-0dda82b7-ad5a-4b96-b675-728c5f54f4c9
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-06-24 13:36:40
Modified Time = 2024-06-24 13:41:40
Version       = 0

Summary

Completed = 1

Job History

 TIME                 REV.  STATE      TOPIC       EVENT
 2024-06-24 13:36:40  1     Pending    Submission  Job submitted
 2024-06-24 13:36:40  2     Queued     Queueing    Job queued. not
                                                   enough nodes to run
                                                   job. 
 2024-06-24 13:39:40  3     Running
 2024-06-24 13:41:40  4     Completed

Executions

 ID          NODE ID     STATE      DESIRED  REV.  CREATED   MODIFIED  COMMENT
 e-88cb1c72  n-73426e31  Completed  Stopped  6     6m5s ago  4m4s ago  Accepted job

Execution e-88cb1c72 History

 TIME                 REV.  STATE              TOPIC            EVENT
 2024-06-24 13:39:40  1     New
 2024-06-24 13:39:40  2     AskForBid
 2024-06-24 15:39:40  3     AskForBidAccepted  Requesting Node  Accepted job
 2024-06-24 13:39:40  4     AskForBidAccepted
 2024-06-24 13:39:40  5     BidAccepted
 2024-06-24 13:41:40  6     Completed

But Wait There’s More!

Job queuing is not just about more reliable and resilient scheduling. You can:

Vary the Job Priorities

Just set the Priority field in your Job file and you can schedule jobs in priority order:

Type: batch
Count: 1
Priority: 1 # Default is 0, which is the lowest priority
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - sleep 90
    Timeouts:
      TotalTimeout: 3600
      QueueTimeout: 1800

Requester nodes distribute queued jobs in priority order, and if those jobs are subsequently queued for execution on a compute node, they are again executed in priority order. Higher Priority values run first: Priority 5 is executed before Priority 4, 4 before 3, and so on.
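Priority ordering of this kind is commonly built on a heap. The following is a minimal sketch of the ordering only, not Bacalhau's internal queue. The `PriorityJobQueue` class is a hypothetical name, and the first-in, first-out tie-break between jobs of equal priority is an assumption made for the example.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Pops the highest Priority first; equal priorities pop in arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # arrival order, used as the tie-break

    def push(self, job_id, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PriorityJobQueue()
queue.push("routine-a")              # default priority 0
queue.push("critical", priority=5)
queue.push("routine-b")
queue.push("important", priority=4)

order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['critical', 'important', 'routine-a', 'routine-b']
```

The critical job jumps ahead of the routine jobs submitted before it, which is the behavior the Priority field is meant to provide.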

Dynamic Job Distribution

Bacalhau intelligently and automatically distributes jobs to nodes as they become available.

Handling Increased Load

If you notice a lot of jobs in your queue, you can add more nodes, manually or automatically and including spot nodes, to handle the extra workload.

Deal with Fault Tolerance

If a node disconnects from the network, the job running on it is allowed to finish. If the job finishes while the node is still disconnected and you need the result, the system automatically reschedules the job.

Conclusion

Bacalhau’s new queueing system is a major advancement in distributed job management. With intelligent job distribution, fault tolerance, and easy scalability, it empowers you to handle complex computational challenges more efficiently and reliably.

We invite you to explore this new feature and see how it can transform your workflow. Whether you’re managing batch processing, real-time analytics, or any compute-intensive tasks, Bacalhau’s queueing system is designed to meet your needs.

How to Get Involved

We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you.

Commercial Support

While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!