Truly Global Compute: Using Bacalhau to Execute Distributed DuckDB Queries Across Many Zones
Let’s explore how to deploy an application across multiple zones using Bacalhau and DuckDB. This will enable us to harness the capabilities of Bacalhau's network, leverage DuckDB's analytical power, and run cross-zone operations with ease. This tutorial will help you process and filter logs with Python scripts, utilize DuckDB for data collection, run operations on multiple nodes using concurrency flags, and store results in cloud storage buckets. All code below is available in our GitHub repo.
You can also follow along step-by-step in the accompanying video.
The Problem:
We have one node per Google Cloud zone, a total of 112 nodes, storing logs containing all types of information. In this example, the security warnings in the logs are what we care about, so we want to capture those warnings across all 112 nodes and put them into our own files, either to filter them or to move them to a central cluster for analysis.
We can log in to a single virtual machine (VM) and list the logs on that VM to get an idea of the data they contain. The command below uses an example username and an example log path:
ssh username@34.18.30.29 'ls /var/log/logs_to_process'
Hello World!
First, let’s run “Hello World!” on one machine:
bacalhau docker run -s zone=me-central1-a ubuntu echo "hello me-central1-a"
This command will go out and find the node to run on, schedule the job, and run it. In the results, you will find a command to get more details about the run. If you run that command, you will see how the job ran on your selected node, in this case me-central1-a.
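As a sketch, assuming the v1.x Bacalhau CLI used throughout this tutorial, those follow-up commands look something like this (the job ID is printed at submission time and will differ for every run):

# List recent jobs to recover the job ID
bacalhau list

# Show the full run details, including which node the job was scheduled on
bacalhau describe [JOB_ID]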
Now, to run the same command on all 112 nodes, one for each node in the network, all you need to do is add a target flag:
bacalhau docker run --target all ubuntu echo "hello world on 112 nodes"
Processing and Filtering Logs using DuckDB:
Since each VM generates various log files, it's crucial to be able to filter and process the relevant data. We can use DuckDB as a query layer on top of our log files, run it across many nodes, and collect only the useful information from the logs.
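To make that concrete, here is a minimal sketch of what such a log-processing script might look like. This is not the exact script inside the duckdb-log-processor image; it assumes the logs are newline-delimited JSON and that the input path and query arrive through the INPUTFILE and QUERY environment variables used in the job specs below.

import os
import duckdb

# Read the input path and SQL query passed in through the job spec.
input_file = os.environ["INPUTFILE"]
query = os.environ["QUERY"]

con = duckdb.connect()
# Expose the raw log file as a 'log_data' view so the query can reference it.
# We assume newline-delimited JSON logs here; adjust the reader to your format.
con.execute(f"CREATE VIEW log_data AS SELECT * FROM read_json_auto('{input_file}')")

# Run the supplied query and print matching rows to stdout,
# which Bacalhau captures as the job's output.
for row in con.execute(query).fetchall():
    print(row)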
We can pull the security entries we want out of a single log file by submitting a YAML job spec that contains a zone selector and a Python command. That command runs a script we have written over the log file, using a simple SQL SELECT statement in DuckDB:
Spec:
  NodeSelectors:
    - Key: zone
      Operator: =
      Values:
        - europe-west9-b
  EngineSpec:
    Params:
      EnvironmentVariables:
        - INPUTFILE=/var/log/logs_to_process/aperitivo_logs.log.1
        - QUERY=SELECT * FROM log_data WHERE message LIKE '%[SECURITY]%' ORDER BY '@timestamp'
      Image: docker.io/bacalhauproject/duckdb-log-processor:1.1
      WorkingDirectory: ""
    Type: docker
We can run this YAML file using the following:
cat job.yaml | bacalhau create
To download the job outputs, all you need to do is run:
bacalhau get [JOB_ID]
To do the same across all 112 nodes in the network, all you need to do is replace NodeSelectors with a Deal section containing the Concurrency and TargetingMode settings:
Spec:
  Deal:
    Concurrency: 1
    TargetingMode: true
  EngineSpec:
    Params:
      EnvironmentVariables:
        - INPUTFILE=/var/log/logs_to_process/aperitivo_logs.log.1
        - QUERY=SELECT * FROM log_data WHERE message LIKE '%[SECURITY]%' ORDER BY '@timestamp'
      Image: docker.io/bacalhauproject/duckdb-log-processor:1.1
      WorkingDirectory: ""
    Type: docker
Then, when you run the same cat command with the YAML file containing the concurrency flag, Bacalhau will go out and run the same command across all 112 nodes in the network.
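Concretely, the fan-out run is still the same two-step flow as before; assuming the v1.x CLI, the downloaded results should now contain the output from every node that ran the job:

cat job.yaml | bacalhau create
bacalhau get [JOB_ID]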
Storing Results:
Once the data processing is complete, it's essential to store the results securely. We can do this by utilizing Bacalhau and Google Cloud Storage buckets to store the processed log data. We can also configure appropriate access permissions for the storage buckets.
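As a rough sketch of that permissions step, assuming a hypothetical bucket name, project, and service account (substitute your own), the gsutil commands look something like this:

# Create a bucket for the processed log data (hypothetical names)
gsutil mb -l us-central1 gs://my-log-results

# Allow the nodes' service account to write objects into it
gsutil iam ch serviceAccount:bacalhau-node@my-project.iam.gserviceaccount.com:objectCreator gs://my-log-results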
Running Cross-Zone Operations with Bacalhau:
One of the advantages of using Bacalhau is the ability to execute commands across multiple zones with a single command. You can do this by using Bacalhau's concurrency flag to run the command across every node in the network. This enables parallel execution and saves time when running operations across multiple zones. Next, jump into “Processing Logs With Distributed Queries” using Bacalhau and DuckDB.