Bacalhau v1.7.0 - Day 5: Distributed Data Warehouse with Bacalhau and DuckDB
Build a distributed data warehouse with Bacalhau and DuckDB to run SQL queries on regional data without moving it.
This is part of the 5 Days of Bacalhau 1.7 series! Make sure to go back to the start to catch them all!
Day 2: Scaling Your Compute Jobs with Bacalhau Partitioned Jobs
Day 3: Streamlining Security: Simplifying Bacalhau's Authentication Model
Distributed Data Warehouse with Bacalhau and DuckDB
Many applications that rely on data warehouses need to keep data sources in different locations, whether for privacy or regulatory reasons or to keep processing close to the source. Even so, there are times when you want to analyze these data sources, individually and across regions, from one location without moving the data.
This post uses Bacalhau to orchestrate the distributed processing and DuckDB to provide the SQL storage and querying capabilities for mock sales data stored in the EU and the US.
Prerequisites
To reproduce this tutorial, you need the following prerequisites:
Docker with Docker Compose
The Bacalhau CLI
A clone of the example multi-region setup:
git clone https://github.com/bacalhau-project/examples.git
Architecture
The example Docker Compose file and Bacalhau job definitions in the repository simulate the following architecture:
Bacalhau Orchestrator: Central control plane for job distribution
Compute Nodes: Distributed across regions, running close to data
Regional Storage: Regional data stores, using MinIO in this setup
DuckDB: SQL query engine running on each compute node. Bacalhau has a custom image that adds several user-defined functions to help with partitioning large data sets across nodes based on the following methods:
partition_by_hash: Even distribution of files across partitions
partition_by_regex: Pattern-based partitioning
partition_by_date: Time-based partitioning
You can find more details on how these functions work in the custom image documentation.
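As a quick sketch of the usage pattern (the glob here is illustrative, and partition_by_regex and partition_by_date take additional arguments described in that documentation), each function returns only the files assigned to the current node's partition, which you can then feed to a reader:
-- Sketch: the UDF returns only this node's share of the matching files.
SET VARIABLE files = (
    SELECT LIST(file)
    FROM partition_by_hash('s3://local-bucket/**/*.jsonl')
);
-- Count the records in this partition's files.
SELECT count(*) AS records_in_partition
FROM read_json_auto(getvariable('files'));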
You can see each component’s setup in the Docker Compose file. Create the architecture by running the following command:
docker compose up -d
The Docker Compose file uses several Bacalhau configuration files, which you can find in the configuration folder. These label the compute nodes as US and EU nodes, respectively, and configure the nodes to write data to their regional MinIO buckets.
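As an illustration, a regional compute node's configuration labels the node roughly like this (a minimal sketch with assumed values, not a copy of the repository's files):
# Sketch of a US compute-node config; the orchestrator address and values are assumed.
NameProvider: uuid
Compute:
  Enabled: true
  Orchestrators:
    - nats://orchestrator:4222
Labels:
  region: us
Jobs can then target a region by selecting on the region label.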
Generate sample data
With the simulated infrastructure in place, you can now create sample data using the data generator job, which writes 3,000 records for each region in JSON format to the relevant MinIO bucket.
# Move to the jobs directory
cd ../jobs
Generate the US data:
bacalhau job run -V Region=us -V Events=3000 \
-V StartDate=2024-01-01 -V EndDate=2024-12-31 \
-V RotateInterval=month data-generator.yaml
Generate the EU data:
bacalhau job run -V Region=eu -V Events=3000 \
-V StartDate=2024-01-01 -V EndDate=2024-12-31 \
-V RotateInterval=month data-generator.yaml
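To confirm the records landed, you can list the bucket with any S3-compatible client. For example, with the AWS CLI pointed at MinIO (the endpoint and credentials below are assumptions based on a default local MinIO setup; adjust them to match the Compose file):
# Assumed default MinIO endpoint and credentials; adjust to your setup.
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
  aws --endpoint-url http://localhost:9000 s3 ls s3://local-bucket/ --recursive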
Accessing data for analysis
Bacalhau supports two ways to access the regional data:
Bacalhau input sources
InputSources:
  - Source:
      Type: s3
      Params:
        Bucket: local-bucket
        Key: "data/*"
    Target: /data
This method provides more control and preprocessing options and supports other source types in addition to S3.
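For context, here is a trimmed sketch of how an input source sits inside a full batch job that pins execution to US-labeled nodes (the job name, image, and mount path are illustrative; the real definitions are in the jobs folder):
# Illustrative job sketch, not one of the repository's actual job files.
Name: regional-sales-query
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values: ["us"]
Tasks:
  - Name: query
    Engine:
      Type: docker
      Params:
        Image: my-duckdb-image   # assumed; the examples use Bacalhau's custom DuckDB image
    InputSources:
      - Source:
          Type: s3
          Params:
            Bucket: local-bucket
            Key: "data/*"
        Target: /data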
Direct DuckDB access
SET VARIABLE files = (
SELECT LIST(file)
FROM partition_by_hash('s3://local-bucket/**/*.jsonl')
);
SELECT * FROM read_json_auto(getvariable('files'));
This method is a simpler and more familiar option for SQL-only jobs. The job definitions also use SQL queries to process data from an input source.
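With an input source, the task's SQL reads from the local mount rather than from S3. Assuming the source is mounted at /data as in the sketch above, the equivalent query is simply:
-- Reads files Bacalhau has already staged into the task's /data mount.
SELECT * FROM read_json_auto('/data/*.jsonl');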
Run analysis
With data in place, you can send analysis tasks as Bacalhau jobs. In each case, after running the job, use bacalhau job describe <job_id> to see the job results, passing the job ID from the output of the bacalhau job run command. All the examples use US data; you can change Region to eu to see the results from the EU region.
Monthly trend analysis
bacalhau job run -V Region=us monthly-trends.yaml
Sample output:
month | total_txns | revenue | unique_customers | avg_txn_value
------------|------------|----------|------------------|---------------
2024-03-01 | 3,421 | 178,932 | 1,245 | 52.30
2024-02-01 | 3,156 | 165,789 | 1,189 | 52.53
2024-01-01 | 2,987 | 152,456 | 1,023 | 51.04
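The exact query lives in monthly-trends.yaml; in spirit it is a straightforward monthly roll-up along these lines (the column names here are assumed, not taken from the job definition):
-- Sketch of a monthly roll-up; event_time, amount, and customer_id are assumed names.
SELECT
    date_trunc('month', event_time)  AS month,
    count(*)                         AS total_txns,
    round(sum(amount), 0)            AS revenue,
    count(DISTINCT customer_id)      AS unique_customers,
    round(avg(amount), 2)            AS avg_txn_value
FROM read_json_auto(getvariable('files'))
GROUP BY 1
ORDER BY month DESC;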
Operational monitoring
Hourly operations
Tracks operational health metrics
Monitors transaction success rates
Shows hourly patterns
bacalhau job run -V Region=us hourly-operations.yaml
Anomaly detection
Identifies unusual patterns
Uses statistical analysis
Flags significant deviations
bacalhau job run -V Region=us anomaly-detection.yaml
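A common way to flag deviations like this is a z-score over hourly counts. Here is a sketch of that approach (the real query is in anomaly-detection.yaml, and the column names are assumed):
-- Flag hours whose transaction count is more than 3 standard deviations from the mean.
WITH hourly AS (
    SELECT date_trunc('hour', event_time) AS hour,
           count(*)                       AS txns
    FROM read_json_auto(getvariable('files'))
    GROUP BY 1
)
SELECT hour, txns,
       (txns - avg(txns) OVER ()) / stddev(txns) OVER () AS z_score
FROM hourly
QUALIFY abs(z_score) > 3
ORDER BY hour;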
Business analytics
Analyzes category performance
Tracks market share
Shows sales patterns
bacalhau job run -V Region=us product-performance.yaml
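Market share per category is typically a window function over the grouped totals; for example (a sketch with assumed column names, not the query from product-performance.yaml):
-- Revenue and market share by category; category and amount are assumed names.
SELECT category,
       sum(amount)                                               AS revenue,
       round(100.0 * sum(amount) / sum(sum(amount)) OVER (), 2)  AS market_share_pct
FROM read_json_auto(getvariable('files'))
GROUP BY category
ORDER BY revenue DESC;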
Long-term trend analysis
Monthly aggregations
Key business metrics
bacalhau job run -V Region=us monthly-trends.yaml
Customer analysis
Customer segmentation (two-phase)
Phase 1: Compute local metrics
Phase 2: Combine and segment
# Run Phase 1
bacalhau job run -V Region=us customer-segments-phase1.yaml
# Note the job ID, then run Phase 2
bacalhau job run -V Region=us -V JobID=<phase1-job-id> customer-segments-phase2.yaml
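Inside customer-segments-phase2.yaml, the JobID variable lets phase 2 locate phase 1's output. A hypothetical sketch of that wiring, assuming phase 1 publishes its results to the regional bucket under its job ID (the key layout here is illustrative):
# Hypothetical: phase 2 reads phase 1's published results via the JobID variable.
InputSources:
  - Source:
      Type: s3
      Params:
        Bucket: local-bucket
        Key: "results/{{.JobID}}/*"
    Target: /phase1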
Summary
This post combined the distributed computing power of Bacalhau with the flexible SQL capabilities of DuckDB to create a distributed data warehouse spread across regions. The example Bacalhau jobs provide a range of analysis tasks, from operational monitoring to customer segmentation, all while keeping the data in its original location and using SQL to query data stored in S3-compatible buckets.
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us or get your license on Expanso Cloud!