Bacalhau v1.7.0 - Day 5: Distributed Data Warehouse with Bacalhau and DuckDB
Build a distributed data warehouse with Bacalhau and DuckDB to run SQL queries on regional data without moving it.
This is part of the 5 Days of Bacalhau 1.7 series! Make sure to go back to the start to catch them all!
Day 2: Scaling Your Compute Jobs with Bacalhau Partitioned Jobs
Day 3: Streamlining Security: Simplifying Bacalhau's Authentication Model
Distributed Data Warehouse with Bacalhau and DuckDB
Many applications that rely on data warehouses need to keep data sources in different locations, whether for privacy or regulatory reasons or to keep processing close to the source. Even so, there are times when you want to analyze these data sources, individually and across regions, from one location without moving the data.
This post uses Bacalhau to orchestrate the distributed processing and DuckDB to provide the SQL storage and querying capabilities for mock sales data stored in the EU and the US.
Prerequisites
To reproduce this tutorial, you need the following prerequisites:
Docker with Docker Compose
The Bacalhau CLI
A clone of the example multi-region setup:
git clone https://github.com/bacalhau-project/examples.git
Architecture
The example Docker Compose file and Bacalhau job definitions in the repository simulate the following architecture:
Bacalhau Orchestrator: Central control plane for job distribution
Compute Nodes: Distributed across regions, running close to data
Regional Storage: Regional data stores, using MinIO in this setup
DuckDB: SQL query engine running on each compute node. Bacalhau has a custom image that adds several user-defined functions to help with partitioning large data sets across nodes based on the following methods:
partition_by_hash: Even distribution of files across partitions
partition_by_regex: Pattern-based partitioning
partition_by_date: Time-based partitioning
You can find more details on how these functions work in the custom image documentation.
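As a quick sketch of the usage pattern (the glob here is illustrative, and partition_by_regex and partition_by_date take additional arguments described in that documentation), each function returns only the files assigned to the current node's partition, which you can then feed to a reader:
-- Sketch: the UDF returns only this node's share of the matching files.
SET VARIABLE files = (
    SELECT LIST(file)
    FROM partition_by_hash('s3://local-bucket/**/*.jsonl')
);
-- Count the records in this partition's files.
SELECT count(*) AS records_in_partition
FROM read_json_auto(getvariable('files'));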
You can see each component’s setup in the Docker Compose file. Create the architecture by running the following command:
docker compose up -d
The Docker Compose file uses several Bacalhau configuration files, which you can find in the configuration folder. These label the compute nodes as US and EU nodes, respectively, and configure the nodes to write data to their regional MinIO buckets.
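As an illustration, a regional compute node's configuration labels the node roughly like this (a minimal sketch with assumed values, not a copy of the repository's files):
# Sketch of a US compute-node config; the orchestrator address and values are assumed.
NameProvider: uuid
Compute:
  Enabled: true
  Orchestrators:
    - nats://orchestrator:4222
Labels:
  region: us
Jobs can then target a region by selecting on the region label.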
Generate sample data
With the simulated infrastructure in place, you can now create sample data using the data generator job, which writes 3,000 records for each region in JSON format to the relevant MinIO bucket.
# Move to the jobs directory
cd ../jobs
Generate the US data:
bacalhau job run -V Region=us -V Events=3000 \
-V StartDate=2024-01-01 -V EndDate=2024-12-31 \
-V RotateInterval=month data-generator.yaml
Generate the EU data:
bacalhau job run -V Region=eu -V Events=3000 \
-V StartDate=2024-01-01 -V EndDate=2024-12-31 \
-V RotateInterval=month data-generator.yaml
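To confirm the records landed, you can list the bucket with any S3-compatible client. For example, with the AWS CLI pointed at MinIO (the endpoint and credentials below are assumptions based on a default local MinIO setup; adjust them to match the Compose file):
# Assumed default MinIO endpoint and credentials; adjust to your setup.
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
  aws --endpoint-url http://localhost:9000 s3 ls s3://local-bucket/ --recursive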
Accessing data for analysis
Bacalhau supports two ways to access the regional data:
Bacalhau input sources
InputSources:
  - Source:
      Type: s3
      Params:
        Bucket: local-bucket
        Key: "data/*"
    Target: /data
This method provides more control and preprocessing options and supports other source types in addition to S3.
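For context, here is a trimmed sketch of how an input source sits inside a full batch job that pins execution to US-labeled nodes (the job name, image, and mount path are illustrative; the real definitions are in the jobs folder):
# Illustrative job sketch, not one of the repository's actual job files.
Name: regional-sales-query
Type: batch
Count: 1
Constraints:
  - Key: region
    Operator: "="
    Values: ["us"]
Tasks:
  - Name: query
    Engine:
      Type: docker
      Params:
        Image: my-duckdb-image   # assumed; the examples use Bacalhau's custom DuckDB image
    InputSources:
      - Source:
          Type: s3
          Params:
            Bucket: local-bucket
            Key: "data/*"
        Target: /data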
Direct DuckDB access
SET VARIABLE files = (
SELECT LIST(file)
FROM partition_by_hash('s3://local-bucket/**/*.jsonl')
);
SELECT * FROM read_json_auto(getvariable('files'));
This method is a simpler and more familiar option for SQL-only jobs. The job definitions also use SQL queries to process data from an input source.
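With an input source, the task's SQL reads from the local mount rather than from S3. Assuming the source is mounted at /data as in the sketch above, the equivalent query is simply:
-- Reads files Bacalhau has already staged into the task's /data mount.
SELECT * FROM read_json_auto('/data/*.jsonl');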
Run analysis
With data in place, you can send analysis tasks as Bacalhau jobs. In each case, after running the job, use bacalhau job describe <job_id> to see the job results, passing the job ID from the output of the bacalhau job run command. All the examples use US data; you can change Region to eu to see the results from the EU region.
Monthly trend analysis
bacalhau job run -V Region=us monthly-trends.yaml
Sample output:
month | total_txns | revenue | unique_customers | avg_txn_value
------------|------------|----------|------------------|---------------
2024-03-01 | 3,421 | 178,932 | 1,245 | 52.30
2024-02-01 | 3,156 | 165,789 | 1,189 | 52.53
2024-01-01 | 2,987 | 152,456 | 1,023 | 51.04
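The exact query lives in monthly-trends.yaml; in spirit it is a straightforward monthly roll-up along these lines (the column names here are assumed, not taken from the job definition):
-- Sketch of a monthly roll-up; event_time, amount, and customer_id are assumed names.
SELECT
    date_trunc('month', event_time)  AS month,
    count(*)                         AS total_txns,
    round(sum(amount), 0)            AS revenue,
    count(DISTINCT customer_id)      AS unique_customers,
    round(avg(amount), 2)            AS avg_txn_value
FROM read_json_auto(getvariable('files'))
GROUP BY 1
ORDER BY month DESC;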
Operational monitoring
Hourly operations
Tracks operational health metrics
Monitors transaction success rates
Shows hourly patterns
bacalhau job run -V Region=us hourly-operations.yaml
Anomaly detection
Identifies unusual patterns
Uses statistical analysis
Flags significant deviations
bacalhau job run -V Region=us anomaly-detection.yaml
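A common way to flag deviations like this is a z-score over hourly counts. Here is a sketch of that approach (the real query is in anomaly-detection.yaml, and the column names are assumed):
-- Flag hours whose transaction count is more than 3 standard deviations from the mean.
WITH hourly AS (
    SELECT date_trunc('hour', event_time) AS hour,
           count(*)                       AS txns
    FROM read_json_auto(getvariable('files'))
    GROUP BY 1
)
SELECT hour, txns,
       (txns - avg(txns) OVER ()) / stddev(txns) OVER () AS z_score
FROM hourly
QUALIFY abs(z_score) > 3
ORDER BY hour;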
Business analytics
Analyzes category performance
Tracks market share
Shows sales patterns
bacalhau job run -V Region=us product-performance.yaml
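Market share per category is typically a window function over the grouped totals; for example (a sketch with assumed column names, not the query from product-performance.yaml):
-- Revenue and market share by category; category and amount are assumed names.
SELECT category,
       sum(amount)                                               AS revenue,
       round(100.0 * sum(amount) / sum(sum(amount)) OVER (), 2)  AS market_share_pct
FROM read_json_auto(getvariable('files'))
GROUP BY category
ORDER BY revenue DESC;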
Long-term trend analysis
Monthly aggregations
Key business metrics
bacalhau job run -V Region=us monthly-trends.yaml
Customer analysis
Customer segmentation (two-phase)
Phase 1: Compute local metrics
Phase 2: Combine and segment
# Run Phase 1
bacalhau job run -V Region=us customer-segments-phase1.yaml
# Note the job ID, then run Phase 2
bacalhau job run -V Region=us -V JobID=<phase1-job-id> customer-segments-phase2.yaml
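Inside customer-segments-phase2.yaml, the JobID variable lets phase 2 locate phase 1's output. A hypothetical sketch of that wiring, assuming phase 1 publishes its results to the regional bucket under its job ID (the key layout here is illustrative):
# Hypothetical: phase 2 reads phase 1's published results via the JobID variable.
InputSources:
  - Source:
      Type: s3
      Params:
        Bucket: local-bucket
        Key: "results/{{.JobID}}/*"
    Target: /phase1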
Summary
This post combined the distributed computing power of Bacalhau with the flexible SQL capabilities of DuckDB to create a distributed data warehouse spread across regions. The example Bacalhau jobs provide a range of analysis tasks, from operational monitoring to customer segmentation, all while keeping the data in its original location and using SQL to query data stored in S3-compatible buckets.
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us or get your license on Expanso Cloud!