How We Processed 46 Million Rows Across 20 Nodes Without Breaking the Bank

With Bacalhau, you can process huge volumes of data in a fraction of the time it would take with other approaches. Read on for one approach we've taken!

Mandy Moore, David Aronchick, and Sean M. Tracey · Feb 06, 2025

In the fast-paced world of data engineering, efficiency, scalability, and cost-effectiveness are everything. What if you could process millions of rows of data across a distributed system without relying on a central data warehouse? Expanso’s Bacalhau and DuckDB make this not only possible but also surprisingly simple.

In this post, we'll show how Expanso and Bacalhau can improve your data processing pipeline, from handling raw data to making your uploads compliance-ready, all in a fraction of the time and cost of a traditional pipeline.

The Problem? Centralized Systems Are Holding Us Back

Centralized data pipelines are the go-to approach for many organizations, but they come with significant challenges that can hinder scalability, efficiency, and cost-effectiveness. Let’s break down the issues:

  1. High Data Transfer Costs: Every time raw data is moved to a central warehouse for processing, it incurs bandwidth and storage expenses. For example, companies using platforms like Snowflake or Databricks often face steep charges for transferring and querying large datasets, especially when processing logs or sensor data from distributed sources like IoT devices.

    Imagine a fleet of IoT devices generating terabytes of log data daily—shipping all of that to a central location for processing can quickly eat into your budget.

  2. Bottlenecks During Heavy Workloads: A centralized system can easily become overwhelmed during periods of peak activity, slowing down queries and delaying insights. This is particularly true for real-time analytics, where speed is critical. Competitors like Amazon Redshift or Google BigQuery, while powerful, often struggle with latency when processing massive concurrent queries.

    For example, a marketing platform analyzing millions of user interactions in real-time might see significant delays during high-traffic events like Black Friday.

  3. Time-to-Insight Increases with Massive Datasets: Centralized architectures require raw data to be collected, transferred, and processed before insights can be derived. This delay grows with dataset size, creating lag that impacts decision-making. Data warehouses like Snowflake may excel at long-term storage but can slow down when processing raw, unoptimized logs from distributed systems.

    Consider an e-commerce company analyzing server logs for security breaches. Waiting hours—or even minutes—for centralized processing could mean the difference between proactive prevention and costly downtime.

These limitations aren’t just inconvenient; they can directly impact business outcomes, from increased costs to missed opportunities. Enter Expanso’s Bacalhau and DuckDB: a solution designed to tackle these challenges head-on by enabling distributed log processing directly at the nodes, significantly reducing overhead and boosting efficiency.

The Solution: Distributed Orchestration with Bacalhau

Bacalhau is a distributed orchestration system that simplifies managing nodes across on-prem and multiple cloud providers like AWS, GCP, and Azure through a unified control plane. It enables jobs to run directly on individual nodes, reducing the reliance on centralized data processing and ensuring efficiency where it’s needed most.

Example Workflow:

  1. A cluster of 20 nodes is set up across AWS, GCP, and Azure.

  2. A simple task, like running a containerized command, is executed on all nodes simultaneously.

  3. The results are aggregated and logged with minimal latency.
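The workflow above boils down to one CLI invocation per job. As a rough sketch (the helper name, container image, and command are illustrative assumptions, not the exact job Expanso ran), composing a Bacalhau call that runs a containerized command on all nodes might look like this:

```python
# Sketch: composing a `bacalhau docker run` invocation that targets every
# node in the cluster. The helper name, image, and command are assumptions;
# check the Bacalhau CLI docs for the flags your version supports.
import shlex

def compose_run_everywhere(image: str, command: list[str]) -> str:
    """Build a Bacalhau CLI call that runs a container on all matching nodes."""
    args = [
        "bacalhau", "docker", "run",
        "--target", "all",   # schedule on all nodes, not just one
        image, "--",
    ] + command
    return " ".join(shlex.quote(a) for a in args)

cmd = compose_run_everywhere("ubuntu:22.04", ["echo", "hello from a node"])
print(cmd)
```

Because the control plane spans AWS, GCP, and Azure, the same invocation fans out to all 20 nodes without any per-cloud scripting.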

Efficient Local Processing with DuckDB

DuckDB, a lightweight and fast SQL engine, is the perfect companion to Bacalhau. It enables local-first data processing, reducing the volume of raw data sent to centralized systems like BigQuery.

Key Steps in the Workflow:

  1. Raw Log Uploads: Initially, all nodes upload unprocessed log files to BigQuery. While effective, this generates millions of rows of raw data, making queries complex.

  2. Schema-Based Uploads: By preprocessing logs locally with DuckDB, nodes structure data into user IDs, timestamps, and error categories before uploading.

  3. Aggregation and Filtering: Instead of uploading every row, nodes aggregate data, retaining only critical information like emergencies and key metrics.

This approach drastically reduces the volume of data processed centrally, improving both speed and scalability.

Compliance Made Easy: Sanitizing Sensitive Data

A major challenge in distributed systems is handling Personally Identifiable Information (PII), such as IP addresses. The workflow included a simple Python function to sanitize IPs by zeroing out the last octet, ensuring compliance with regulations like GDPR.
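The post doesn't include the function itself, so here is a minimal version of what such an IP sanitizer might look like (IPv4 only; the name and signature are assumptions):

```python
# Sketch of the IP-sanitizing step: zero out the last octet of an IPv4
# address so device-level identity is removed before upload. The function
# name and signature are assumptions, not the exact code from the workflow.
def sanitize_ip(ip: str) -> str:
    """Replace the final octet of an IPv4 address with 0."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(octets[:3] + ["0"])

print(sanitize_ip("203.0.113.42"))  # → 203.0.113.0
```

Zeroing the final octet keeps subnet-level signal useful for analytics while dropping the device-level identifier.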

Real-World Results

In just a few minutes, Expanso’s system processed:

  • 46 million rows of data across 20 nodes.

  • Structured and sanitized data ready for analysis in BigQuery.

  • Aggregated metrics and actionable insights for real-time monitoring.

All this was achieved without the need for a costly central data warehouse, highlighting the power of distributed systems.

Why This Matters

Distributed orchestration with Bacalhau and DuckDB offers a smarter, faster, and more cost-effective alternative to traditional pipelines. Here’s why it’s a game-changer:

  • Cost Savings: Use existing compute resources instead of scaling centralized infrastructure.

  • Speed: Process data locally for real-time insights.

  • Flexibility: Adapt workflows for different data types, from raw logs to structured metrics.

Get Involved!

Expanso’s tools and templates make it easy to get started. Check out our public GitHub repository for examples and guides, and join the conversation on social media with #ExpansoInAction.

Have a unique use case? We’d love to hear about it! Share your projects and ideas, and let’s build the future of distributed systems together.

There are many ways to contribute and get in touch, and we’d love to hear from you! Please reach out to us at any of the following locations.

  • Expanso website

  • Bacalhau website

  • Bacalhau on Bluesky

  • Bacalhau on Twitter

  • Expanso on Twitter

  • Slack

  • LinkedIn

  • Careers page

Commercial Support

While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!


A guest post by Mandy Moore, breaking down the complexities of tech, marketing, and community building, one post at a time.


© 2025 Expanso