How We Processed 46 Million Rows Across 20 Nodes Without Breaking the Bank
With Bacalhau, you can process huge volumes of data in a fraction of the time it would take with traditional pipelines. Find out one approach that we've taken!
In the fast-paced world of data engineering, efficiency, scalability, and cost-effectiveness are everything. What if you could process millions of rows of data across a distributed system without relying on a central data warehouse? Expanso’s Bacalhau and DuckDB make this not only possible but also surprisingly simple.
In this blog, we'll showcase how Bacalhau and DuckDB can improve your data processing pipeline, from handling raw data to making your uploads compliance-ready - all in a fraction of the time and cost of traditional pipelines.
The Problem? Centralized Systems Are Holding Us Back
Centralized data pipelines are the go-to approach for many organizations, but they come with significant challenges that can hinder scalability, efficiency, and cost-effectiveness. Let’s break down the issues:
High Data Transfer Costs: Every time raw data is moved to a central warehouse for processing, it incurs bandwidth and storage expenses. For example, companies using platforms like Snowflake or Databricks often face steep charges for transferring and querying large datasets, especially when processing logs or sensor data from distributed sources like IoT devices.
Imagine a fleet of IoT devices generating terabytes of log data daily—shipping all of that to a central location for processing can quickly eat into your budget.

Bottlenecks During Heavy Workloads: A centralized system can easily become overwhelmed during periods of peak activity, slowing down queries and delaying insights. This is particularly true for real-time analytics, where speed is critical. Platforms like Amazon Redshift or Google BigQuery, while powerful, often struggle with latency when processing massive concurrent queries.
For example, a marketing platform analyzing millions of user interactions in real time might see significant delays during high-traffic events like Black Friday.

Time-to-Insight Increases with Massive Datasets: Centralized architectures require raw data to be collected, transferred, and processed before insights can be derived. This delay grows with larger datasets, creating lag times that impact decision-making. Data warehouses like Snowflake may excel at long-term storage but can slow down when processing raw, unoptimized logs from distributed systems.
Consider an e-commerce company analyzing server logs for security breaches. Waiting hours—or even minutes—for centralized processing could mean the difference between proactive prevention and costly downtime.
These limitations aren’t just inconvenient; they can directly impact business outcomes, from increased costs to missed opportunities. Enter Expanso’s Bacalhau and DuckDB: a solution designed to tackle these challenges head-on by enabling distributed log processing directly at the nodes, significantly reducing overhead and boosting efficiency.
The Solution: Distributed Orchestration with Bacalhau
Bacalhau is a distributed orchestration system that simplifies managing nodes across on-prem and multiple cloud providers like AWS, GCP, and Azure through a unified control plane. It enables jobs to run directly on individual nodes, reducing the reliance on centralized data processing and ensuring efficiency where it’s needed most.
Example Workflow:
A cluster of 20 nodes is set up across AWS, GCP, and Azure.
A simple task, like running a containerized command, is executed on all nodes simultaneously.
The results are aggregated and logged with minimal latency.
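The broadcast step in this workflow can be sketched as a Bacalhau job spec. The field names below follow the v1.x YAML job schema, and the image and command are placeholders; treat the exact layout as an assumption and check the Bacalhau documentation for your version:

```yaml
# hello.yaml - run one containerized command on every node in the cluster.
# Job type "ops" asks Bacalhau to execute the task on all matching nodes.
Name: hello-all-nodes
Type: ops
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/echo
        Parameters:
          - "hello from this node"
```

Submitted with `bacalhau job run hello.yaml`, a spec like this runs once per node, and the per-node results can then be collected and logged centrally.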
Efficient Local Processing with DuckDB
DuckDB, a lightweight and fast SQL engine, is the perfect companion to Bacalhau. It enables local-first data processing, reducing the volume of raw data sent to centralized systems like BigQuery.
Key Steps in the Workflow:
Raw Log Uploads: Initially, all nodes upload unprocessed log files to BigQuery. This works, but it generates millions of rows of raw data and makes downstream queries slow and complex.
Schema-Based Uploads: By preprocessing logs locally with DuckDB, nodes structure data into user IDs, timestamps, and error categories before uploading.
Aggregation and Filtering: Instead of uploading every row, nodes aggregate data, retaining only critical information like emergencies and key metrics.
This approach drastically reduces the volume of data processed centrally, improving both speed and scalability.
Compliance Made Easy: Sanitizing Sensitive Data
A major challenge in distributed systems is handling Personally Identifiable Information (PII), such as IP addresses. The workflow included a simple Python function to sanitize IPs by zeroing out the last octet, ensuring compliance with regulations like GDPR.
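The exact helper Expanso used isn't shown, but a function along these lines (a representative sketch, with a hypothetical name) zeroes out the last octet:

```python
def sanitize_ip(ip: str) -> str:
    """Zero the last octet of an IPv4 address, e.g. 192.168.1.42 -> 192.168.1.0.

    Coarsening the address removes the host-identifying portion while
    keeping the subnet useful for aggregate analysis.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets[-1] = "0"
    return ".".join(octets)


print(sanitize_ip("203.0.113.42"))  # -> 203.0.113.0
```

Run as part of the local preprocessing step, this ensures PII never leaves the node in identifiable form.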
Real-World Results
In just a few minutes, Expanso’s system processed:
46 million rows of data across 20 nodes.
Structured and sanitized data ready for analysis in BigQuery.
Aggregated metrics and actionable insights for real-time monitoring.
All this was achieved without the need for a costly central data warehouse, highlighting the power of distributed systems.
Why This Matters
Distributed orchestration with Bacalhau and DuckDB offers a smarter, faster, and more cost-effective alternative to traditional pipelines. Here’s why it’s a game-changer:
Cost Savings: Use existing compute resources instead of scaling centralized infrastructure.
Speed: Process data locally for real-time insights.
Flexibility: Adapt workflows for different data types, from raw logs to structured metrics.
Get Involved!
Expanso’s tools and templates make it easy to get started. Check out our public GitHub repository for examples and guides, and join the conversation on social media with #ExpansoInAction.
Have a unique use case? We’d love to hear about it! Share your projects and ideas, and let’s build the future of distributed systems together.
There are many ways to contribute and get in touch, and we’d love to hear from you! Please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!