Stop Paying for Data Noise: Optimize Your Pipeline with Compute-Over-Data

Why a great part of your data should never hit the cloud.

Federico Trotta's avatar
Federico Trotta
May 01, 2025

Are your cloud bills making you wince? Are you drowning in terabytes of logs? You're likely caught in a common, yet costly, data pipeline trap!

Many organizations default to a "ship everything" model. They collect data everywhere (servers, devices, applications) and funnel it all to a central location before analyzing it.

This seems logical, but it hides massive inefficiencies. Let’s discuss them.


The Cascade of Costs

This traditional approach triggers a cascade of expenses for every byte generated, especially the noisy, low-value ones:

  • Egress costs: Paying to move data out of its source environment.

  • Ingestion costs: Fees charged by your central platform just to receive the data.

  • Storage costs: Ongoing charges, often in expensive "hot" tiers for data that might never be accessed again.

  • Query costs: Even analyzing the data costs compute time and resources.

Industry analyses suggest that a staggering amount of this centrally stored data, potentially around 20%, is never used again after it lands. Yet organizations pay the full price of the journey for every byte.
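To see how these line items compound, here is a back-of-envelope model. All the rates, the volume, and the 20% unused share below are illustrative assumptions, not real cloud quotes:

```python
# Back-of-envelope pipeline cost model (hypothetical placeholder rates).
GB_PER_MONTH = 10_000          # assume 10 TB of raw logs per month
EGRESS_PER_GB = 0.09           # assumed egress rate out of the source cloud
INGEST_PER_GB = 0.50           # assumed observability-platform ingestion fee
STORAGE_PER_GB_MONTH = 0.023   # assumed hot-tier storage rate

unused_fraction = 0.20         # share of data never queried after storage

total = GB_PER_MONTH * (EGRESS_PER_GB + INGEST_PER_GB + STORAGE_PER_GB_MONTH)
wasted = total * unused_fraction

print(f"Monthly pipeline cost: ${total:,.0f}")
print(f"Spent on never-used data: ${wasted:,.0f}")
```

Even with made-up numbers, the shape of the problem is clear: a fifth of the bill buys nothing, and every rate applies to noise and signal alike.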

Beyond cash, this model adds operational drag:

  • Long transfer times.

  • The engineering complexity of building and maintaining the pipelines themselves.

  • Increased security risks from moving sensitive data unnecessarily.

Why Does Raw Data Overwhelm Your Systems?

Sure, hot storage in the cloud offers fast access, but it's pricey. And even cheaper object storage isn't efficient if the data sitting there is redundant or useless without preprocessing.

The catch? You often need significant compute power to preprocess or analyze data before deciding if it's worth keeping or shipping–compute power that might not be available where the data originates.

The Real Reason We Ship Everything

So why stick with this inefficient pattern? Primarily because most existing tools are built for it. Log shippers, message queues, observability platforms, and data warehouses assume data must be centralized first. Coupled with the lack of easy-to-deploy compute at the edge, teams feel forced to ship everything, accepting the cost and complexity.

A Smarter Way: Compute-Over-Data

There's a better approach: flip the model and bring the compute to the data. Instead of shipping raw, noisy data first, process it at the source.

Imagine running logic directly where data is born:

  • Filter verbose logs down to just critical errors before they leave the server.

  • Aggregate raw metrics on an IoT gateway before sending summaries.

  • Enrich events with local context instantly.

  • Compress data intelligently based on its type.

  • Decide what data is valuable enough to move before incurring network and storage costs.
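As a concrete sketch of the first idea, here is what source-side filtering might look like. The log format and severity convention below are assumptions for illustration, not a prescribed schema:

```python
# Sketch of source-side filtering: keep only critical lines before shipping.
raw_logs = [
    "2025-05-01T10:00:00 INFO  health check ok",
    "2025-05-01T10:00:01 DEBUG cache hit for key=42",
    "2025-05-01T10:00:02 ERROR payment service timeout",
    "2025-05-01T10:00:03 INFO  request served in 12ms",
    "2025-05-01T10:00:04 ERROR disk usage above 95%",
]

def keep(line: str) -> bool:
    """Ship only ERROR-level lines; routine noise never leaves the server."""
    return " ERROR " in line

shipped = [line for line in raw_logs if keep(line)]

# Only 2 of 5 lines incur egress, ingestion, and storage costs.
print(f"shipping {len(shipped)}/{len(raw_logs)} lines")
```

A few lines of logic at the source cut the shipped volume by 60% in this toy sample; on real verbose logs, the reduction is often far larger.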

The Compute-Over-Data approach leads to tangible benefits:

  • Massive cost reduction: Stop paying egress, ingestion, and storage fees for useless data.

  • Improved signal visibility: Filter out the noise early to spot important events faster.

  • Enhanced security & compliance: Keep sensitive data local; only move aggregated or necessary subsets.

How To Do So? Meet Bacalhau!

This isn't just theory. Bacalhau is an open-source framework designed specifically for Compute-Over-Data. It acts as an orchestration layer, letting you run compute jobs (packaged as Docker containers or WASM modules) directly where your data resides, be it data center servers, edge devices, or even workstations with GPUs.

Instead of pulling data to compute, Bacalhau sends compute to data. It helps you:

  • Process data at the edge: Execute filtering, aggregation, or analysis before data hits expensive network hops or ingestion endpoints.

  • Slash data volumes: Send only the valuable results, drastically cutting costs for downstream systems.

  • Handle diverse workloads: Run various job types (batch, long-running services, ops, daemon jobs).

  • Operate reliably: Designed for distributed environments, handling intermittent connectivity gracefully (crucial for edge).
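To make the edge-processing idea concrete, here is the kind of logic you might package into a container for Bacalhau to dispatch to an IoT gateway: reduce raw readings to one compact summary before anything crosses the network. The data, field names, and job wiring are illustrative assumptions; see the Bacalhau docs for actual job specifications:

```python
# Sketch of an aggregation job run where the metrics originate,
# shipping one small summary record instead of every raw reading.
def summarize(readings):
    """Reduce raw sensor readings to a compact summary for downstream systems."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }

raw = [21.3, 21.5, 22.1, 21.9, 21.7]   # raw samples stay on the gateway
summary = summarize(raw)                # only this record moves downstream
print(summary)
```

Packaged as a container, the same logic runs unchanged wherever Bacalhau schedules it, which is exactly the point: the code travels, the raw data doesn't.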

The Bottom Line

The "ship everything" era is proving unsustainable. Compute-Over-Data, powered by tools like Bacalhau, offers a more intelligent, secure, and cost-effective future for handling distributed data. Stop paying for noise and start optimizing your pipelines for value.


What's Next?

To get started, install Bacalhau and give it a shot.

If you don’t have a node network available and would still like to try Bacalhau, you can use Expanso Cloud. You can also set up a cluster on your own (with setup guides for AWS, GCP, Azure, and more 🙂).

Get Involved!

We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Reach out at any of the following locations:

  • Expanso’s Website

  • Bacalhau’s Website

  • Bacalhau’s Bluesky

  • Bacalhau’s Twitter

  • Expanso’s Twitter

  • TikTok

  • Youtube

  • Slack

  • LinkedIn

  • Careers Page

Commercial Support

While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. Read more about the difference between open-source Bacalhau and commercially supported Bacalhau in the FAQ. If you want to use the pre-built binaries and receive commercial support, contact us or get your license on Expanso Cloud!

