If you're dealing with data generated across thousands of devices or locations, you know the pain. Shipping every raw byte back to a central data center for processing is slow, expensive, and fraught with regulatory hurdles.
Powerful databases exist to store this distributed data, but how do you process it before it gets there?
At Expanso, we recently explored this critical challenge in an Azure Cosmos DB TV episode featuring Mark Brown from Microsoft and our CEO, David Aronchick. The episode dives deep into a modern approach: Compute Over Data. Instead of moving mountains of raw data, why not process it right where it's created?
Let’s break down the main points!
The Distributed Data Dilemma
The world is generating data at an exponential rate, much of it unstructured and originating outside traditional data centers. Centralizing everything runs into several roadblocks:
Network bottlenecks: WANs aren't keeping up, and the speed of light imposes hard latency limits. Moving gigabytes or terabytes takes time and costs money.
Data quality and context loss: Raw data from edge devices is often poorly structured. Moving it immediately means losing context like location, local timestamps, device specifics, and more.
Regulatory compliance: GDPR, CCPA, and industry-specific rules restrict moving raw data, especially PII or sensitive operational details, across borders.
Delayed insights: Waiting for data to traverse networks and complex central ETL pipelines before analysis delays action by minutes, hours, or even weeks.
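To put a rough number on the network bottleneck, here is a minimal back-of-the-envelope sketch. The function name `transfer_hours`, the 100 Mbps link, and the 80% efficiency factor are all illustrative assumptions, not figures from the episode.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb over a link of link_mbps.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved (protocol overhead, contention, retries).
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link > 0, and 0 < efficiency <= 1")
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 1 TB of raw data vs. a 1 GB local summary.
    print(f"raw:     {transfer_hours(1000, 100):.1f} h")
    print(f"summary: {transfer_hours(1, 100) * 60:.1f} min")
```

Under these assumptions, 1 TB of raw data needs roughly 28 hours on the wire, while a 1 GB summary of the same data needs under two minutes.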
Bacalhau: Bringing Compute to the Data
This is the gap Expanso, powered by the open-source Bacalhau project, is built to fill. Instead of moving data to compute, Bacalhau runs your processing jobs directly where the data resides.
This lets you run pre-processing steps locally, such as:
Schematization: Transform raw data streams into well-structured formats suitable for use in Cosmos DB.
Enrichment: Add metadata right at the source, preserving context.
Sanitization: Filter or modify sensitive information before it leaves the local environment, aiding compliance.
Aggregation: Reduce data volume by calculating summaries over time windows locally, sending only the essential information. This cuts network traffic and central processing costs.
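The four steps above can be sketched in plain Python. This is a minimal illustration of the kind of script you might run as a job next to the data, not Bacalhau's API or the code from the demo. The raw line format, the field names, the `site` and `region` values, and every function name here are hypothetical.

```python
import hashlib
import json
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw format: "<epoch_seconds>,<sensor_id>,<temperature_c>,<operator_email>"
RAW_LINES = [
    "1700000000,sensor-1,21.5,alice@example.com",
    "1700000030,sensor-1,22.5,alice@example.com",
    "1700000065,sensor-1,30.0,bob@example.com",
    "not,a,valid",
]


def schematize(line):
    """Parse one raw line into a typed record, or return None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 4:
        return None
    ts, sensor_id, temp, email = parts
    try:
        return {"ts": int(ts), "sensor_id": sensor_id,
                "temperature_c": float(temp), "operator_email": email}
    except ValueError:
        return None


def enrich(record, site, region):
    """Attach local context that would be lost once the data leaves the site."""
    iso = datetime.fromtimestamp(record["ts"], tz=timezone.utc).isoformat()
    return {**record, "site": site, "region": region, "ts_iso": iso}


def sanitize(record):
    """Swap the email for a truncated SHA-256 digest (pseudonymization only)."""
    clean = {k: v for k, v in record.items() if k != "operator_email"}
    digest = hashlib.sha256(record["operator_email"].encode("utf-8")).hexdigest()
    clean["operator_hash"] = digest[:12]
    return clean


def aggregate(records, window_seconds=60):
    """Summarize temperatures per site, sensor, and fixed time window."""
    buckets = defaultdict(list)
    for r in records:
        window_start = r["ts"] // window_seconds * window_seconds
        buckets[(r["site"], r["sensor_id"], window_start)].append(r)
    summaries = []
    for (site, sensor_id, window_start), rows in sorted(buckets.items()):
        temps = [r["temperature_c"] for r in rows]
        summaries.append({
            "id": f"{site}-{sensor_id}-{window_start}",  # Cosmos DB items carry an "id"
            "site": site,
            "region": rows[0]["region"],
            "sensor_id": sensor_id,
            "window_start": window_start,
            "count": len(temps),
            "mean_temperature_c": sum(temps) / len(temps),
            "max_temperature_c": max(temps),
        })
    return summaries


def run_pipeline(lines, site, region):
    """Apply all four steps; returns (sanitized records, window summaries)."""
    parsed = [r for r in map(schematize, lines) if r is not None]
    clean = [sanitize(enrich(r, site, region)) for r in parsed]
    return clean, aggregate(clean)


if __name__ == "__main__":
    _, summaries = run_pipeline(RAW_LINES, site="plant-7", region="westeurope")
    print(json.dumps(summaries, indent=2))
```

Only the window summaries would need to leave the site in this sketch. Note that hashing an email is pseudonymization rather than anonymization, so whether it satisfies a given regulation is a separate question.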
The Synergy: Smarter Processing, Global Storage
The combination is powerful. Bacalhau acts as the edge/distributed processing layer, preparing and refining data. The optimized data then flows into the nearest Cosmos DB regional replica.
Key benefits of this approach include:
Cost savings: Reduced data transfer, lower central compute/storage needs.
Faster insights: Analyze data quicker by processing it closer to the source and landing analysis-ready data in Cosmos DB.
Enhanced security and compliance: Minimize raw data movement and sanitize sensitive information locally.
Increased resilience: Better handling of intermittent network connectivity at the edge.
Simplified operations: Declaratively manage distributed jobs without building complex custom orchestration.
The Takeaway
If you're building applications that span multiple locations, deal with IoT/edge devices, or face data gravity challenges, the traditional "move-then-process" model is holding you back. The "Compute Over Data" approach, enabled by Bacalhau working in concert with globally distributed databases like Azure Cosmos DB, offers a more efficient, cost-effective, and compliant path forward.
Want the full story and the live demo? Watch the complete Azure Cosmos DB TV episode now! 👇
What's Next?
To get started, install Bacalhau and give it a shot.
If you don’t have a node network available and would still like to try Bacalhau, you can use Expanso Cloud. You can also set up a cluster on your own (with setup guides for AWS, GCP, Azure, and more 🙂).
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Reach out at any of the following locations:
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. Read more about the difference between open-source Bacalhau and commercially supported Bacalhau in the FAQ. If you want to use the pre-built binaries and receive commercial support, contact us or get your license on Expanso Cloud!