Distributed Warehousing Made Easy With Bacalhau

How to use Bacalhau for commercial distributed warehousing (7 min)

and

Oct 26, 2023

Nowadays many organizations initiate data warehouse projects to enhance data access, covering both current and historical data, propelling their transformation into data-centric entities. After determining the required data and devising an appropriate data model, they use an ETL (Extract, Transform, Load) process to transfer this data from its source to the warehouse. This ETL process is continuous, ensuring the warehouse regularly receives updated data. Once stored, users can analyze the data, typically with SQL.

With distributed data, many organizations find it challenging to synchronize and merge information from various sources, complicating data warehousing. ETL processes often exacerbate these complexities, introducing additional hurdles to gaining lucid insights. In this article, we discuss the benefits of establishing a distributed data warehouse and its potential to enhance both clarity and efficiency in data pipelines.

Why is this a problem?

Centralized data warehouses, despite their widespread adoption, present various challenges affecting their efficiency and effectiveness. The primary issues with this data warehouse model include the following.

Scalability - As data generation increases and analytical expectations rise, centralized systems can become overwhelmed. Expanding these systems can be costly and intricate, making it difficult to meet growing demands, especially when scaling specific datasets independently from the entire system.
Performance - Centralized warehouses may face performance issues with increased user and application access. Queries can lag, causing data retrieval and analysis delays, or there's a growing dependency on caching layers to reposition data for more accessible warehouse use.
Data engineering costs - Data, originating from diverse sources in varying formats, necessitates relocation for manipulation and querying when centralizing a data warehouse. This relocation requires Data Engineering teams to execute intricate and often lengthy ETL processes, which can be costly, especially when data traverses network boundaries or cloud services.
Maintenance - Centralized data warehouse upkeep can be expensive, necessitating investments in hardware, software, and expertise. System disruptions from hardware failures, software glitches, or other complications can impede the entire data analytics process instead of affecting only specific data segments.
Slow to change - Integrating new data sources or modifying schemas in centralized data warehouses can be tedious and slow, hampering swift adaptation to evolving business requirements. Source changes often demand ETL process modifications and the involvement of data engineering teams.

With many organizations investigating Data Mesh architectures, where data is treated and managed like a product by the teams that generate it, there may be a better way than centralising these data.

An alternative approach

When teams produce data and treat it as a product, potentially utilizing a federated governance model, querying the data at its source can eliminate costly ETL processes. Even outside of a data mesh framework, sending computation to the location of the data is often as efficient as the reverse.

By positioning compute nodes near the data produced by various systems, we can dispatch computational tasks or queries directly to the data, while returning the results through our preferred storage mechanism. Transferring computation results is typically more manageable and faster than transmitting the complete dataset, allowing for near instantaneous processing alongside the standard backup of consolidated data.

Implementing a distributed data warehouse in this way will result in the following benefits.

Reduced data transit - Querying data in-place significantly reduces the volume of data transferred from individual stores to a centralized location, cutting both transit costs and associated data engineering efforts. This approach enables users to retrieve way smaller result datasets from their queries, while compressing and archiving the raw data for auditing purposes.
Improved scalability - A distributed data warehouse is designed to manage many compute nodes, distributed across different regions, its easy to scale by adding more compute nodes to new locations. Doubling the number of stores shouldn’t mean having to more than double the size and spend n a centralized data warehouse. Adding more compute nodes to the hardware already in each store should be all that is needed to add more nodes to the cluster.
Reduced vendor lock-in - Choosing a single provide for a centralized data warehouse today, doesn’t mean it will be fit for the same purpose in a year’s time. Limited functionality, new approaches to data analysis, change of ownership are all things that could reduce the effectiveness of the chosen solution. Unfortunately, with potentially massive sunk cost, there may be too little traction for change, and too much friction to do anything but suffer with the single provider.
Powerful real-time querying - There’s no need to wait until today’s data has been moved and imported at the end of the day, if a user wants to query today’s data now, they should be able to query it immediately. By querying the data where it is generated, users can perform queries on progress today and make immediate decisions reflecting the current state of the business.

Bacalhau as a Distributed Data Warehouse Orchestrator

Bacalhau offers the advantages of a distributed data warehouse with few modifications to current processes. For many entities, essential data and compute resources for analytical tasks are already dispersed, whether across varied database instances, application servers, edge locations, or otherwise.

Bacalhau enables organizations to capitalize on these pre-existing distributed resources, forming an adaptable data warehouse when needed. By deploying compact Bacalhau agents close to the data's location, computational tasks are directed to the data's vicinity, negating the need to transfer large data volumes across networks. Comprehensive ETL processes and data models remain unchanged.

Bacalhau already makes use of the following components.

Flexible compute nodes - Bacalhau provides Docker and WebAssembly execution engines out of the box, meaning as long as you can package your workload, it is likely to run well inside the network. In addition, the compute nodes that perform the work are also open, with recent additions adding support for pluggable execution engines. This means it is possible to build your own executors to suit your own purpose, whether that is executors for Microsoft .NET, or RPG on IBM AS/400.
Data access options - As discussed previously, moving large amounts of data around the network is expensive both in terms of cost as well as performance and scalability. By providing more options around local access to data, Bacalhau will reduce the need to move data across the network, instead bringing the code to the data, a much less expensive proposition. By default, Bacalhau provides support for S3-compatible object stores, the Interplanetary Filesystem (IPFS) (private or public), and locally mounted data. By allowing local storage devices to be mounted inside the docker container, or webassembly runtime, the data is available to the computation as far as the local device is able to deliver it.
Job selection - Having fine-grained control over how jobs are allocated to compute nodes, and with the cooperation of those compute nodes, is an extremely powerful feature of Bacalhau that has been getting better and better with each release. As each compute node can be configured with labels, or the form “property=value” it is possible to describe many attributes of the compute node. Perhaps it might have a geographic location at a particular scale, or a store identifier, or a description of its type. All of these values can be used to target a job at a specific compute node, or a group of nodes. In addition to this selection criteria, compute nodes are asked if they are able to run a specific job, and they have the opportunity to accept or reject the work based on. These job selection policies, and the availability of node labels, make it possible to target a precise set of nodes for the execution. With the recent release of Bacalhau 1.1 it is now ever possible to use expressions when comparing labels, allowing for instance to target nodes where a specific label is present in a list (country=fr,ca) or where the number of PoS devices < 10 (pos_count < 10). The following job selections are possible.
1. The local availability of the data requested, e.g. reject any job where we don’t have the local mapping.
2. The result of an external program which can be executed to determine whether the node is happy to take the job.
3. A web request, where the compute node can make a call to a web-service to determine whether to run the job or not.
4. Reject any jobs that don’t require access to any data.
Freedom from vendor lock-in - Choosing Bacalhau, and Open Source software, means that there is no longer any lock-in to a single vendor. Users are free to swap out any component of the system for any other, meaning that analysis today can be with DuckDB, with Apache DataFusion tomorrow, and custom hand-written code the day after that.

Conclusion

By taking advantages of the features provided by Bacalhau, it’s possible to solve a lot of the problems caused by the centralization of organizational data. As a data warehouse for companies and data infrastructures in every size. Bacalhau is flexible enough to be implemented in a whole scale of different scenarios, from Edge ML to automatic preprocessing of data on the edge device. Also Bacalhau the flexibility of Bacalhau offers you flexibility in a world with growing data demand. The full demo with the required code will be published next week and will also be available on our parent website.

What’s Next?

Stay tuned as we continue to enhance and expand Bacalhau's capabilities. We are committed to delivering groundbreaking updates and improvements. We're looking for help in several areas. If you're interested in helping out, there are several ways to contribute and you can always reach out to us via Slack or Email.

The Bacalhau project is available today as Open Source Software and the public GitHub repo can be found here. If you like the project, please give us a star ⭐

If you need professional support from experts, you can always reach out to us via Slack or Email.