Why Distributed Computing Isn't Ubiquitous ... Yet
How edge computing and Compute over Data work (10 min)
Thanks to the growth of ubiquitous data, networking, and devices, a groundbreaking shift is occurring in the way people are building applications. It is no longer realistic to move everything to a single location to gain insights; things need to be rethought. To address this problem, people are including distributed needs as core requirements as they build out their applications and platforms, rather than bolting on these features late in the process. We, as platform developers, need to help our users achieve this transformation by giving them distributed-native abstractions and tools, so they can focus on the needs of their applications.
Bacalhau seeks to help you move your compute to where it is needed and to where your data is being generated. Designed to function across zones, regions, clouds, and even countries, Bacalhau manages jobs and tasks globally, where you need them, instead of requiring you to move your data to a central data center. In collaboration with other cloud-native platforms, we are working to shift toward a new era of globally coordinated and distributed computing.
Before we try to convince you of this, let’s take a short walk down memory lane.
Distributed computing has been an active area of investment and research for decades. There are so many great studies and essays on the subject that one wonders if everything has already been covered: Principles of Distributed Database Systems. Distributed Systems: Principles and Paradigms. Fallacies of Distributed Computing. Countless research papers and articles. Why do we have personal computers at all when we could be running every application on a global computational network?
It isn’t that there aren’t benefits; the idea of distributing a job is alluring. The ability to split a job into many components, take advantage of cheaper hardware, and parallelize problems can dramatically improve the economics for many scenarios. Although there are trade-offs - the hit to performance for cross-compute communication, for example - the potential benefits mean it is a dream worth pursuing.
But when faced with reality - complex platforms, hacked-together solutions, and centralized systems that only “sort of” work globally - even sophisticated organizations often do not spread applications across multiple zones in a single cloud region, much less span various regions or clouds.
So, with so much work done on the problem and so much benefit from adopting distributed jobs, the real question is: why aren’t distributed jobs far more broadly adopted?
A Distributed Walk Down Memory Lane
Technology has been in pursuit of distributed workloads since the invention of the ENIAC in the 1940s – one computer in one room is about as centralized as you can get. Mainframes became minicomputers; minicomputers became microcomputers; and microcomputers became microprocessors wired together over the ARPANET, a network that gave us the possibility of one job, many machines.
Caveat: While many of the technologies discussed below have been around since nearly the advent of computing, I have done my best to talk about the waves in the context of when that technology became widely adopted.
Phase 0: The Move to Commodity Compute
The first moves towards broadly used distributed compute took off in the late 1990s, with the canonical example being the launch of commodity distributed computing with Google.
The first Google rack. Unclear if the SRE who built it received a peer bonus for saving money.
Previously, most machines and deployments required vertical scaling. In other words, increasing the amount of compute would require exponentially larger costs. But beyond cost, vertical scaling often created single points of failure. If a single machine had a hardware outage, or needed rebooting for OS upgrades, or a datacenter went offline, the entire system could fail.
Google, and other commodity compute users, recognized the advantages in cost, performance, and uptime when they incorporated an expectation of failure into their system design. Servers would always fail; by spreading their services across more servers there was a greater chance they could stay operational during any one machine’s outage.
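The core of this "expect failure" design can be sketched in a few lines of Python. This is an illustrative pattern, not any particular company's implementation; the replica functions here are hypothetical stand-ins for real servers:

```python
def fetch_from_replicas(replicas, request):
    """Try each replica in turn; any single machine failing is expected, not fatal."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)  # expected: individual commodity servers fail all the time
    # Only the loss of EVERY replica is an actual outage.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")

# Hypothetical replicas: two machines are down, one is healthy.
def flaky(request):
    raise ConnectionError("machine down")

def healthy(request):
    return f"handled {request}"

print(fetch_from_replicas([flaky, flaky, healthy], "query-42"))  # handled query-42
```

The service stays up as long as at least one copy of it is reachable - the whole point of trading one expensive machine for many cheap ones.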
Phase 1 - Virtualization Takes Hold
If adding a single piece of hardware could provide improved uptime and performance, could you virtualize that machine and simulate MANY machines to get even further benefits? Enter virtual machines (VMs).
Virtualizing machines enabled a new level of resource utilization and separation between the hardware platform and the operating system. Applications could now be deployed in a defined and repeatable way to clusters of machines. And because of the virtualization of the machine, each physical machine could be much more tightly packed with applications (including redundant ones to reduce downtime). While the technology has been around for more than half a century, the mainstream adoption of virtualization really took off around the early 2000s with the introduction of VMware ESX and GSX.
Virtual machine – VMware ESX Server (Source). Pretty, Win32 interface and all!
Virtualization was a step up from completely dedicating a single machine to a single application, offering better resource utilization, isolation, and load balancing. Applications could be tightly packed together, yet still retain the isolation necessary to avoid a single bad application bringing down an entire cluster. Further, by abstracting away hardware, cleaner portability and configuration was possible.
There were also groundbreaking new technologies such as live migration, which significantly reduced downtime due to server issues and planned upgrades. However, these VMs were still localized, limiting the true potential of distributed computing architectures. Worse, applications were often still restricted to a single machine (virtual or otherwise), limiting the ability to take advantage of the distributed topology.
Phase 2 - Distribution Via Platforms
The Map/Reduce paper was seminal in rethinking what an application platform could provide to simplify the concepts and utilization of distributed machines. Behind a single interface, developers could now schedule and build distributed workloads at the application level, with little need to dive deeper than top-level structures. This paper inspired popular frameworks like Hadoop and Spark and enabled massive amounts of data to be processed in parallel across a network of machines.
An example of data jobs running in parallel using Spark (Source). Truly the star of the distributed data world.
Platforms like this represented a huge leap in scalability and performance. But even with this level of distribution, the primary-secondary topologies and heavy cross-node networking of these clusters meant computing tasks were still confined to a single data center. They also required adopting a new architecture, new SDKs, and, often, entire rewrites of applications to take advantage of the technology.
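The programming model behind these frameworks is easy to sketch. The classic example is word count: a map phase each node runs on its own shard of the data, and a reduce phase that combines the grouped results. This is a single-process Python sketch of the idea, not Hadoop's or Spark's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_shard(docs):
    """Map: each 'node' turns its shard of documents into (word, 1) pairs."""
    return [(word, 1) for doc in docs for word in doc.split()]

def reduce_counts(mapped):
    """Shuffle + reduce: group pairs by key, then sum the counts per word."""
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    return {word: sum(counts) for word, counts in groups.items()}

# Two shards, as if the corpus were split across two machines.
shards = [["the quick fox"], ["the lazy dog", "the fox"]]
mapped = chain.from_iterable(map_shard(s) for s in shards)  # runs in parallel in a real cluster
print(reduce_counts(mapped))  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The appeal is that the developer only writes the map and reduce functions; the framework handles sharding, scheduling, retries, and the shuffle between phases.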
Phase 3 - It’s In The Clouds
Containerized workloads, popularized by Docker, have revolutionized the way we develop, deploy, and manage applications. This revolution has been bolstered by sophisticated orchestration systems like Kubernetes and Mesos, taking scheduling and scaling to unprecedented levels. This wave of technologies, commonly called Cloud Native compute and rolled up into the Cloud Native Computing Foundation, has profoundly reshaped the landscape of cloud computing.
An example of the Kubernetes dashboard. So much distributed wisdom contained in a single interface (sorry, not sorry). (Source)
These cloud-native platforms make adopting distributed compute technologies much more straightforward because encapsulation of applications and their dependencies is built into the packaging. Additionally, with the declarative nature of both containers and cloud-native orchestrators, abstracting jobs from underlying APIs and systems has never been easier. This allows significantly improved utilization by dynamically moving applications to the environments that are best suited for them and have the space to run them.
However, despite their remarkable capabilities, most distributed compute platforms still operate primarily within a single zone or region. Although they do offer some cross-region support, being based on a centralized platform will always require significant work to support distributed and unreliable networks.
Why Does Anyone Need Distributed Computing in the First Place?
Since fully distributed frameworks seem hard to develop, one may ask whether they are necessary at all. In a world where simplicity is favored, why would anyone take on the complexity of running lots of disparate systems?
Distributed execution was only explored because monolithic systems were hitting real-world limitations. Vertically scaling up individual machines provides some gains, but costs quickly scale out of control as increasing resources within a single machine becomes exponentially more expensive. Distributed execution isn’t just more scalable, it costs less too!
Further, parallelism across distributed systems provides efficiency and redundancy at scale. Workloads like machine learning training benefit from parallel execution across clusters of commodity machines. By sharing memory, disk, and networking throughput across many physical devices, it is possible to get significantly higher aggregate performance than all but the most expensive hardware - with all the benefits of distributed systems we covered earlier.
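The scatter-gather pattern behind that aggregate performance is simple to sketch. Here worker threads stand in for machines (a real cluster would scatter shards over the network), and the names are illustrative rather than any framework's API:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(shard):
    """Each 'worker node' aggregates only its own shard of the data."""
    return sum(shard)

def distributed_sum(data, workers=4):
    # Scatter: split the data into one shard per worker.
    shards = [data[i::workers] for i in range(workers)]
    # Run the partial aggregations concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, shards)
    # Gather: combine the partial results into the final answer.
    return sum(partials)

print(distributed_sum(range(1_000_000)))  # 499999500000
```

No single worker ever needs to hold or scan the whole dataset, which is exactly how a cluster of modest machines can beat one expensive one on aggregate throughput.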
Realistically, distributed processing is the most natural option. Data is increasingly generated and stored across networks of edge devices and cloud servers; moving it to a single big bucket or data lake is already an unnatural act. Why not take advantage of the distributed architecture already in place?
You May Already Be Distributed!
A lot of organizations are already operating a distributed system, often without even knowing it. A quick rule of thumb is to ask yourself how many resources need to be updated when a new version of your application is rolled out. If the answer involves coordinating changes across multiple servers, databases, or microservices, congrats - you're distributed!
Another clue is if your team is following the hyperscale guidance of spreading your systems into multiple zones or regions. To ensure uptime and reliability, teams have to shard databases, deploy isolated regional applications, and scale machines horizontally. That’s distributed too!
Basically, if your IT landscape involves herding cats (cats being servers and services) rather than just one big dog (a beefy central server), you've got a stealth distributed system. But don't worry, you can still tame the cats with the right distributed computing tools!
Why Do Distributed Implementations Fall Short?
While many organizations end up with ad-hoc distributed systems organically over time, these often lack robust distributed computing capabilities out of the box. Typical issues that arise include:
Poor reliability - Loosely coupled components fail unpredictably. Lack of redundancy means single points of failure.
Inefficient scaling - Hard to optimally place computations near relevant data. Adding nodes creates bottlenecks.
No data locality - Jobs and queries require expensive cross-network data shuffling.
Low fault tolerance - Partial failures cascade causing widespread outages.
Slow processing - Distributed data dependencies stall workflows. Lack of parallelism wastes resources.
Compliance risks - Sensitive data spread across systems multiplies security concerns.
High latency - Synchronous calls between distributed modules incur delays.
Difficult orchestration - Deploying updates across disparate systems is complex and risky.
While ad-hoc systems distribute data, they lack first-class support for distributed programming. This leads to brittle, inefficient systems that struggle to leverage the full benefits of distribution.
So, What’s Missing?
This brings us to the crux of the matter - what does it mean to be distributed? Going back to the start of this essay – this has all been covered in depth by giants in the field. In summary, being truly distributed means re-thinking how an application and platform interacts with the network.
Revisit the Fallacies of Distributed Computing and you will find the fallacies listed are as true today as they have ever been:
The network is reliable;
Latency is zero;
Bandwidth is infinite;
The network is secure;
Topology doesn't change;
There is one administrator;
Transport cost is zero;
The network is homogeneous.
Worker nodes that need to check in with a centralized task manager, unreliable networks, widely varying latency, and challenging topologies can exacerbate scheduling problems very quickly. To realize the dream of truly distributed computing, we will need a platform that takes these challenges head-on.
Our Proposal - A Distributed FIRST Platform
So, what's next? This is where we come to Bacalhau - the distributed computing platform. Our tenets in building are the following:
Familiar - Should feel very similar to existing applications and platforms. If you use Kubernetes, Spark, Docker, Python, our platform should feel like a familiar friend.
Resilient - Supporting distributed networks means a very different approach to resilience - particularly a first-class support for a network that could disappear at any time. Bacalhau must support these kinds of topologies natively.
Efficient - Applications can take advantage of idle compute, move jobs to where there are optimization opportunities, and observe (and respond to) dynamic network and pricing changes, all without the user having to take any action beyond detailing the requirements for their job.
Distributed-Native - The platform should represent distributed concepts as first-class elements. Targeting datasets, managing permissions, navigating multi-location shards, and many more features should be as easy as building locally.
With Bacalhau, we are building in the EXPECTATION that machines, applications, and networks are unreliable, and as a result we will give you better scaling and reliability. In the same way that:
Commodity compute allowed people to go beyond a single large machine;
VMs allowed people to go beyond a physical cluster;
Data platforms allowed people to go beyond a single machine;
And containers allowed people to go beyond a cluster of VMs;
Bacalhau will allow people to scale beyond a single zone, meeting data and compute where they are and making truly global deployments a reality for everyone. Bacalhau helps make global compute possible.
Bacalhau is designed to be 'Distributed Everywhere.' The system is architected to handle tasks anywhere across the globe, allocating workloads to where they can be handled most efficiently.
The benefits of this truly distributed world are manifold - it's faster, with job execution happening where it’s needed; it's cheaper, as resources are used more efficiently and costly data transfers are minimized; and it's more secure, with data sovereignty concerns addressed by processing and storing data where it originates.
As we look ahead, the horizon of distributed computing is stretching wider than ever before. Projects like Bacalhau are poised to redefine our understanding of 'distributed.' As we move towards a world where compute is ubiquitous, a truly distributed, globally coordinated computing system isn't just a nice-to-have, it's a necessity. This is an exciting time to be in technology, and we are thrilled to be part of this transformational journey.
We are committed to delivering groundbreaking updates and improvements, and we're looking for help in several areas. If you're interested, there are several ways to contribute, and you can always reach out to us via Slack or Email.
Thanks for reading our blog! Subscribe for free to receive new updates.