Introducing Bacalhau Amplify: A Simple Way to Augment Data Processing Jobs

Scope and Goals

Mar 30, 2023

A long time ago, I sat in the dusty basement of a nuclear power plant in the US. There were no windows. My only link to the outside was a piece of technology that can best be described as sonar on land.

I spent days staring at waterfalls of data. Streams of live data that used the then-standard blue-yellow-red color map made the CRT monitor look like it was melting. What was I doing, you ask? I was looking for tiny splashes of activity in that ocean of data.

Of course, what I needed was a simple way to batch that huge amount of data through simple signal-processing algorithms to produce downsampled images and descriptive metadata.

I needed to enrich the data via transformations and algorithms, enhance the data via downsampling and compression, and explain the data with images and statistics.

These ideas form the basis of our new project, Bacalhau Amplify.

Star us on GitHub

Big Data is Common

Big data has arguably come and gone. We now live in a world where everyone from web developers to business analysts requires simple tools to march through tranches of data.

The pattern these people encounter most often is that the work is “embarrassingly parallel.” But people often forget that most batch workloads are also embarrassingly repeatable.

Many jobs are so repeatable that we think it’s about time to automate them away.

Understandable Data is Not Common

The depth and breadth of the world’s data have both grown exponentially. This led to architectural data patterns like the data lake, which make it easy to write and store immense amounts of information. Opponents sarcastically suggest this pattern should be called “write once, read never (WORN).”

But I’d argue that the main problem with the data lake approach isn’t the amount of data. The problem is that it’s not observable. There’s rarely any metadata attached to describe what the data is. Technical or statistical information is even rarer. Hell, I’d be happy if people just named their files.

We want to empower data users by encouraging transparency and explainability. We think it’s possible to automatically infer metadata and descriptive statistics.
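
To make "automatically infer metadata and descriptive statistics" concrete, here is a minimal sketch in Python. It is not Amplify code: the function name `describe_csv` and the shape of the summary are assumptions made for illustration, and it only handles CSV text using the standard library.

```python
import csv
import io
import statistics


def describe_csv(text):
    """Infer simple metadata and per-column statistics from CSV text.

    Hypothetical helper for illustration only; not part of Amplify.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"rows": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        raw = [row[name] for row in rows]
        # Treat empty strings and absent trailing fields as missing.
        present = [value for value in raw if value not in (None, "")]
        info = {"missing": len(raw) - len(present)}
        if not present:
            info["type"] = "empty"
        else:
            try:
                numbers = [float(value) for value in present]
            except ValueError:
                # At least one value is not a number: describe it as text.
                info["type"] = "text"
                info["distinct"] = len(set(present))
            else:
                info["type"] = "numeric"
                info["min"] = min(numbers)
                info["max"] = max(numbers)
                info["mean"] = statistics.fmean(numbers)
        summary["columns"][name] = info
    return summary


sample = "sensor,reading\na,1.5\nb,\nc,2.5\n"
summary = describe_csv(sample)
```

Even a summary this small (row count, missing values, column types, value ranges) is more than most files in a data lake carry today.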

Novel Use Cases

Downstream data applications like machine learning (ML), analytics, and even web development expect data to be clean, reliable, and in appropriate formats.

ML algorithms might appreciate augmentation in the form of added noise, rotations, or translations. Web developers could benefit from automated image classification and tagging to expose their content to their users. Analysts could benefit from basic default cleaning like removing null values and asserting data sanity.

The permutations of the requirements of these use cases are vast and quite often domain dependent. But we believe that there is a set of basic enrichments that can instantly add value to a wide range of use cases.
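
As a rough illustration of the analyst case, removing null values and asserting data sanity can be as small as the sketch below. The helper names `drop_incomplete` and `out_of_range`, the record layout, and the temperature bounds are all hypothetical, chosen only to show the idea.

```python
def drop_incomplete(records, required):
    """Keep only the records where every required field has a value."""
    return [
        record
        for record in records
        if all(record.get(field) not in (None, "") for field in required)
    ]


def out_of_range(records, field, low, high):
    """Return the records whose field lies outside the range [low, high]."""
    return [record for record in records if not low <= record[field] <= high]


# Hypothetical sensor readings, for illustration only.
records = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},
    {"id": 3, "temp": 480.0},
]
clean = drop_incomplete(records, ["id", "temp"])
suspect = out_of_range(clean, "temp", -50.0, 60.0)
```

Here `clean` drops the record with the missing temperature, and `suspect` flags the reading that is outside the assumed plausible range rather than silently discarding it.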

Amplify: The Premise

These ideas, and more, led us to the conclusion that we need three things to make this happen:

1. An open, immutable data source
2. A scalable, decentralised computational capability
3. A tool to describe common workflows to produce derivative data

You can see where this is going.

“BUT HANG ON ONE JEFFING MINUTE!” you scream out at your phone screen.

“1. IPFS!”

“2. BACALHAU!” — yes, you can shout in hyperlinks

“Oh, ok. 3… ¯\_(ツ)_/¯”

The Amplify project provides a tool to connect to data sources, run the data through a predefined computational graph, using Bacalhau as the compute provider, and record the derivative result back in IPFS for downstream use.

The computation graph contains a wide range of tools that enhance, enrich, and explain your data, from common image transformations to Parquet summary statistics. We’re aiming to provide hundreds of generic primitives that can transform 80% of the most common data types we see in IPFS.
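
To give a feel for what "running data through a predefined computational graph" means, here is a small local sketch. It is an assumption-laden toy, not Amplify's design: the function `run_graph`, the step names, and the graph layout are invented for illustration, and every step runs in-process instead of being submitted to Bacalhau or stored in IPFS.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(steps, dependencies, data):
    """Run each step after the steps it depends on and collect the results.

    steps: maps a step name to a function taking (data, *upstream_results).
    dependencies: maps a step name to the list of steps it depends on.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](data, *upstream)
    return results


# Hypothetical three-step graph: decode bytes, split into words, summarise.
steps = {
    "decode": lambda data: data.decode("utf-8"),
    "words": lambda data, text: text.split(),
    "stats": lambda data, words: {"count": len(words), "longest": max(words, key=len)},
}
dependencies = {"decode": [], "words": ["decode"], "stats": ["words"]}

results = run_graph(steps, dependencies, b"enhance enrich explain")
```

Describing the work as named steps plus dependencies is what makes it repeatable: the same graph can be applied to every new piece of data without anyone rewriting the job.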

Convinced? — Next Steps

Does this resonate? Is it useful for your use case?

If you’d like to get involved at this early stage and direct development, then reach us on Slack. We’re interested in capturing ideas, use cases, requests for functionality, and, of course, help!

If you want to follow progress, then we’ll be releasing a follow-up blog post in a couple of months with a description of all the functionality we’ve implemented before the Compute Over Data Summit on May 9th, 2023, in Boston.

In the meantime, head over to the repository, track what we’re aiming for on May 9th, or try an early version right now!