Compute Over Data Summit - Technical Deep Dive
The CoD Summit² was held recently in Lisbon, and the Bacalhau team announced a few new features that this blog post will attempt to dissect in some detail.
Lisbon is famous for many things, and food is one of them. As the BBC Good Food guide clearly states:
“It’s almost impossible to visit Lisbon and not be confronted with Bacalhau.”
The CoD Summit² was held recently in Lisbon, and we in the Bacalhau team had the privilege of attending.
When I entered the restaurant, the server said, "Ahhhh - you must be with the fish people," and I found myself in an unintended encounter with Bacalhau.
I was confused, so I searched around for clues as to who and where these fish people were, until I finally saw that everyone on the team was wearing these t-shirts:
After exchanging a knowing glance with the waiter, I found my table and had a great evening with my fellow fish people on the Bacalhau team.
The deep dive
While in Lisbon, we announced a few new features that this blog post will attempt to dissect in some detail:
WASM
Airflow Support
FIL+
WASM
You can now run WASM workloads inside Bacalhau! This means that workloads written in languages such as Rust, C, C++, and Go can now run on Bacalhau without needing Docker.
WASM is designed as a compilation target for various languages. It can run workloads outside the context of a browser, and most importantly for Bacalhau, it can be made to guarantee determinism.
Bacalhau uses the wazero library to run WASM code in its dedicated WASM executor. It can run any program that is written with the WASI interface in mind.
You can watch the video of the Bacalhau WASM executor in action here:
WASM: part 1 - the workload
To run a WASM job in Bacalhau, just as with Docker, we will need some kind of workload - i.e., a job to actually run.
With Docker, this would be an “image” and a “command” to run in that image. For example, we could say, “please run ffmpeg with the following parameters to the resize command.”
In that case, our image is “ffmpeg” and our command is the resize command.
With a WASM workload, we don't point at a Docker image; rather, we point at some pre-compiled WASM bytecode. We get this bytecode by compiling a Rust program to WASM. For example, here is a short “cat”-style program in Rust that will print out the contents of a file (a minimal sketch, compiled for the wasm32-wasi target):
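use std::env;
use std::fs;
use std::process;

// A minimal sketch of a "cat"-style workload: read the file named by the
// first argument (for example /data/foo.txt once Bacalhau has mounted our
// data) and print its contents to stdout. Built with something like:
// cargo build --target wasm32-wasi --release
fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 {
        eprintln!("usage: cat <file>");
        process::exit(1);
    }

    match fs::read_to_string(&args[1]) {
        Ok(contents) => print!("{}", contents),
        Err(err) => {
            eprintln!("could not read {}: {}", args[1], err);
            process::exit(1);
        }
    }
}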
First of all, we would compile the Rust code above into WASM bytecode, at which point we have a couple of options for running it:
run Bacalhau directly against the local “cat.wasm” file we just compiled
upload the “cat.wasm” file to IPFS, and then run Bacalhau against the IPFS CID
If we choose the first option, Bacalhau will automatically upload the cat.wasm file to IPFS and then run the job against that IPFS CID.
We can use the “bacalhau wasm run” command to do either of these:
bacalhau wasm run \
    {cid-of-wasm | <local.wasm>} \
    [--entry-point <string>] \
    [wasm-args ...] [flags]
The “--entry-point” flag is the name of the function to call in the WASM workload, and any additional arguments are passed to the WASM program as command-line arguments.
Because code written in Rust, C, C++, Go, etc. can be compiled to WASM, this opens up some interesting use cases.
For example, the following screenshot shows a WASM workload that is running a Rust program that does Seam Carving on an image:
WASM: part 2 - the data
When we run our WASM workload in Bacalhau, we can also pass additional CLI arguments to the WASM program. For example, if we wanted to run the “cat.wasm” program above against a file called “foo.txt”, we would do the following:
bacalhau wasm run cat.wasm /data/foo.txt
Our program would pick up “/data/foo.txt” as a CLI argument and then attempt to print out the contents of that file.
The problem is we have not yet told Bacalhau where to find the contents of the “foo.txt” file. We can do this by using the “-v” flag combined with an IPFS CID:
# first get a CID of some content
echo "hello world" > /tmp/myfile.txt
cid=$(ipfs add -q /tmp/myfile.txt)
# now run our bacalhau job
bacalhau wasm run -v $cid:/data/foo.txt cat.wasm /data/foo.txt
Now that we can mount data from IPFS inside a WASM job on Bacalhau, we can start to develop some useful WASM workloads and reuse them across different data CIDs.
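For example, here is a hypothetical word-count workload (the name and paths are illustrative): it only ever reads whatever file it is given as an argument, so the program itself never changes between datasets.

use std::env;
use std::fs;

// A hypothetical reusable workload: count the words in whatever file the
// -v <cid>:/data/input.txt mapping has mounted for us. Only the CID on the
// command line changes between runs; the compiled wordcount.wasm does not.
fn main() {
    let path = env::args().nth(1).unwrap_or_else(|| "/data/input.txt".to_string());
    let contents = fs::read_to_string(&path).expect("could not read input file");
    println!("{} words in {}", contents.split_whitespace().count(), path);
}

Running it against a new dataset is then just a matter of repeating the “bacalhau wasm run” command above with a different CID in the “-v” flag.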
WASM: part 3 - determinism
We have a verification strategy that depends on Bacalhau workloads producing deterministic results. This means that for the same invocation (i.e., the same code run against the same data), the output should be identical across multiple runs.
This is very hard to do with generic Docker workloads. There are various places in the system where entropy creeps in:
Wall clocks – time is never in sync in distributed systems (and jobs won’t be guaranteed to run at identical times either!)
Random numbers
Remote network (and network failures)
Multi-threading
Values in garbage memory
Thermal noise from the CPU
Cosmic rays
Running WASM inside Bacalhau makes it possible to define how to handle these various sources of entropy. For example, we can configure the system clock to move forward one "tick" every time a command is executed, meaning every run of the job sees exactly the same sequence of times.
Bacalhau does not allow network access from inside a job, so we don't need to worry about that. Finally, we can configure WASM to use a deterministic random number generator, meaning that the same sequence of random numbers will be generated each time.
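To make the random number point concrete, here is a toy illustration of the property we are after (this is only an illustration, not Bacalhau's actual implementation, which configures the WASM runtime on the host side): a generator seeded with a fixed value produces exactly the same sequence on every run, on every machine.

// A toy, fixed-seed linear congruential generator. Given the same seed it
// yields exactly the same "random" sequence on every run, which is the
// property a verifiable workload needs. (Illustration only; not how the
// Bacalhau executor does it internally.)
struct DeterministicRng {
    state: u64,
}

impl DeterministicRng {
    fn new(seed: u64) -> Self {
        DeterministicRng { state: seed }
    }

    fn next(&mut self) -> u64 {
        // Constants from Knuth's MMIX LCG
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state
    }
}

fn main() {
    let mut rng = DeterministicRng::new(42);
    for _ in 0..5 {
        println!("{}", rng.next()); // identical output on every run
    }
}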
Getting our WASM executor to guarantee deterministic behavior unlocks the verification strategy for Bacalhau, and that means we have taken a huge leap toward a functioning decentralized compute network.
Airflow Support
Bacalhau now supports Airflow! This is exciting because it means that if you currently work with Airflow DAGs, you can now run any of your Airflow stages on Bacalhau, consume data on IPFS, and automatically publish the results of each stage to IPFS and Filecoin!
We have written an Airflow Provider that allows an Airflow DAG stage to call out to the Bacalhau network when it runs. Because Bacalhau supports Docker, it means that as long as your Airflow stage can be Dockerized, you can run it on Bacalhau.
You can watch a video of the Bacalhau Airflow support here:
Airflow Support: part 1 - the provider
Our Airflow provider is called “bacalhau-provider” and it lives here.
Using the provider is very simple: you import the following operators:
BacalhauDockerRunJobOperator - used to run a Docker job
BacalhauWasmRunJobOperator - used to run a WASM job
BacalhauGetOperator - used to get the results of a job
And then chain them into a DAG as you would with any other Airflow operator.
We then use the xcoms system in Airflow to pass results from one stage to another.
The following example shows this in action:
Notice how the red arrow highlights where we are using xcoms to pass the results of one stage to another.
Airflow Support: part 2 - the workload
When a single stage is being run by Airflow, it results in a Bacalhau job being invoked on the network. Because we have both the BacalhauDockerRunJobOperator and the BacalhauWasmRunJobOperator, you can choose if your stage is better suited to be run as a Docker job or a WASM job.
When a Bacalhau job stage is reached, the operator converts the Python spec into a JSON payload and submits it to the Bacalhau network in exchange for a job ID. It then blocks the pipeline until that Bacalhau job completes before moving on to the next stage.
Each stage can consume data from IPFS as well as the data from the previous stage. We can also combine the WASM and Docker operators and make a single DAG that uses both types of workload:
The following image shows the output of each stage of our pipeline:
Airflow orchestrates Bacalhau to run both WASM and Docker workloads as part of a DAG running on a decentralized compute network!
FIL+
Bacalhau is integrating with Evergreen, so we can start remunerating compute providers with FIL+ in exchange for running jobs.
The concept is simple: if a Bacalhau job operates on existing FIL+ data, chances are the output data that the job produces will also be useful to humanity and, therefore, eligible for FIL+.
Because the compute providers that run the job will also be the custodians of the output data, it makes sense for them to be the ones who receive FIL+. This has the useful side effect of incentivizing them to run jobs in the first place. Consider this a bridge towards an incentivized network before we officially release a token for incentivization.
The FIL+ integration with Bacalhau will use human moderators to ultimately decide if a job deserves FIL+ and, if so, how much FIL+ should be awarded to the compute provider that ran the job (and so has the storage deal for the results). The moderators will use a centralized dashboard that constantly scans the Bacalhau network for jobs that are eligible for FIL+.
You can watch our talk all about our FIL+ integration here:
FIL+: part 1 - existing FIL+ data
To help our moderation team, we need to quickly know whether a CID used in a Bacalhau job has already been awarded DataCap. To do this, we will constantly scan the public Filecoin network and cache any CIDs we notice have had the “verified” bit set in their deal.
FIL+: part 2 - selection of jobs
When a user submits a new job to the network, compute providers will ask the following question: "is it worth it for me to run this job?"
To help answer that question, the compute provider will reach out to our centralized dashboard and ask whether the input CIDs for the job have already been blessed with FIL+. The compute provider can still choose to run the job anyway; it is just a lot more likely to be awarded DataCap if the job operates on existing FIL+ data.
FIL+: part 3 - recording results CIDs
Once the job has been completed, our centralized dashboard will be notified of the results CIDs. This means that a moderator can log in, check the nature of the job, and look at the results it produced.
FIL+: part 4 - moderation
Once the moderator has used the dashboard to decide whether the job is useful, they can choose to "verify" the job. This decision is recorded in our centralized dashboard's database.
FIL+: part 5 - storage deals
Once the job has been moderated, the compute provider currently holding the results can initiate a storage deal with Evergreen.
Evergreen is a notary and has the power to award DataCap to storage deals as it sees fit. In this case, when the compute provider negotiates a storage deal with evergreen, evergreen will reach out to our centralized dashboard and ask if the results are part of a "verified" job. If so, evergreen will award DataCap to the storage deal.
Conclusion
We had a great time in Lisbon at the CoD Summit²: we ate a lot of Bacalhau (the fish), bemused at least one waiter, and had a lot of fun talking to people about Bacalhau (the decentralized compute network).
If you like the sound of what we are up to, be sure to check out our website and have a play with Bacalhau today!