Bacalhau Project Report – March 6, 2023
Multi-arch! Removing sharding, improving test reliability, and WASM cancellation.
More heavy lifting behind the scenes this week, with a ton of improvements to test reliability. Also some feature work landed!
Multi-arch support 🏹
WASM jobs can run anywhere, but Docker jobs can’t necessarily.
If a user has only pushed an ARM Docker image, which is increasingly common now that all the cool kids have M1/M2/M3 Macs, Bacalhau would previously still try to run it on an x86 compute node and then crap out with an obscure error. (Personally I went the other way and got a ThinkPad with a SIM slot, so now you can point and laugh at how 2023 is still not the year of Linux on the desktop. But I digress.)
With multi-arch support, each compute node now broadcasts which architectures it supports, and Docker jobs will only be run on nodes that can actually run them!
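To make the idea concrete, here is a minimal sketch of architecture-aware node selection. It is not Bacalhau's actual code: the `ComputeNode` and `DockerJob` types, their field names, and the `eligible_nodes` helper are all hypothetical, and it assumes the set of architectures an image was published for is already known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    # Hypothetical model of a compute node and the architectures it broadcasts.
    node_id: str
    architectures: frozenset  # e.g. frozenset({"amd64"})


@dataclass(frozen=True)
class DockerJob:
    # Hypothetical model of a Docker job; image_architectures is assumed known.
    image: str
    image_architectures: frozenset  # architectures the image was pushed for


def eligible_nodes(job, nodes):
    """Return the nodes that advertise at least one architecture the image supports."""
    return [n for n in nodes if n.architectures & job.image_architectures]


nodes = [
    ComputeNode("node-x86", frozenset({"amd64"})),
    ComputeNode("node-arm", frozenset({"arm64"})),
]
arm_only_job = DockerJob("example/arm-only:latest", frozenset({"arm64"}))

# Only the ARM node is offered the ARM-only image.
print([n.node_id for n in eligible_nodes(arm_only_job, nodes)])
```

With this kind of filter, an ARM-only image is never handed to an x86 node in the first place, so the obscure runtime error never gets a chance to happen.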
Removing sharding 💥
Bacalhau previously had a feature where you could give a job a CID with lots of files in it and specify a “glob pattern” that split the files from that CID across multiple executions, as a way to spread the work across multiple nodes.
It was a neat feature, but no one ever used it. The vast majority of the jobs that run on our network are just single job executions. What’s more, supporting sharded jobs throughout the codebase made the code quite a lot more complicated. We had multiple layers: each job generated many shards, and each shard generated many executions (based on concurrency). As we prepare for a lean, mean & reliable 1.0 release, we decided to strip out this complexity and eventually move the sharding feature up to a higher level once we start seeing user demand for it. This is nice because it means we can have a relatively simple low-level primitive, like “pods” in Kubernetes, and build more complex systems on top, in the scheduler code.
This has also allowed us to close a bunch of TODO items related to sharding, and it’s made the test suite more reliable too. As a former colleague of mine used to say, code is a liability. The less of it we have — while still delivering tremendous value to users — the better 😅
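For anyone who never used sharding, the two layers (jobs to shards, shards to executions) can be sketched roughly as follows. This is an illustration only, not the removed implementation: the function names and the `batch_size` parameter are assumptions, and the file names are made up.

```python
from fnmatch import fnmatchcase


def make_shards(files, glob_pattern, batch_size):
    """Layer 1: group the files matching the glob pattern into shards."""
    matched = sorted(f for f in files if fnmatchcase(f, glob_pattern))
    return [matched[i:i + batch_size] for i in range(0, len(matched), batch_size)]


def make_executions(shards, concurrency):
    """Layer 2: each shard is run `concurrency` times, giving (shard, replica) pairs."""
    return [
        (shard_index, replica)
        for shard_index in range(len(shards))
        for replica in range(concurrency)
    ]


files = ["images/a.png", "images/b.png", "images/c.png", "README.md"]
shards = make_shards(files, "images/*.png", batch_size=2)
executions = make_executions(shards, concurrency=2)

print(shards)           # two shards: [a, b] and [c]
print(len(executions))  # 2 shards x 2 replicas = 4 executions
```

Even in this toy form, every piece of state has to be tracked per shard and per execution, which hints at why carrying it through the whole codebase was expensive.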
WASM cancellation 🚫
Last week we added support for cancelling Docker jobs. By good fortune, this week we upgraded our wazero dependency, which added support for cancelling the context of WASM jobs (which run in-process). That meant we could extend job cancellation to WASM jobs as well. Nice!
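Why does in-process matter? A Docker job can be stopped from the outside by killing its container, but work running inside the host process needs a cancellation signal that the running code actually honours. The sketch below shows that general pattern with a cooperative cancel flag. It is an analogy rather than wazero's API or mechanism: `run_in_process` and `JobCancelled` are hypothetical names, and a `threading.Event` stands in for a cancellable context.

```python
import threading


class JobCancelled(Exception):
    """Raised when the cancel flag is observed before the work has finished."""


def run_in_process(steps, cancel_event):
    """Run callables in the host process, checking for cancellation between steps."""
    completed = 0
    for step in steps:
        if cancel_event.is_set():
            raise JobCancelled(f"cancelled after {completed} steps")
        step()
        completed += 1
    return completed


cancel = threading.Event()
log = []
# The second step simulates a user cancelling the job while it is running.
steps = [lambda: log.append("step 1"), cancel.set, lambda: log.append("step 3")]

try:
    run_in_process(steps, cancel)
except JobCancelled as exc:
    print(exc)  # cancelled after 2 steps
```

The third step never runs, which is the behaviour you want from cancellation: the job stops promptly without taking the host process down with it.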
What’s next? ⏩
Docs for Stable Airflow Operator
Streaming logs
Further Station integration
Insulated Jobs demo
Something very cool for the FVM launch 🤫
Questions/comments? Let us know!
Thanks for reading!
Your Humble Bacalhau Team