Bacalhau Project Report – Feb 27, 2023
Stable Airflow operator! IPFS unreliability fixed, job selection hooks in requester! Job cancellation. Improved distributed tracing. Work on Station support (JSON logging & execution counts)
Lots of behind-the-scenes improvements this week, such as upgrading dependencies and working on flaky tests. From a user-facing perspective, here are the highlights.
Stable Airflow Operator 💪
We have now finished development work on the stable Bacalhau Airflow integration. This is cool because the prototype of the Bacalhau Airflow integration relied on shelling out to the bacalhau CLI, which is not sustainable long-term. The new integration instead uses our shiny new Python SDK, so it makes API calls to Bacalhau directly.
For now, the code is here. Docs are coming soon!
IPFS unreliability improved 😅
We spoke a few weeks ago about the challenges we were having with IPFS stability in our “canary” tests. The canaries are our automated tests which run continuously against production, in order to alert us to issues on the production network. For several weeks, one of the canary tests, which downloads a fairly large (several tens of megabytes) file from IPFS as an input to a job, was mysteriously flaky. Much time, effort and gnashing of teeth went into diagnosing the problem and deploying possible fixes.
A few weeks after the fixes were deployed, the bloody thing fixed itself.
How? Well, one of the problems was that every canary also connected to IPFS as an IPFS server, and ended up polluting the IPFS DHT (the giant distributed hash table that tells IPFS clients where to look to download a certain dataset) with the canaries’ own addresses as possible places to download the data from. The problem, of course, is that the canaries are ephemeral IPFS servers that only run for a few seconds, inside AWS Lambda!
So why did it start working a few weeks after we put the fixes in? Well, one of the fixes was to correctly close the connection to the upstream IPFS servers so the ephemeral instances get removed from the DHT. Seems like the DHT caches those servers for a few weeks, so it was only after they got cleaned up that the canaries started working again. Phew! 😅
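To make the failure mode concrete, here’s a toy Python model of a DHT provider table with record expiry. This is purely illustrative (it is not how the real IPFS DHT is implemented, and the TTL value is made up): a node that announces itself as a provider but never withdraws its record keeps showing up in lookups until the record’s TTL expires, long after the node is gone.

```python
class ToyDHT:
    """Toy model of a DHT provider table with record expiry (illustrative only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.providers = {}  # cid -> {peer_id: announce_time}

    def announce(self, cid, peer_id, now):
        self.providers.setdefault(cid, {})[peer_id] = now

    def withdraw(self, cid, peer_id):
        # What the fix does: cleanly remove our own record on shutdown.
        self.providers.get(cid, {}).pop(peer_id, None)

    def lookup(self, cid, now):
        # Stale records linger until their TTL expires.
        records = self.providers.get(cid, {})
        return sorted(p for p, t in records.items() if now - t < self.ttl)


dht = ToyDHT(ttl_seconds=14 * 24 * 3600)  # pretend records live ~2 weeks
dht.announce("QmData", "stable-server", now=0)
dht.announce("QmData", "ephemeral-canary", now=0)

# Without the fix: the long-dead canary still appears as a provider for weeks.
assert dht.lookup("QmData", now=3600) == ["ephemeral-canary", "stable-server"]

# With the fix: the canary withdraws itself on shutdown, so lookups stay clean.
dht.withdraw("QmData", "ephemeral-canary")
assert dht.lookup("QmData", now=3600) == ["stable-server"]
```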
Job selection hooks in requester 🪝
We already supported job selection hooks in the compute node, but now we support them in the requester node as well. This gives the requester node the ability to filter the jobs that it accepts, so that it can take on a "gateway" role. This is a prerequisite to the “Insulated Jobs” feature we’re busily developing, which will allow two disjoint organizations to collaborate without sharing data: instead of sharing data with each other, which can be costly and fraught, they can share programs with each other to query over the other’s private dataset, and moderate the running of the programs and the sharing of the results.
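A sketch of what a requester-side selection policy could look like, in plain Python. This is not the actual Bacalhau hook API; the field names and policy inputs here are hypothetical, purely to show the idea of a gateway requester accepting only jobs it is willing to route:

```python
# Hypothetical policy inputs for a "gateway" requester (illustrative only).
ALLOWED_CLIENTS = {"org-a", "org-b"}
ALLOWED_ENGINES = {"docker", "wasm"}


def accept_job(job: dict) -> bool:
    """Return True if this requester node should accept the submitted job."""
    return (
        job.get("client_id") in ALLOWED_CLIENTS
        and job.get("engine") in ALLOWED_ENGINES
    )


assert accept_job({"client_id": "org-a", "engine": "docker"})       # known org, allowed engine
assert not accept_job({"client_id": "stranger", "engine": "docker"})  # unknown org: rejected
```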
More on this demo soon!
Job cancellation 🚫
Ever run a job on Bacalhau and then wished you hadn’t? Now you can cancel a job that you started. But you can’t cancel other people’s jobs!
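The ownership rule can be modelled in a few lines of Python. This is an illustrative model, not Bacalhau’s actual implementation or data model: only the client that submitted a job may cancel it, and a finished job can’t be cancelled at all.

```python
def cancel_job(job: dict, requesting_client: str) -> bool:
    """Cancel the job if the requesting client owns it; return whether it worked."""
    if job["client_id"] != requesting_client:
        return False  # you can't cancel other people's jobs
    if job["state"] in ("completed", "cancelled"):
        return False  # nothing left to cancel
    job["state"] = "cancelled"
    return True


job = {"client_id": "alice", "state": "running"}
assert not cancel_job(job, "bob")    # someone else's job: rejected
assert cancel_job(job, "alice")      # the owner: cancelled
assert job["state"] == "cancelled"
assert not cancel_job(job, "alice")  # already cancelled: no-op
```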
Improved distributed tracing 🫥
We had some support for distributed tracing in the code, but it had rotted and was missing in several places. It’s much better now, which will aid debuggability in the future!
Progress on Station support 👽
Station is a really cool project from Protocol Labs that enables running things like IPFS and Filecoin on everyone’s desktop computers. Soon it will allow users’ desktop computers to run Bacalhau nodes as well, and contribute to the global Bacalhau network!
In order to integrate Bacalhau with Station, we are making some changes to Bacalhau to enable it to run as a Station module. In particular, we’ve now added support for JSON logging and added an API so that Station can query how many jobs have been successfully executed on the node.
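The point of JSON logging is that a host application like Station can parse log lines by machine instead of scraping free-form text. Here’s a minimal sketch of the idea using Python’s stdlib logging (Bacalhau itself is written in Go, so this is the concept rather than the actual implementation, and the field names are made up):

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("bacalhau-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("job executed")

# A supervising process can now json.loads() each line instead of regex-parsing it.
parsed = json.loads(buf.getvalue().strip())
assert parsed["msg"] == "job executed"
assert parsed["level"] == "INFO"
```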
What’s next? ⏩
Docs for Stable Airflow Operator
Streaming logs
Further Station integration
Insulated Jobs demo
Questions/comments? Let us know!
Thanks for reading!
Your Humble Bacalhau Team