We are very excited to announce that building production-grade data and machine learning workflows is now easier than ever, thanks to Bacalhau’s new integration with Flyte!
Flyte is an intuitive Python framework that simplifies the development and execution of ML pipelines. With Flyte, you can easily build and run complex workflows, and now you can also leverage the distributed computing and scalable infrastructure that Bacalhau provides.
Bacalhau and Flyte solve common challenges with data pipelines
Building data pipelines can be challenging for several reasons:
Data flow: Data pipelines often involve multiple stages such as data extraction, transformation, and loading. Ensuring that data is processed in the correct order and that all necessary dependencies are met can be complex, especially in pipelines with multiple branches or parallel processing.
Scalability: As the volume of data grows, the pipeline needs to scale accordingly. Designing a scalable pipeline that can handle large amounts of data and processing requirements can be a significant challenge.
Monitoring and error-handling: Pipelines need robust monitoring systems to detect any issues or bottlenecks in real-time and take appropriate actions to ensure smooth and uninterrupted data flow. Implementing error handling mechanisms and strategies to handle exceptions and failures gracefully is essential for maintaining the reliability of the pipeline.
Flyte is a fantastic open-source tool for solving these challenges! However, common deployments of data pipelines often still require data and models to be centralised in one place or only operate reliably when deployed to a single cloud and region.
Using Bacalhau’s distributed orchestration features, machine learning workflows can now span multiple regions and clouds and access data wherever it is stored, more easily than ever before!
Multi-cloud workflows: Bacalhau will automatically run tasks wherever compute resources are available. If you have specialised hardware in one cloud but cheaper resources in another, Bacalhau can intelligently schedule each Flyte task wherever it needs to be executed. A single workflow can now combine execution on AWS, Azure, or Google Cloud with native data-center or edge computation.
Data-local orchestration: Bacalhau tasks can detect where the required input data is located and schedule Flyte tasks intelligently to minimise slow data transfer or expensive cloud egress. Flyte workflows can now run each task as close to the data as possible.
Variable trust tasks: Bacalhau can apply different access controls to different parts of the Flyte workflow. Tasks that access sensitive or personal data can automatically ensure that no secret data is leaked and that the data is being used appropriately. The inputs and outputs of these high-trust jobs can be combined with low-trust execution on local machines or public nodes, maximising capability while minimising the amount of computation that has to run near sensitive data.
What’s more, all of these tasks integrate seamlessly with other Flyte tasks! So if you need data-local orchestration for some of your workflow and want to rely on traditional execution for the rest, Flyte allows the two styles of task to intermingle freely.
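For instance, a single Flyte workflow might mix an ordinary Python task with a Bacalhau task. The sketch below assumes the flytekitplugins-bacalhau package introduced in the next section; the task names and spec values are purely illustrative:

from flytekit import task, workflow, kwtypes
from flytekitplugins.bacalhau import BacalhauTask

# An ordinary Flyte task that runs on your existing Flyte backend.
@task
def prepare_message() -> str:
    return "Flyte is awesome!"

# A Bacalhau task that Bacalhau schedules onto a suitable node.
remote_echo = BacalhauTask(name="remote_echo", inputs=kwtypes(spec=dict, api_version=str))

@workflow
def mixed_wf() -> str:
    msg = prepare_message()
    echo_job = remote_echo(
        api_version="V1beta1",
        spec=dict(
            engine="Docker",
            docker=dict(image="ubuntu:latest", entrypoint=["echo", "hello from Bacalhau"]),
        ),
    )
    return msg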
Decentralised ML and data workflows are just one install away
Getting started with Bacalhau and Flyte is as simple as adding our new Python package to your existing Flyte installation:
pip install flytekitplugins-bacalhau
from flytekit import workflow, task, kwtypes
from flytekitplugins.bacalhau import BacalhauTask

bacalhau_types = kwtypes(spec=dict, api_version=str)

bacalhau_task = BacalhauTask(name="hello_world", inputs=bacalhau_types)


@workflow
def wf():
    bac_task = bacalhau_task(
        api_version="V1beta1",
        spec=dict(
            engine="Docker",
            PublisherSpec={"type": "IPFS"},
            docker=dict(
                image="ubuntu:latest",
                entrypoint=["echo", "Flyte is awesome!"],
            ),
        ),
    )
The code block above defines a Bacalhau task and a workflow using Flyte. The Bacalhau task will execute a Docker container with the Ubuntu image and print the text "Flyte is awesome!".
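If you save the snippet to a file (the name wf.py below is just an assumption), you should be able to execute the workflow with Flyte's standard pyflyte CLI, either locally or against a Flyte cluster with the --remote flag. In both cases the Bacalhau task still needs to be able to reach a Bacalhau network:

pyflyte run wf.py wf
pyflyte run --remote wf.py wf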
To see some more complex examples, including how to chain multiple Bacalhau tasks together, check out the examples in our integration repository.
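As a quick taste of chaining, one possible pattern is to use flytekit's explicit >> chaining operator to make one Bacalhau task wait for another. The sketch below only enforces ordering and does not pass data between the two jobs; the repository examples show patterns that do:

from flytekit import workflow, kwtypes
from flytekitplugins.bacalhau import BacalhauTask

first = BacalhauTask(name="first_step", inputs=kwtypes(spec=dict, api_version=str))
second = BacalhauTask(name="second_step", inputs=kwtypes(spec=dict, api_version=str))

@workflow
def chained_wf():
    a = first(api_version="V1beta1", spec=dict(
        engine="Docker",
        docker=dict(image="ubuntu:latest", entrypoint=["echo", "step one"]),
    ))
    b = second(api_version="V1beta1", spec=dict(
        engine="Docker",
        docker=dict(image="ubuntu:latest", entrypoint=["echo", "step two"]),
    ))
    # Run "second_step" only after "first_step" has finished.
    a >> b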
Get the best of both worlds with Bacalhau and Flyte
Our new integration between Bacalhau and Flyte allows users to leverage Bacalhau's distributed orchestration features to build and run machine learning workflows across multiple regions and clouds, all while avoiding bandwidth limits and egress costs. Bacalhau allows Flyte workflows to scale in a whole new way!
Get started today and show us what you’re building on the Bacalhau Slack.
Be Part of the Evolution
We invite our community to experience the enhanced flexibility of Bacalhau 1.1. Dive into our updated documentation, explore the new job types, and let us know your feedback. Collectively, we're redefining the boundaries of distributed compute frameworks.
Your journey with Bacalhau is just beginning, and the horizon has never looked brighter. The Bacalhau project is available today as Open Source Software. The public GitHub repo can be found here. If you like the project, please give us a star ⭐ 🙂
We're looking for help in several areas. If you're interested in helping out, there are many ways to contribute, and you can always reach out to us via Slack or Email.
For more information, see: