Streamline Data Projects with Bacalhau Python SDK

Favour Kelvin and Alexandra McCarroll
Jul 25, 2023

In the Bacalhau 1.0 release, we introduced the Bacalhau Python SDK to enable developers to interact with Bacalhau's API endpoints. With the SDK, developers can perform various job management tasks and integrate Bacalhau into their Python projects with ease.

Why use the Bacalhau Python SDK?

  • Simplified Job Management: The SDK makes it easy to interact with Bacalhau's endpoints. You can create, list, and inspect Bacalhau jobs using Python objects, making it easier to integrate Bacalhau into your Python projects.

  • Automatic Dependency Installation: By using the SDK, you don't need to worry about installing the lower-level bacalhau-apiclient library separately. The bacalhau-sdk automatically installs the bacalhau-apiclient as a dependency, ensuring that you have all the necessary components to interact with Bacalhau.

  • Multi-Network Support: The SDK also offers the flexibility to target any Bacalhau network. By setting the appropriate environment variables, such as BACALHAU_API_HOST and BACALHAU_API_PORT, you can work with different Bacalhau networks based on your requirements (see the sketch after this list).

  • Key-Pair Generation: Authentication is made simpler with the SDK. If a key-pair is not found in the specified directory (BACALHAU_DIR), the SDK automatically generates one for you. This streamlines the authentication process, ensuring a smooth developer experience.
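
As a minimal sketch of the multi-network and key-pair points above, you could set these variables before importing the SDK. The host, port, and directory values below are placeholder assumptions, not recommendations from this post:

import os

# Placeholder values: point the SDK at a specific Bacalhau network.
os.environ["BACALHAU_API_HOST"] = "127.0.0.1"  # assumed: a locally running node
os.environ["BACALHAU_API_PORT"] = "1234"       # assumed: that node's API port
# Directory where the SDK looks for (and, if missing, generates) a key-pair.
os.environ["BACALHAU_DIR"] = os.path.expanduser("~/.bacalhau")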

Practical Example

Let's dive into a practical example to demonstrate how to leverage the Bacalhau Python SDK:

Installation:

Suppose you want to submit a job that performs a complex computation using a Docker image. First, you need to install the Bacalhau Python SDK:

From PyPI:

pip install bacalhau-sdk

From source:

git clone https://github.com/bacalhau-project/bacalhau/

cd bacalhau/python/

pip install .

Initialization:

Similar to the Bacalhau CLI, the Bacalhau Python SDK requires a key-pair to sign requests. If a key-pair is not found in the specified directory (BACALHAU_DIR), the SDK will automatically create one for you.
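
As a quick way to see this in action (a minimal sketch; it assumes the config module initializes the key-pair the first time it is used, which matches how get_client_id is imported in the example below):

from bacalhau_sdk.config import get_client_id

# Looks for a key-pair under BACALHAU_DIR and generates one if none exists;
# the client ID printed here is derived from that key.
print(get_client_id())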

“Hello World”:

Let's submit a "Hello World" job using the Bacalhau Python SDK and then fetch the output data's CID (content identifier). We start by importing the submit util method from bacalhau_sdk, which creates and submits a job create request. Then we import bacalhau_apiclient (installed automatically with this SDK), which provides the object models that compose a job create request. These models populate a simple Python dictionary that is passed to the submit util method.

import pprint
from bacalhau_sdk.api import submit
from bacalhau_sdk.config import get_client_id
from bacalhau_apiclient.models.storage_spec import StorageSpec
from bacalhau_apiclient.models.spec import Spec
from bacalhau_apiclient.models.job_spec_language import JobSpecLanguage
from bacalhau_apiclient.models.job_spec_docker import JobSpecDocker
from bacalhau_apiclient.models.job_sharding_config import JobShardingConfig
from bacalhau_apiclient.models.job_execution_plan import JobExecutionPlan
from bacalhau_apiclient.models.publisher_spec import PublisherSpec
from bacalhau_apiclient.models.deal import Deal

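# Compose the job create request: a Docker "Hello World" job with IPFS outputs.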
data = dict(
    APIVersion='V1beta1',
    ClientID=get_client_id(),
    Spec=Spec(
        engine="Docker",
        verifier="Noop",
        publisher_spec=PublisherSpec(type="Estuary"),
        docker=JobSpecDocker(
            image="ubuntu",
            entrypoint=["echo", "Hello World!"],
        ),
        language=JobSpecLanguage(job_context=None),
        wasm=None,
        resources=None,
        timeout=1800,
        outputs=[
            StorageSpec(
                storage_source="IPFS",
                name="outputs",
                path="/outputs",
            )
        ],
        sharding=JobShardingConfig(
            batch_size=1,
            glob_pattern_base_path="/inputs",
        ),
        execution_plan=JobExecutionPlan(shards_total=0),
        deal=Deal(concurrency=1, confidence=0, min_bids=0),
        do_not_track=False,
    ),
)

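# Submit the job create request and pretty-print the API response.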
pprint.pprint(submit(data))

The script above prints the following object; the job.metadata.id value is our newly created job ID!

{'job': {'api_version': 'V1beta1',
         'metadata': {'client_id': 'bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe',
                      'created_at': '2023-02-01T19:30:21.405209538Z',
                      'id': '710a0bc2-81d1-4025-8f80-5327ca3ce170'},
         'spec': {'Deal': {'Concurrency': 1},
                  'Docker': {'Entrypoint': ['echo', 'Hello World!'],
                             'Image': 'ubuntu'},
                  'Engine': 'Docker',
                  'ExecutionPlan': {'ShardsTotal': 1},
                  'Language': {'JobContext': {}},
                  'Network': {'Type': 'None'},
                  'Publisher': 'Estuary',
                  'Resources': {'GPU': ''},
                  'Sharding': {'BatchSize': 1,
                               'GlobPatternBasePath': '/inputs'},
                  'Timeout': 1800,
                  'Wasm': {'EntryModule': {}},
                  'outputs': [{'Name': 'outputs',
                               'StorageSource': 'IPFS',
                               'path': '/outputs'}]},
         'status': {'JobState': {},
                    'Requester': {'RequesterNodeID': 'QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL',
                                  'RequesterPublicKey': 'CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDVRKPgCfY2fgfrkHkFjeWcqno+MDpmp8DgVaY672BqJl/dZFNU9lBg2P8Znh8OTtHPPBUBk566vU3KchjW7m3uK4OudXrYEfSfEPnCGmL6GuLiZjLf+eXGEez7qPaoYqo06gD8ROdD8VVse27E96LlrpD1xKshHhqQTxKoq1y6Rx4DpbkSt966BumovWJ70w+Nt9ZkPPydRCxVnyWS1khECFQxp5Ep3NbbKtxHNX5HeULzXN5q0EQO39UN6iBhiI34eZkH7PoAm3Vk5xns//FjTAvQw6wZUu8LwvZTaihs+upx2zZysq6CEBKoeNZqed9+Tf+qHow0P5pxmiu+or+DAgMBAAE='}}}}
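
If you want the job ID programmatically rather than by reading the printed output, you can capture the response instead. This is a small sketch appended to the script above; the attribute access assumes the swagger-style model objects that the printed repr suggests:

response = submit(data)
job_id = response.job.metadata.id  # e.g. '710a0bc2-81d1-4025-8f80-5327ca3ce170'
print(job_id)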

We can then use the results method to fetch, among other fields, the output data's CID.

from bacalhau_sdk.api import results

print(results(job_id="710a0bc2-81d1-4025-8f80-5327ca3ce170"))

The above prints the following dictionary:

{'results': [{'data': {'cid': 'QmYEqqNDdDrsRhPRShKHzsnZwBq3F59Ti3kQmv9En4i5Sw',
                       'metadata': None,
                       'name': 'job-710a0bc2-81d1-4025-8f80-5327ca3ce170-shard-0-host-QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3',
                       'path': None,
                       'source_path': None,
                       'storage_source': 'IPFS',
                       'url': None},
              'node_id': 'QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3',
              'shard_index': None}]}
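
Under the same assumption about model objects as above, a sketch for pulling the CID out of the response directly:

res = results(job_id="710a0bc2-81d1-4025-8f80-5327ca3ce170")
print(res.results[0].data.cid)  # e.g. 'QmYEqqNDdDrsRhPRShKHzsnZwBq3F59Ti3kQmv9En4i5Sw'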

With its simplified job management and multi-network support, the Bacalhau Python SDK is a valuable tool for integrating Bacalhau into Python projects while harnessing the power of Bacalhau's distributed computing capabilities.
