Streamline Data Projects with Bacalhau Python SDK
In the Bacalhau 1.0 release, we introduced the Bacalhau Python SDK to enable developers to interact with Bacalhau's API endpoints. With the SDK, developers can perform job management tasks and integrate Bacalhau into their Python projects with ease.
Why use the Bacalhau Python SDK?
Simplified Job Management: The SDK makes it easy to interact with Bacalhau's endpoints. You can create, list, and inspect Bacalhau jobs using Python objects, making it easier to integrate Bacalhau into your Python projects.
Automatic Dependency Installation: By using the SDK, you don't need to install the lower-level bacalhau-apiclient library separately. The bacalhau-sdk package automatically installs bacalhau-apiclient as a dependency, ensuring that you have all the necessary components to interact with Bacalhau.
Multi-Network Support: The SDK also offers the flexibility to target any Bacalhau network. By setting the appropriate environment variables, such as BACALHAU_API_HOST and BACALHAU_API_PORT, you can work with different Bacalhau networks based on your requirements.
Key-Pair Generation: Authentication is made simpler with the SDK. If a key-pair is not found in the specified directory (BACALHAU_DIR), the SDK automatically generates one for you. This streamlines the authentication process, ensuring a smooth developer experience.
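For instance, targeting a specific Bacalhau network only requires setting those environment variables before the SDK is used. A minimal sketch; the host and port values below are placeholders for illustration, not an official endpoint:

```python
import os

# Point the SDK at a specific Bacalhau network before making any calls.
# Both values here are placeholders; substitute your network's endpoint.
os.environ["BACALHAU_API_HOST"] = "api.example-bacalhau-network.org"
os.environ["BACALHAU_API_PORT"] = "1234"

# Any bacalhau_sdk call made after this point will target that endpoint.
```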
Practical Example
Let's dive into a practical example to demonstrate how to leverage the Bacalhau Python SDK:
Installation:
Suppose you want to submit a job that performs a complex computation using a Docker image. First, you need to install the Bacalhau Python SDK:
From PyPI:
pip install bacalhau-sdk
From source:
git clone https://github.com/bacalhau-project/bacalhau/
cd bacalhau/python/
pip install .
Initialization:
Similar to the Bacalhau CLI, the Bacalhau Python SDK requires a key-pair to sign requests. If a key-pair is not found in the specified directory (BACALHAU_DIR), the SDK will automatically create one for you.
“Hello World”:
Let’s submit a "Hello World" job using the Bacalhau Python SDK and then fetch the output data's CID (content identifier). We start by importing bacalhau_sdk, which is used to create and submit a job create request. Then we import bacalhau_apiclient (installed automatically with the SDK), which provides the various object models that compose a job create request. These are used to populate a simple Python dictionary that is then passed to the submit util method.
import pprint

from bacalhau_sdk.api import submit
from bacalhau_sdk.config import get_client_id
from bacalhau_apiclient.models.storage_spec import StorageSpec
from bacalhau_apiclient.models.spec import Spec
from bacalhau_apiclient.models.job_spec_language import JobSpecLanguage
from bacalhau_apiclient.models.job_spec_docker import JobSpecDocker
from bacalhau_apiclient.models.job_sharding_config import JobShardingConfig
from bacalhau_apiclient.models.job_execution_plan import JobExecutionPlan
from bacalhau_apiclient.models.publisher_spec import PublisherSpec
from bacalhau_apiclient.models.deal import Deal

data = dict(
    APIVersion='V1beta1',
    ClientID=get_client_id(),
    Spec=Spec(
        engine="Docker",
        verifier="Noop",
        publisher_spec=PublisherSpec(type="Estuary"),
        docker=JobSpecDocker(
            image="ubuntu",
            entrypoint=["echo", "Hello World!"],
        ),
        language=JobSpecLanguage(job_context=None),
        wasm=None,
        resources=None,
        timeout=1800,
        outputs=[
            StorageSpec(
                storage_source="IPFS",
                name="outputs",
                path="/outputs",
            )
        ],
        sharding=JobShardingConfig(
            batch_size=1,
            glob_pattern_base_path="/inputs",
        ),
        execution_plan=JobExecutionPlan(shards_total=0),
        deal=Deal(concurrency=1, confidence=0, min_bids=0),
        do_not_track=False,
    ),
)

pprint.pprint(submit(data))
The script above prints the following object; the job.metadata.id value is our newly created job ID!
{'job': {'api_version': 'V1beta1',
         'metadata': {'client_id': 'bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe',
                      'created_at': '2023-02-01T19:30:21.405209538Z',
                      'id': '710a0bc2-81d1-4025-8f80-5327ca3ce170'},
         'spec': {'Deal': {'Concurrency': 1},
                  'Docker': {'Entrypoint': ['echo', 'Hello World!'],
                             'Image': 'ubuntu'},
                  'Engine': 'Docker',
                  'ExecutionPlan': {'ShardsTotal': 1},
                  'Language': {'JobContext': {}},
                  'Network': {'Type': 'None'},
                  'Publisher': 'Estuary',
                  'Resources': {'GPU': ''},
                  'Sharding': {'BatchSize': 1,
                               'GlobPatternBasePath': '/inputs'},
                  'Timeout': 1800,
                  'Wasm': {'EntryModule': {}},
                  'outputs': [{'Name': 'outputs',
                               'StorageSource': 'IPFS',
                               'path': '/outputs'}]},
         'status': {'JobState': {},
                    'Requester': {'RequesterNodeID': 'QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL',
                                  'RequesterPublicKey': 'CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDVRKPgCfY2fgfrkHkFjeWcqno+MDpmp8DgVaY672BqJl/dZFNU9lBg2P8Znh8OTtHPPBUBk566vU3KchjW7m3uK4OudXrYEfSfEPnCGmL6GuLiZjLf+eXGEez7qPaoYqo06gD8ROdD8VVse27E96LlrpD1xKshHhqQTxKoq1y6Rx4DpbkSt966BumovWJ70w+Nt9ZkPPydRCxVnyWS1khECFQxp5Ep3NbbKtxHNX5HeULzXN5q0EQO39UN6iBhiI34eZkH7PoAm3Vk5xns//FjTAvQw6wZUu8LwvZTaihs+upx2zZysq6CEBKoeNZqed9+Tf+qHow0P5pxmiu+or+DAgMBAAE='}}}}
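If you prefer to capture the job ID programmatically rather than reading it off the printed object, you can index into the response. A minimal sketch, assuming the response has been converted to a plain dictionary shaped like the output above; the trimmed-down dictionary below stands in for the real return value:

```python
# A trimmed stand-in for the (dict-shaped) response printed above;
# in a real script you would use submit()'s return value instead.
response = {
    "job": {
        "api_version": "V1beta1",
        "metadata": {
            "client_id": "bae9c3b2adfa04cc647a2457e8c0c605cef8ed93bdea5ac5f19f94219f722dfe",
            "created_at": "2023-02-01T19:30:21.405209538Z",
            "id": "710a0bc2-81d1-4025-8f80-5327ca3ce170",
        },
    },
}

# Navigate job -> metadata -> id to get the newly created job's ID.
job_id = response["job"]["metadata"]["id"]
print(job_id)  # 710a0bc2-81d1-4025-8f80-5327ca3ce170
```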
We can then use the results method to fetch, among other fields, the output data's CID.
from bacalhau_sdk.api import results
print(results(job_id="710a0bc2-81d1-4025-8f80-5327ca3ce170"))
The above prints the following dictionary:
{'results': [{'data': {'cid': 'QmYEqqNDdDrsRhPRShKHzsnZwBq3F59Ti3kQmv9En4i5Sw',
                       'metadata': None,
                       'name': 'job-710a0bc2-81d1-4025-8f80-5327ca3ce170-shard-0-host-QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3',
                       'path': None,
                       'source_path': None,
                       'storage_source': 'IPFS',
                       'url': None},
              'node_id': 'QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3',
              'shard_index': None}]}
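To pull just the CID out of that structure programmatically, index into the first result. A minimal sketch, again assuming the response has been converted to a plain dictionary like the one printed above; the trimmed dictionary stands in for the real return value:

```python
# A trimmed stand-in for the results payload printed above;
# in a real script you would use results(job_id=...)'s return value.
results_response = {
    "results": [
        {
            "data": {
                "cid": "QmYEqqNDdDrsRhPRShKHzsnZwBq3F59Ti3kQmv9En4i5Sw",
                "storage_source": "IPFS",
            },
            "node_id": "QmYgxZiySj3MRkwLSL4X2MF5F9f2PMhAE3LV49XkfNL1o3",
        }
    ]
}

# The output data's CID lives under the first result's "data" entry.
cid = results_response["results"][0]["data"]["cid"]
print(cid)  # QmYEqqNDdDrsRhPRShKHzsnZwBq3F59Ti3kQmv9En4i5Sw
```

From here the CID can be handed to any IPFS tool or gateway to retrieve the job's output files.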
With its simplified job management and multi-network support, the Bacalhau Python SDK is a valuable tool for integrating Bacalhau into Python projects while harnessing the power of Bacalhau's distributed computing capabilities.