Reading and Writing from Any S3-Compatible Data Store with Bacalhau
In the Bacalhau 1.0 release, we introduced integration with S3-compatible storage systems, including AWS S3, MinIO, Ceph, and SeaweedFS. This integration removes the complexity of data retrieval: you can read data stored in S3 buckets and other S3-compatible stores directly into your Bacalhau workflows, and write your results back to S3 once your processing job is complete.
In this blog post, we will explore the importance of Bacalhau's integration with S3-compatible data stores and walk you through a practical example to showcase how to make the most of this powerful feature. Whether you are a seasoned data engineer or just starting with Bacalhau, you'll find valuable insights and step-by-step instructions to enhance your data processing workflows.
Why Is This Useful?
Effortless Data Access: Bacalhau's integration with S3-compatible data stores simplifies access to data stored in S3 buckets or other S3-compatible storage systems. You can retrieve data from these sources and incorporate it into your Bacalhau workflows without building complex data retrieval mechanisms.
Streamlined Data Processing: By seamlessly integrating with S3-compatible data stores, Bacalhau enables efficient and streamlined data processing. You can directly process data from S3 buckets within your Bacalhau jobs, eliminating the need for manual data transfer or intermediate storage steps. This significantly reduces processing time and enhances overall workflow efficiency.
The Power of Bacalhau with S3-Compatible Data Stores in Practice
Let's dive into a practical example to demonstrate how to leverage Bacalhau's integration with S3-compatible data stores. Consider a scenario where you need to copy data from an S3 bucket to a public storage solution like IPFS. In this example, we will demonstrate the step-by-step process of scraping links from a public AWS S3 bucket and copying the data to IPFS using Bacalhau.
Prerequisites:
Before getting started, ensure that you have the Bacalhau client installed on your system. If you haven't done so, please refer to the Bacalhau documentation for installation instructions.
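If you haven't installed it yet, the one-line installer from the Bacalhau documentation is the quickest route. It is shown here as a convenience; please verify the command against the current docs before running it:
curl -sL https://get.bacalhau.org/install.sh | bash
bacalhau version   # confirm the client is installed and reachable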
Step 1: Running the Bacalhau Job
To run the Bacalhau job, execute the following command:
bacalhau docker run \
-i "s3://<bucket-name>/<prefix>:/inputs,opt=region=<region>" \
--id-only \
--wait \
<docker-image> \
-- sh -c "cp -r /inputs/* /outputs/"
Let's break down the components of the command:
bacalhau docker run: Initiates the Bacalhau job using a Docker container.
-i "s3://<bucket-name>/<prefix>:/inputs,opt=region=<region>": Specifies the S3 bucket and object prefix as the input for the job. Bacalhau will download all objects that match the given prefix from the specified bucket and mount them under the /inputs directory inside the Docker container. Replace <bucket-name>, <prefix>, and <region> with your own values.
--id-only: Instructs Bacalhau to output only the job ID.
--wait: Waits for the job to complete before returning.
<docker-image>: Specifies the Docker image to use for the Bacalhau job.
-- sh -c "cp -r /inputs/* /outputs/": Executes the command inside the Docker container to copy all files from the /inputs directory to the /outputs directory. By default, the contents of the /outputs directory are published to the specified destination, which is IPFS in this case.
Upon running the command, Bacalhau will provide the job ID as the output.
N.B. Store this job ID in an environment variable for future reference.
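For example, you can capture the job ID directly when submitting the job. The bucket name, prefix, region, and image below are placeholders for illustration, not a real dataset:
export JOB_ID=$(bacalhau docker run \
  -i "s3://my-example-bucket/my-prefix/:/inputs,opt=region=us-east-1" \
  --id-only \
  --wait \
  ubuntu \
  -- sh -c "cp -r /inputs/* /outputs/")
echo $JOB_ID   # the captured job ID, used in the following steps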
Step 2: Checking the Job Status
You can check the status of your Bacalhau job by using the bacalhau list command:
bacalhau list --id-filter <job-id> --wide
Replace <job-id> with the job ID obtained from Step 1. The command will display detailed information about the job, including its current status.
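Using the JOB_ID environment variable captured in Step 1, the check looks like this; bacalhau describe is an optional follow-up for fuller job details, assuming your client version includes it:
bacalhau list --id-filter $JOB_ID --wide
bacalhau describe $JOB_ID   # optional: fuller detail about the job and its executions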
Step 3: Retrieving Job Results
Once the job is completed, you can retrieve the results using the bacalhau get command:
bacalhau get <job-id> --output-dir <output-directory>
Replace <job-id> with the job ID from Step 1 and <output-directory> with the directory where you want to store the results.
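For example, with the job ID stored in JOB_ID and a local results directory of your choice:
bacalhau get $JOB_ID --output-dir ./results
ls -R ./results   # inspect the downloaded results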
Step 4: Viewing Job Output
To view the job output, navigate to the output directory specified in Step 3. The output may contain the scraped links or any processed data, depending on your job configuration.
Step 5: Publishing Results
To publish your results back to your S3 data store, define a PublisherSpec configuration to specify the S3 bucket and object key where the results should be stored.
Example configuration for publishing to S3:
type PublisherSpec struct {
    Type   Publisher              `json:"Type,omitempty"`   // publisher type, e.g. S3 or IPFS
    Params map[string]interface{} `json:"Params,omitempty"` // publisher-specific parameters, such as Bucket and Key for S3
}
For Amazon S3, you can specify the PublisherSpec configuration as shown below:
PublisherSpec:
  Type: S3
  Params:
    Bucket: <bucket>        # The bucket where results will be stored
    Key: <object-key>       # The object key (supports dynamic naming using placeholders)
    Compress: <true/false>  # Whether to publish results as a single gzip file (default: false)
    Endpoint: <optional>    # Optionally specify the S3 endpoint
    Region: <optional>      # Optionally specify the S3 region
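As a concrete illustration, a filled-in spec might look like the following; the bucket name and key pattern are placeholders:
PublisherSpec:
  Type: S3
  Params:
    Bucket: my-results-bucket       # placeholder bucket name
    Key: results/{date}/{jobID}     # dynamic key using the supported placeholders
    Compress: true                  # publish results as a single gzip file
    Region: us-east-1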
Example Usage
Let's explore some examples to illustrate how you can use this feature.
Publishing results to S3 using default settings:
bacalhau docker run -p s3://<bucket>/<object-key> ubuntu ...
Publishing results to S3 with a custom endpoint and region:
bacalhau docker run -p s3://<bucket>/<object-key>,opt=endpoint=http://s3.example.com,opt=region=us-east-1 ubuntu ...
Publishing results to S3 as a single compressed file:
bacalhau docker run -p s3://<bucket>/<object-key>,opt=compress=true ubuntu ...
Utilizing naming placeholders in the object key:
bacalhau docker run -p s3://<bucket>/result-{date}-{jobID} ubuntu ...
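Putting the pieces together, a single job can read its input from one S3 bucket and publish its results to another. The bucket names, prefix, region, and image below are placeholders for illustration:
bacalhau docker run \
  -i "s3://my-source-bucket/my-prefix/:/inputs,opt=region=us-east-1" \
  -p "s3://my-results-bucket/result-{date}-{jobID},opt=region=us-east-1" \
  ubuntu \
  -- sh -c "cp -r /inputs/* /outputs/"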
With this feature in place, you can easily access and process data from S3-compatible storage systems, eliminating complexities and saving valuable time. Whether you are managing large datasets or optimizing workflows, Bacalhau empowers you to take data processing to new heights.
Experience the future of data processing with Bacalhau today and let your data work for you like never before. Unlock unparalleled opportunities in the data realm and elevate your capabilities. Embrace the power of Bacalhau and start your data journey now!