Bacalhau v1.7.0 - Day 4: Using AWS S3 Partitioning With Bacalhau
Bacalhau 1.7.1 simplifies S3 data processing with automated partitioning and built-in failure handling.
This is part of the 5 Days of Bacalhau 1.7 series! Make sure to go back to the start to catch all of them!
Day 2: Scaling Your Compute Jobs with Bacalhau Partitioned Jobs
Day 3: Streamlining Security: Simplifying Bacalhau's Authentication Model
Processing large datasets from S3 can be a challenge, particularly once the data grows beyond what a single machine can comfortably handle. The good news is that we made it a lot easier with Bacalhau!
We introduced generic partitioning in Bacalhau v1.7; the new S3 partitioning feature builds on it to distribute S3 objects automatically across multiple executions, complete with failure handling and independent retries of failed partitions.
Let's dive into how this changes the game for distributed data processing.
Why Partition at All?
Before this feature, processing large S3 datasets was challenging. You had to create multiple jobs or write custom code to split the work; without that effort, a single machine would slow down or fall over under its compute and disk throughput limits.
To solve this, you had to figure out which part of your job was running on which machine and then tell each part its position in the overall task. That was hard to manage. For example, if slice #3 of 8 failed, how would you know? How would you know which data slice #7 should handle? More generally: how would you see the big picture of the entire job?
The Power of Automated Partitioning
Bacalhau 1.7.1 orchestrates everything for you. You just choose your partitioning strategy, and each task automatically gets its assigned subset of S3 objects. Your code stays clean and focused on its main job. If a partition fails, Bacalhau automatically retries only that partition and keeps the results from the successful ones.
For example, suppose you run a job and obtain a result as follows:
Job with 5 partitions:
Partition 0: ✓ Completed
Partition 1: ✓ Completed
Partition 2: ✓ Completed
Partition 3: ✗ Failed -> Scheduled for retry
Partition 4: ✓ Completed
This means that Partition 3 has failed. Bacalhau will automatically retry just that partition while preserving the results of the four partitions that already completed.
Partitioning Strategies for Every Need
Let's now walk through the different partitioning strategies and the S3 scenarios where each one fits best.
No Partitioning: When Sharing Is Good
There are cases where every execution needs access to all the data and partitioning is not needed. Typical scenarios are:
Loading shared reference data
Processing configuration files
Running analysis that needs the complete dataset
In these cases, you can give every execution the whole dataset like so:
name: shared-reference-data
count: 3
...
tasks:
  - inputSources:
      - target: /data
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference-data/
            # No partition config - all executions see all files
In this case, no partition block is present, so the dataset is not split: every execution mounts the full contents of reference-data/ at /data. Also, type: s3 under the source field specifies the type of data source used for this task, meaning the input data comes from an S3-compatible storage system.
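To make this concrete, here is a minimal, hypothetical Python sketch of task code that could run inside each of the three executions. The file handling is illustrative only; the point is that every execution reads the same full set of files mounted at /data:

from pathlib import Path

# Every execution sees the complete reference dataset under /data,
# because the input source has no partition configuration.
reference_files = sorted(p for p in Path("/data").rglob("*") if p.is_file())

for path in reference_files:
    # Replace with whatever loading or validation your job actually needs.
    print(f"loaded {path.name} ({path.stat().st_size} bytes)")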
Object-Based Distribution: When Balance Matters
If you need to process many files without any specific grouping, object partitioning provides an even distribution of the load.
This solution is ideal for:
Processing large volumes of user uploads
Handling randomly named files
Large-scale data transformation tasks
Here is how Bacalhau handles this for you:
name: process-uploads
count: 5
...
tasks:
  - inputSources:
      - target: /uploads
        source:
          type: s3
          params:
            bucket: data-bucket
            key: user-uploads/
            partition:
              type: object
In this case, count: 5 runs five executions, and the partition block with type: object distributes the S3 objects evenly across them.
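Because each execution only receives its share of the objects under /uploads, the processing code itself never needs any partition logic. Here is a minimal, hypothetical Python sketch of such a partition-unaware script; the transformation step is just a placeholder:

from pathlib import Path

# This execution only sees its own subset of user-uploads/ under /uploads;
# Bacalhau decided the assignment, so no partition bookkeeping is needed here.
for path in sorted(Path("/uploads").rglob("*")):
    if path.is_file():
        data = path.read_bytes()
        # Replace with your real transformation logic.
        print(f"processed {path.name}: {len(data)} bytes")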
Processing by Date: Time-Series Analysis
Time-series analysis is both a blessing and a curse for every data professional, and often more of a curse than a blessing!
With Bacalhau, you can use partitioning to process each day's data in parallel. This is the perfect case for:
Daily analytics processing
Log aggregation and analysis
Time-series computations
Here is how you can do so:
name: daily-log-analysis
count: 7 # Process a week's worth of logs in parallel
...
tasks:
  - inputSources:
      - target: /logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: "logs/*"
            partition:
              type: date
              dateFormat: "2006-01-02"
In this case, count: 7 runs seven executions, one per day of the week. The dateFormat value "2006-01-02" is Go's reference-time layout, which simply means dates written as YYYY-MM-DD.
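To picture the intent of the date strategy, here is a small, self-contained Python sketch. It assumes, purely for illustration, that each object key embeds its date in the YYYY-MM-DD layout declared by dateFormat; objects from the same day group together, and each group goes to one of the seven executions:

import re
from collections import defaultdict

# Hypothetical object keys under the logs/ prefix, one directory per day.
keys = [
    "2024-03-01/app.log",
    "2024-03-01/api.log",
    "2024-03-02/app.log",
]

by_day = defaultdict(list)
for key in keys:
    day = re.search(r"\d{4}-\d{2}-\d{2}", key).group(0)
    by_day[day].append(key)

for day, members in sorted(by_day.items()):
    print(day, members)  # each day's group maps to one of the 7 executions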
Processing by Region: Geographic Analysis
Geographic analysis is another scenario that often involves large volumes of data. Partitioning lets you distribute the processing by region, which enables scenarios like:
Regional sales analysis
Geographic data processing
Territory-specific reporting
Here is how you can manage this in Bacalhau:
name: regional-analysis
count: 3 # One execution per region
...
tasks:
  - inputSources:
      - target: /sales
        source:
          type: s3
          params:
            bucket: global-sales
            key: "regions/*"
            partition:
              type: regex
              pattern: "([^/]+)/.*"
For example, if you have data in regions/NA/, regions/EU/, regions/APAC/, etc., each execution will process one region's worth of data. The pattern "([^/]+)/.*" is a standard regex that does the following:
([^/]+): matches and captures one or more characters that are not a forward slash (/). This is the first capturing group, and its captured value identifies the partition.
/.*: matches a forward slash (/) followed by zero or more characters (.*).
As a result, if the S3 key is regions/europe/sales.csv, the regex captures europe.
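If you want to sanity-check the pattern before submitting the job, a quick Python snippet reproduces the capture. It is not part of the job spec, and it assumes the pattern is applied to keys relative to the regions/ prefix, as in the example above:

import re

# The same pattern used in the job spec above.
pattern = re.compile(r"([^/]+)/.*")

# Hypothetical keys, shown relative to the regions/ prefix.
for key in ["europe/sales.csv", "NA/q1/sales.csv", "APAC/summary.json"]:
    match = pattern.match(key)
    print(key, "->", match.group(1) if match else "no match")
# europe/sales.csv -> europe
# NA/q1/sales.csv -> NA
# APAC/summary.json -> APAC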
Processing by Customer Segment
Another typical use of partitioning is customer segmentation. Common analysis scenarios include:
Customer cohort analysis
Segment-specific processing
Category-based computations
You can handle your analysis with Bacalhau partitioning as follows:
name: segment-analytics
count: 4
...
tasks:
  - inputSources:
      - target: /segments
        source:
          type: s3
          params:
            bucket: customer-data
            key: segments/*
            partition:
              type: substring
              startIndex: 0
              endIndex: 3
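Here, the substring strategy groups objects by a fixed slice of the object key, defined by startIndex and endIndex, so keys that start with the same segment code land in the same partition. Below is a small, hypothetical Python sketch of that grouping idea; the segment codes are invented, keys are shown relative to the segments/ prefix, and Bacalhau's exact slicing semantics may differ slightly:

from collections import defaultdict

# Hypothetical object keys, shown relative to the segments/ prefix.
keys = ["VIP-0001.csv", "VIP-0002.csv", "STD-0001.csv", "NEW-0001.csv"]

# Mirrors partition type: substring with startIndex: 0 and endIndex: 3:
# keys sharing the same leading characters are grouped together.
groups = defaultdict(list)
for key in keys:
    groups[key[0:3]].append(key)

for prefix, members in sorted(groups.items()):
    print(prefix, members)
# NEW ['NEW-0001.csv']
# STD ['STD-0001.csv']
# VIP ['VIP-0001.csv', 'VIP-0002.csv']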
Combining Partitioned and Shared Inputs
In certain cases, you may need a Bacalhau job to process partitioned data while also sharing reference data that every execution needs to access. Common scenarios are:
Processing daily logs with shared lookup tables
Analyzing data using common reference files
Running calculations that need both partitioned data and shared configuration
As an example, consider this job that combines static reference data with daily logs partitioned by date:
name: daily-analysis
count: 7 # Process a week of data
...
tasks:
  - inputSources:
      - target: /config
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference/*
            # No partitioning - all executions see all reference data
      - target: /daily-logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: logs/*
            partition:
              type: date
              dateFormat: "2006-01-02"
This spec partitions only the /daily-logs input, splitting a week of logs across seven executions with count: 7. The reference data mounted at /config, on the other hand, is not partitioned because its input source has no partition block, so every execution sees all of it.
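Inside each execution, the task code can then treat the two mounts differently: read the shared reference data in full and process only the day of logs it was given. Here is a minimal, hypothetical Python sketch; the lookup.json file name and the log format are invented for illustration:

import json
from collections import Counter
from pathlib import Path

# Shared input: every execution sees the same files under /config.
# lookup.json is a hypothetical file name used only for illustration.
lookup = json.loads(Path("/config/lookup.json").read_text())

# Partitioned input: this execution only sees its own day's logs.
counts = Counter()
for log_file in sorted(Path("/daily-logs").rglob("*")):
    if log_file.is_file():
        for line in log_file.read_text().splitlines():
            event = line.split(" ", 1)[0]  # hypothetical "<event> ..." log format
            counts[lookup.get(event, "unknown")] += 1

print(dict(counts))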
Why This Changes Your Large Data Set Processing
As you have seen, this feature is simple yet powerful. You no longer need to write partition-aware code: just clean, focused processing logic with automatic data assignment. We've already tested scaling to over 1,000 partitions, with no code changes needed and automatic load balancing. But tell us if you'd like us to go even further!
Getting Started With S3 Partitioning
If you’d like to try this example on your own, dive right in! Install Bacalhau and give it a shot.
By the way, if you don’t have a network and you would still like to try it out, we recommend using Expanso Cloud. Also, if you'd like to set up a cluster on your own, you can do that too (we have setup guides for AWS, GCP, Azure, and many more 🙂).
What's Next?
Start processing your S3 data today:
Identify your natural data groupings (dates, regions, categories)
Choose the matching partition strategy
Let Bacalhau handle the distribution
Ready to simplify your distributed data processing? Check out our documentation for more examples and detailed guides.
Join our community to share your data processing stories and learn from others!
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us or get your license on Expanso Cloud!