Bacalhau v1.7.0 - Day 4: Using AWS S3 Partitioning With Bacalhau
Bacalhau 1.7.1 simplifies S3 data processing with automated partitioning and built-in failure handling.
This is part of the 5 Days of Bacalhau 1.7 series! Make sure to go back to the start to catch all of them!
Day 2: Scaling Your Compute Jobs with Bacalhau Partitioned Jobs
Day 3: Streamlining Security: Simplifying Bacalhau's Authentication Model
Processing large datasets from S3 can be a challenge, particularly once the data grows beyond what a single machine can comfortably handle. The good news is that we made it a lot easier with Bacalhau!
We introduced generic partitioning in Bacalhau v1.7; the new S3 partitioning feature builds on it to distribute S3 objects automatically across multiple executions, complete with failure handling and independent retries of failed partitions.
Let's dive into how this changes the game for distributed data processing.
Why Partition at All?
Before this feature, processing large S3 datasets was challenging. You had to create multiple jobs or write custom code to split the work; without that effort, a single machine would slow down or fall over under its compute and disk throughput limits.
To solve this, you had to figure out which part of your job was running on which machine and then tell each part its position in the overall task. That was hard to manage. For example, if slice #3 of 8 failed, how would you know? How would you know which data slice #7 should handle? More generally: how would you see the big picture of the entire job?
The Power of Automated Partitioning
Bacalhau 1.7.1 orchestrates everything for you. You just choose your partitioning strategy, and each task automatically gets its assigned subset of S3 objects. Your code stays clean and focused on its main job. If a partition fails, Bacalhau automatically retries only that partition and keeps the results from the successful ones.
For example, suppose you run a job and obtain a result as follows:
Job with 5 partitions:
Partition 0: ✓ Completed
Partition 1: ✓ Completed
Partition 2: ✓ Completed
Partition 3: ✗ Failed -> Scheduled for retry
Partition 4: ✓ Completed
This means that Partition 3 has failed. Bacalhau will automatically retry just that partition while preserving the results of the four partitions that already completed.
Partitioning Strategies for Every Need
Let's now walk through the different partitioning strategies and the S3 scenarios where each one fits best.
No Partitioning: When Sharing Is Good
There are cases where every execution needs access to all the data and partitioning is not needed. Typical scenarios are:
Loading shared reference data
Processing configuration files
Running analysis that needs the complete dataset
In these cases, you can give every execution the whole dataset like so:
name: shared-reference-data
count: 3
...
tasks:
  - inputSources:
      - target: /data
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference-data/
            # No partition config - all executions see all files
In this case, no partition block is present, so the dataset is not split: every execution mounts the full contents of reference-data/ at /data. Also, type: s3 under the source field specifies the type of data source used for this task, meaning the input data comes from an S3-compatible storage system.
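To make this concrete, here is a minimal, hypothetical Python sketch of task code that could run inside each of the three executions. The file handling is illustrative only; the point is that every execution reads the same full set of files mounted at /data:

from pathlib import Path

# Every execution sees the complete reference dataset under /data,
# because the input source has no partition configuration.
reference_files = sorted(p for p in Path("/data").rglob("*") if p.is_file())

for path in reference_files:
    # Replace with whatever loading or validation your job actually needs.
    print(f"loaded {path.name} ({path.stat().st_size} bytes)")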
Object-Based Distribution: When Balance Matters
If you need to process many files without any specific grouping, object partitioning provides an even distribution of the load.
This solution is ideal for:
Processing large volumes of user uploads
Handling randomly named files
Large-scale data transformation tasks
Here is how Bacalhau handles this for you:
name: process-uploads
count: 5
...
tasks:
  - inputSources:
      - target: /uploads
        source:
          type: s3
          params:
            bucket: data-bucket
            key: user-uploads/
            partition:
              type: object
In this case, count: 5 runs five executions, and the partition block with type: object distributes the S3 objects evenly across them.
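Because each execution only receives its share of the objects under /uploads, the processing code itself never needs any partition logic. Here is a minimal, hypothetical Python sketch of such a partition-unaware script; the transformation step is just a placeholder:

from pathlib import Path

# This execution only sees its own subset of user-uploads/ under /uploads;
# Bacalhau decided the assignment, so no partition bookkeeping is needed here.
for path in sorted(Path("/uploads").rglob("*")):
    if path.is_file():
        data = path.read_bytes()
        # Replace with your real transformation logic.
        print(f"processed {path.name}: {len(data)} bytes")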
Processing by Date: Time-Series Analysis
Time-series analysis is both a blessing and a curse for every data professional, and often more of a curse than a blessing!
With Bacalhau, you can use partitioning to process each day's data in parallel. This is the perfect case for:
Daily analytics processing
Log aggregation and analysis
Time-series computations
Here is how you can do so:
name: daily-log-analysis
count: 7 # Process a week's worth of logs in parallel
...
tasks:
  - inputSources:
      - target: /logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: "logs/*"
            partition:
              type: date
              dateFormat: "2006-01-02"
In this case, count: 7 runs seven executions, one per day of the week. The dateFormat value "2006-01-02" is Go's reference-time layout, which simply means dates written as YYYY-MM-DD.
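To picture the intent of the date strategy, here is a small, self-contained Python sketch. It assumes, purely for illustration, that each object key embeds its date in the YYYY-MM-DD layout declared by dateFormat; objects from the same day group together, and each group goes to one of the seven executions:

import re
from collections import defaultdict

# Hypothetical object keys under the logs/ prefix, one directory per day.
keys = [
    "2024-03-01/app.log",
    "2024-03-01/api.log",
    "2024-03-02/app.log",
]

by_day = defaultdict(list)
for key in keys:
    day = re.search(r"\d{4}-\d{2}-\d{2}", key).group(0)
    by_day[day].append(key)

for day, members in sorted(by_day.items()):
    print(day, members)  # each day's group maps to one of the 7 executions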
Processing by Region: Geographic Analysis
Geographic analysis is another scenario that often involves large volumes of data. Partitioning lets you distribute the processing by region, which enables scenarios like:
Regional sales analysis
Geographic data processing
Territory-specific reporting
Here is how you can manage this in Bacalhau:
name: regional-analysis
count: 3 # One execution per region
...
tasks:
  - inputSources:
      - target: /sales
        source:
          type: s3
          params:
            bucket: global-sales
            key: "regions/*"
            partition:
              type: regex
              pattern: "([^/]+)/.*"
For example, if you have data in regions/NA/, regions/EU/, regions/APAC/, etc., each execution will process one region's worth of data. The pattern "([^/]+)/.*" is a standard regex that does the following:
([^/]+): matches and captures one or more characters that are not a forward slash (/). This is the first capturing group, and its captured value identifies the partition.
/.*: matches a forward slash (/) followed by zero or more characters (.*).
As a result, if the S3 key is regions/europe/sales.csv, the regex captures europe.
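If you want to sanity-check the pattern before submitting the job, a quick Python snippet reproduces the capture. It is not part of the job spec, and it assumes the pattern is applied to keys relative to the regions/ prefix, as in the example above:

import re

# The same pattern used in the job spec above.
pattern = re.compile(r"([^/]+)/.*")

# Hypothetical keys, shown relative to the regions/ prefix.
for key in ["europe/sales.csv", "NA/q1/sales.csv", "APAC/summary.json"]:
    match = pattern.match(key)
    print(key, "->", match.group(1) if match else "no match")
# europe/sales.csv -> europe
# NA/q1/sales.csv -> NA
# APAC/summary.json -> APAC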
Processing by Customer Segment
Another typical use of partitioning is customer segmentation. Common analysis scenarios include:
Customer cohort analysis
Segment-specific processing
Category-based computations
You can handle your analysis with Bacalhau partitioning as follows:
name: segment-analytics
count: 4
...
tasks:
  - inputSources:
      - target: /segments
        source:
          type: s3
          params:
            bucket: customer-data
            key: segments/*
            partition:
              type: substring
              startIndex: 0
              endIndex: 3
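Here, the substring strategy groups objects by a fixed slice of the object key, defined by startIndex and endIndex, so keys that start with the same segment code land in the same partition. Below is a small, hypothetical Python sketch of that grouping idea; the segment codes are invented, keys are shown relative to the segments/ prefix, and Bacalhau's exact slicing semantics may differ slightly:

from collections import defaultdict

# Hypothetical object keys, shown relative to the segments/ prefix.
keys = ["VIP-0001.csv", "VIP-0002.csv", "STD-0001.csv", "NEW-0001.csv"]

# Mirrors partition type: substring with startIndex: 0 and endIndex: 3:
# keys sharing the same leading characters are grouped together.
groups = defaultdict(list)
for key in keys:
    groups[key[0:3]].append(key)

for prefix, members in sorted(groups.items()):
    print(prefix, members)
# NEW ['NEW-0001.csv']
# STD ['STD-0001.csv']
# VIP ['VIP-0001.csv', 'VIP-0002.csv']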
Combining Partitioned and Shared Inputs
In certain cases, you may need a Bacalhau job to process partitioned data while also sharing reference data that every execution needs to access. Common scenarios are:
Processing daily logs with shared lookup tables
Analyzing data using common reference files
Running calculations that need both partitioned data and shared configuration
As an example, consider this job that combines static reference data with daily logs partitioned by date:
name: daily-analysis
count: 7 # Process a week of data
...
tasks:
  - inputSources:
      - target: /config
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference/*
            # No partitioning - all executions see all reference data
      - target: /daily-logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: logs/*
            partition:
              type: date
              dateFormat: "2006-01-02"
This spec partitions only the /daily-logs input, splitting a week of logs across seven executions with count: 7. The reference data mounted at /config, on the other hand, is not partitioned because its input source has no partition block, so every execution sees all of it.
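Inside each execution, the task code can then treat the two mounts differently: read the shared reference data in full and process only the day of logs it was given. Here is a minimal, hypothetical Python sketch; the lookup.json file name and the log format are invented for illustration:

import json
from collections import Counter
from pathlib import Path

# Shared input: every execution sees the same files under /config.
# lookup.json is a hypothetical file name used only for illustration.
lookup = json.loads(Path("/config/lookup.json").read_text())

# Partitioned input: this execution only sees its own day's logs.
counts = Counter()
for log_file in sorted(Path("/daily-logs").rglob("*")):
    if log_file.is_file():
        for line in log_file.read_text().splitlines():
            event = line.split(" ", 1)[0]  # hypothetical "<event> ..." log format
            counts[lookup.get(event, "unknown")] += 1

print(dict(counts))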
Why This Changes Your Large Data Set Processing
As you have seen, this feature is simple yet powerful. You no longer need to write partition-aware code: just clean, focused processing logic with automatic data assignment. We've already tested scaling to over 1,000 partitions, with no code changes needed and automatic load balancing. But tell us if you'd like us to go even further!
Getting Started With S3 Partitioning
If you’d like to try this example on your own, dive right in! Install Bacalhau and give it a shot.
By the way, if you don’t have a network and you would still like to try it out, we recommend using Expanso Cloud. Also, if you'd like to set up a cluster on your own, you can do that too (we have setup guides for AWS, GCP, Azure, and many more 🙂).
What's Next?
Start processing your S3 data today:
Identify your natural data groupings (dates, regions, categories)
Choose the matching partition strategy
Let Bacalhau handle the distribution
Ready to simplify your distributed data processing? Check out our documentation for more examples and detailed guides.
Join our community to share your data processing stories and learn from others!
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us or get your license on Expanso Cloud!