Cross-Border Data Processing With Privacy Compliance Through Expanso
Using Bacalhau to handle complex data pipelines that cross borders while preserving privacy
Many organizations work with clients and infrastructure around the world and face significant challenges ensuring they follow privacy regulations as their application data flows across borders.
Data privacy regulations such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in California impose requirements on how applications can store, process, and transfer personal and sensitive information across borders.
The core challenge lies in maintaining data sovereignty and conforming to these rules while enabling cross-border analytics. For instance, when collecting personal data in the European Union, regulations often require that this data remain within EU borders. However, organizations still need to perform analytics on this data in other regions, creating a complex compliance challenge that requires careful architectural considerations.
Organizations also need to process large volumes of data efficiently. They need a distributed processing approach that keeps data local and sovereign while providing reliable job orchestration and monitoring that can scale out with demand.
This post looks at how you can use Bacalhau to handle distributed cross-border processing and anonymize data with Microsoft Presidio to help meet some of these requirements.
Bacalhau is an open-source distributed platform that enables you to run compute jobs where data is generated and stored.
A practical guide
This tutorial migrates data from the EU to the US by creating a synthetic dataset and using Microsoft Presidio to analyze, extract, and anonymize its sensitive data before migration.
Prerequisites
Before starting, ensure your system meets the following minimum requirements:
20 GB of free disk space
4 CPU cores
Docker and Docker Compose installed
Set up multi-region deployment environment
The Docker Compose file below sets up a multi-region Bacalhau deployment: one orchestrator node that receives and schedules jobs, plus three compute nodes and one storage node in each region (the US and the EU in this example). Node labels and job constraints keep work in the right region.
The storage nodes run MinIO, an S3-compatible object storage server, to store the data.
git clone https://github.com/bacalhau-project/bacalhau-network-setups
cd bacalhau-network-setups/docker-compose/multi-region
docker compose up -d
The Bacalhau solution implements a multi-regional data processing architecture that strictly adheres to data sovereignty requirements while enabling efficient cross-border analytics. The architecture consists of three main components:
Compute resources in each geographical region.
A distributed storage system in each region that can be queried individually or as a whole.
An orchestration layer that coordinates jobs and requests across the system.
Install the Bacalhau CLI
To interact with the newly created Bacalhau deployment, install the Bacalhau CLI:
curl -sL 'https://get.bacalhau.org/install.sh' | bash
To verify that you are targeting the right Bacalhau deployment, run the command below.
bacalhau node list
You should see a list of seven nodes (one orchestrator and six compute nodes), similar to the following:
ID TYPE APPROVAL STATUS LABELS CPU MEMORY DISK GPU
compute-eu-1 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=eu 8.0 6.6 GB 861 GB 0
compute-eu-2 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=eu 8.0 6.6 GB 861 GB 0
compute-eu-3 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=eu 8.0 6.6 GB 861 GB 0
compute-us-1 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=us 8.0 6.6 GB 861 GB 0
compute-us-2 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=us 8.0 6.6 GB 861 GB 0
compute-us-3 Compute APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=us 8.0 6.6 GB 861 GB 0
orchestrator Requester APPROVED CONNECTED Architecture=arm64, Operating-System=linux, region=global, type=orchestrator
Clone example job repository
In a separate directory, clone the examples repository and navigate to the data anonymization example folder.
git clone https://github.com/bacalhau-project/examples.git
cd examples/data-engineering/data-anonymization-with-microsoft-presidio/
You can find the job specifications used for the rest of this post in the jobs folder, and more details on the possible specification options in the Bacalhau documentation.
Generate fake sensitive data
The data-generator.yaml job runs a bash script in a Docker container and generates 30 files. Each file simulates a memo containing personal data such as names, phone numbers, and addresses. The job generates the data locally on compute nodes in the EU region and then pushes the files to a MinIO bucket.
Submit the job to the compute nodes labeled eu in the Bacalhau cluster with the command below:
bacalhau job run -V Region=eu jobs/data-generator.yaml
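The generator itself is a bash script packaged in the job's Docker image, but as a rough illustration of the kind of memo content it simulates, here is a hypothetical Python sketch. All of the names, phone numbers, and addresses below are invented for illustration and do not come from the example repository.

# Illustrative sketch only: the real data-generator.yaml job runs a bash script
# in a Docker container and publishes its output to an EU-based MinIO bucket.
import random
from pathlib import Path

FAKE_NAMES = ["Anna Schmidt", "Jean Dupont", "Marta Rossi"]
FAKE_PHONES = ["+49 30 1234567", "+33 1 23 45 67 89", "+39 06 1234 5678"]
FAKE_ADDRESSES = ["Hauptstrasse 5, Berlin", "12 Rue de Rivoli, Paris"]

out_dir = Path("memos")
out_dir.mkdir(exist_ok=True)

# Write 30 small memo files, each containing a few pieces of personal data.
for i in range(30):
    memo = (
        f"Internal memo #{i}\n"
        f"Please contact {random.choice(FAKE_NAMES)} at {random.choice(FAKE_PHONES)} "
        f"about the delivery to {random.choice(FAKE_ADDRESSES)}.\n"
    )
    (out_dir / f"memo-{i}.txt").write_text(memo, encoding="utf-8")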
Anonymize the data
The anonymize-job.yaml job runs a Python script in a Docker container that uses Microsoft Presidio to analyze, extract, and anonymize sensitive data in the files stored in the EU-based MinIO bucket.
Presidio is an open-source toolkit that uses NLP models to identify and anonymize sensitive information in structured and unstructured data formats. It can process different content types, including:
Unstructured text documents and communications.
Emails and business correspondence.
Internal memos and reports.
Images containing sensitive information.
Business documents and forms.
Presidio’s strength lies in recognizing multiple types of Personally Identifiable Information (PII). It can identify and sanitize sensitive elements such as names, addresses, identification numbers, and other personal information while maintaining the document’s structure and meaning. This capability is ideal for preparing data for cross-border transfers while maintaining compliance with data protection regulations.
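To make this concrete, the following is a minimal, self-contained Presidio sketch of the analyze-then-anonymize flow. It is illustrative rather than the exact script from the example repository, and it assumes the presidio-analyzer and presidio-anonymizer packages plus an English spaCy model (for example en_core_web_lg) are installed.

# Minimal Presidio sketch (illustrative; the example repository's script differs).
# Requires: pip install presidio-analyzer presidio-anonymizer
#           python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "John Smith called on 2024-01-15 and left the number 212-555-0123."

# Detect PII entities (names, dates, phone numbers, IBANs, ...) in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace each detected entity with a generic <ENTITY_TYPE> placeholder.
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)
# Typically something like:
# "<PERSON> called on <DATE_TIME> and left the number <PHONE_NUMBER>."

By default, the anonymizer replaces each detected entity with its type name in angle brackets, which is the placeholder style you will see in the job's output below.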
The job outputs the anonymized files to a US-based MinIO bucket.
Again, submit the job to the compute nodes labeled eu in the Bacalhau cluster with the command below:
bacalhau job run -V Region=eu jobs/anonymize-job.yaml
This job takes a while to complete. You can check its status with the bacalhau job executions <job-id> command.
Presidio anonymizes the data by replacing sensitive information with generic placeholders. For example, it replaces names with <PERSON>, IBAN codes with <IBAN_CODE>, and dates with <DATE_TIME>.
Bacalhau input and output configuration
The InputSources, ResultPaths, and Publisher sections of the anonymize-job.yaml specification are the key components that enable cross-border processing of the anonymized data.
InputSources:
  - Target: /inputs
    Source:
      Type: s3
      Params:
        Bucket: my-bucket
        Key: "confidential-memos/"
        Endpoint: "http://storage-local:9000"
        Region: "eu-central-1"
        Filter: ".*txt$"
This job reads its input from MinIO, which simulates an S3 bucket in the EU region.
This matches the Publisher section of the data-generator.yaml job, which defines where that job writes its output: in this case, an S3-compatible MinIO bucket in the EU region.
Publisher:
  Type: s3
  Params:
    Bucket: my-bucket
    Key: "confidential-memos/{nodeID}/"
    Endpoint: "http://storage-local:9000"
    Region: "eu-central-1"
    Encoding: plain
After anonymizing the data, the job writes the output to a different MinIO bucket in a US region, using a different MinIO endpoint.
ResultPaths:
  - Name: anonymized-memos
    Path: /anonymized-output
Publisher:
  Type: "s3"
  Params:
    Bucket: "my-bucket"
    Key: "anonymized-memos/{date}/{time}/memos-{executionID}"
    Endpoint: "http://storage-us:9000"
    Region: "us-east-1"
The configuration builds a dynamic Key from {date}, {time}, and {executionID} to create a well-organized storage hierarchy that makes it easy to track different processing runs.
Cleanup
When you’ve finished with the example, you can clean up the environment with the following commands:
# Stop the stack
docker compose down -v
# Clean up volumes
docker volume prune
Summary
This post showed how you can use Bacalhau to maintain clear boundaries between sensitive and anonymized data by taking the following steps:
Accessing input data only from a source in the EU region
Processing the data with Presidio in the same region as the sensitive data
Publishing only anonymized results to a US region
Letting US-based compute nodes perform analytics on the sanitized data
This process and setup help satisfy data sovereignty requirements while enabling efficient cross-region data processing and analytics.