Process Remote Data with Python - Serverlessly!
Augment your existing workflows with Bacalhau to take advantage of the raw power of the cloud.
Frustrated by jumping through hoops to process remote data securely? You're not alone. In today's data-driven world, organizations grapple with:
Distributed Data Silos: Data scattered across multiple environments—in clouds, regions, and on-premise servers.
Security Concerns: Granting access to remote machines usually means exposing SSH keys or sensitive credentials.
Complex Tooling: Traditional methods require containerization or complex setups to run code remotely.
These challenges make it difficult to process data without compromising security or efficiency. Worst of all, interacting with the data often requires adopting entirely new tools like Databricks, Docker, or Kubernetes.
Wouldn’t it be easier to just run a Python script? With Bacalhau, you can! If you want to follow along, the complete code for these examples can be found at https://github.com/bacalhau-project/examples/tree/main/shared-cluster
Scenario: A Multi-region Data Team
Let’s walk through a scenario in which we process a large amount of data spread across a variety of locations - multiple clouds, regions, and on-premise servers.
The data also contains Personally Identifiable Information (PII) that requires regulatory oversight. At first glance, the solution may require building and deploying containers - a complicated burden for your data scientists. But it doesn’t have to be that way. They can execute raw Python instead.
Processing Data - Remotely and Securely - with Python and Bacalhau
With Bacalhau 1.5 and templating, it’s easy to run your jobs remotely, even under extremely locked-down circumstances.
For information on setting up a Bacalhau network, check out this article.
In a Bacalhau network, in order to run scripts next to the data, your data scientists need a job to execute. A sample job looks like this:
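(What follows is a minimal sketch - the image name is a placeholder, and the exact jobs/template_job.yaml lives in the examples repo.)

Name: python-script-runner
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        # Placeholder image that decodes and runs whatever script is in $COMMAND
        Image: myregistry/python-runner:latest
        EnvironmentVariables:
          - COMMAND={{ .fulltext }}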
Quite straightforward! The magic happens in two places:
Image: points at a very simple container we have built that executes Python whenever a script is present in the COMMAND environment variable.
COMMAND={{ .fulltext }}: a template variable that lets us inject the script at the command line when we execute the job.
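Under the hood, the runner image's entrypoint can be tiny. Here is a hypothetical sketch in Python of what it does (the actual image may differ):

# Hypothetical entrypoint: decode the Base64 script in COMMAND and execute it
import base64, os, runpy, tempfile

script = base64.b64decode(os.environ["COMMAND"]).decode("utf-8")

# Write the decoded script to a temporary file and run it as a main module
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
runpy.run_path(f.name, run_name="__main__")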
How does this work in practice? Let’s use a simple Python file that calculates the Fibonacci sequence:
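(A minimal stand-in - the exact scripts/hello_world.py is in the examples repo.)

# Compute and print the first n Fibonacci numbers
def fibonacci(n):
    sequence, a, b = [], 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

if __name__ == "__main__":
    print("Fibonacci:", fibonacci(10))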
Execute it with the following command:
bacalhau job run jobs/template_job.yaml --template-vars "fulltext=$(cat scripts/hello_world.py | base64)"
This command takes the Python script, encodes it in Base64, embeds it as an environment variable, and runs it on the server of your choice (or the first one available). In the job output, everything prints exactly as expected. No containerization or other setup - it just ran!
Making Sure You’re Running Securely
You may have heard that running arbitrary Python scripts can be insecure. It certainly can be - even the most carefully locked-down setups can let something squeak through. The answer is a "defense in depth" approach, and Bacalhau is built to be secure by default.
The following script tests a number of malicious activities, like exfiltrating data, reading from /etc/passwd, and deleting data:
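(An abbreviated sketch - the real scripts/bad_stuff.py is in the examples repo.)

# Each attempt below should be blocked by the sandboxed execution environment
import os
import socket
import subprocess

def attempt(description, action):
    try:
        action()
        print(f"{description}: SUCCEEDED - this should not happen!")
    except Exception as exc:
        print(f"{description}: blocked ({type(exc).__name__})")

attempt("Read /etc/passwd", lambda: open("/etc/passwd").read())
attempt("Exfiltrate data over the network",
        lambda: socket.create_connection(("203.0.113.1", 443), timeout=2))
attempt("Run an arbitrary binary",
        lambda: subprocess.run(["/bin/sh", "-c", "id"], check=True))
attempt("Delete a file", lambda: os.remove("/etc/hostname"))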
So let’s run this one:
bacalhau job run jobs/template_job.yaml --template-vars "fulltext=$(cat scripts/bad_stuff.py | base64)"
Now check the output. Presto! Notice everything the script attempted:
Tried to read /etc/passwd?
Tried to exfiltrate data?
Tried to run an arbitrary binary?
Tried to open a network connection?
Tried to delete a file?
All blocked! With Bacalhau, you get a much cleaner security profile.
Analyzing an HDF5 File with Bacalhau
But locking down the environment does not matter if you can’t do anything with the data.
Let's say you have a script that analyzes an HDF5 file using the h5py package, but you ALSO have data that is too big to move.
Bacalhau empowers you to securely process remote data without the usual headaches:
No Need to Move Data: Instead, bring the compute to it.
Enhanced Security: Run code in a sandboxed environment with strict controls.
Simplified Workflow: Execute arbitrary Python scripts without containerization.
Flexible Deployment: Easily set up Bacalhau in multi-compute environments.
Let’s execute a job with a 14GB+ file WITHOUT moving the data, offering (potentially) risky shell access, or installing complicated dependencies on the server.
As before, we’ll create a Python script (abbreviated for readability - full version here):
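(The version below is a stand-in; the dataset path is a placeholder for wherever the file is mounted on the compute node.)

# scripts/analyze_hd5_file.py (abbreviated) - walk the file and summarize every dataset
import h5py

DATA_PATH = "/data/large_dataset.h5"  # placeholder path on the remote node

def summarize(name, obj):
    # Report shape and dtype without loading the data into memory
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

with h5py.File(DATA_PATH, "r") as f:
    print("Top-level keys:", list(f.keys()))
    f.visititems(summarize)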
This is the standard HDF5 analysis code that you would normally run on your local machine. In this example, we want to run it on the remote server.
Seems complicated? It’s not - running it is EXACTLY THE SAME!
bacalhau job run jobs/template_job.yaml --template-vars "fulltext=$(cat scripts/analyze_hd5_file.py | base64)"
When we look at the output, it displays just as if we were running locally - and it completes in only 7.5 seconds, far faster than moving the 14GB+ file first.
Conclusion
Securely processing remote data doesn't have to be a labyrinth of complexity. Bacalhau eliminates the need for complicated tooling or containerization, enabling you to run Python scripts next to your data, no matter where it's located.
Whether you're handling massive datasets across multiple regions or navigating strict regulatory requirements, Bacalhau empowers you to focus on extracting insights without the usual headaches.
So why keep jumping through hoops? Simplify your workflow and enhance your security posture by giving Bacalhau a try. Your data scientists - and your peace of mind - will thank you.