Introduction
We've embarked on a journey exploring Bacalhau's transformative approach to log management. From cost-saving strategies, to deploying daemon jobs that stream logs to OpenSearch and S3, to batch-processing historic logs stored in S3, we've covered a lot of ground. Today, we're diving into the world of real-time log analysis with Bacalhau's ops jobs. This is crucial when aggregated logs and metrics in OpenSearch provide insights but fall short of pinpointing the root cause of live incidents.
The Power of Ops Jobs
Ops jobs in Bacalhau are uniquely designed for real-time analysis. Unlike batch jobs that operate on a fixed number of nodes, ops jobs run on all nodes that meet specific criteria. This feature is particularly valuable for real-time troubleshooting and incident investigation. It enables operators to run queries across all relevant nodes simultaneously, providing a comprehensive view of live situations.
In fact, ops jobs play a pivotal role in driving down costs in log orchestration. They allow us to avoid streaming all raw logs to costly real-time destinations while still making live logs easily accessible on-demand during incident investigation.
Key Features of Bacalhau Ops Jobs
Real-Time Analysis: Instantly query live logs across all relevant nodes.
Comprehensive Coverage: Execute queries on every node that meets your criteria, ensuring no data is missed.
Versatile Compute Engines: Bacalhau's flexibility shines in the variety of execution engines it supports. In this tutorial, we're using DuckDB for its ease of use and efficiency.
Prerequisites
Before diving in, make sure you've set up your log orchestration network and started generating nginx access logs as outlined in our previous blog post.
Step 1 - Access Bacalhau Network
In our existing setup, we have a network of web servers generating logs. Let's make sure the network is ready and the orchestrator can see the compute instances.
Configure your Bacalhau CLI to point at your private cluster:
export BACALHAU_NODE_CLIENTAPI_HOST=<OrchestratorPublicIp>
List the nodes:
bacalhau node list --labels "service=WebService"
You should now see three web server nodes.
Step 2 - Run Ops Job
2.1 Job Specification
This is the specification for ops-job.yaml, which targets all WebService local logs and utilizes Bacalhau's powerful job templating to define the query and optional time ranges.
Name: Live logs processing
Type: ops
Namespace: logging
Constraints:
- Key: service
Operator: ==
Values:
- WebService
Tasks:
- Name: main
Engine:
Type: docker
Params:
Image: expanso/nginx-access-log-processor:1.0.0
Parameters:
- --query
- {{.query}}
- --start-time
- {{or (index . "start-time") ""}}
- --end-time
- {{or (index . "end-time") ""}}
InputSources:
- Target: /logs
Source:
Type: localDirectory
Params:
SourcePath: /data/log-orchestration/logs
Job Type: This ops job operates on all nodes labeled service=WebService, showcasing Bacalhau's flexible deployment capabilities.
Input Source: The job mounts local logs from /data/log-orchestration/logs. Bacalhau takes security seriously: only allow-listed paths can be mounted, as shown here.
Execution: The job executes within a Docker container, running a DuckDB query on local logs. In addition to defining the query, you can define fixed and relative time ranges, such as start-time=-5m. For a deep dive into the code, check here.
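To make the job spec concrete, here is a hedged sketch of the kind of parsing the container performs before the DuckDB query runs. The actual image uses DuckDB internally; this Python sketch only illustrates the assumed log schema, using nginx's standard "combined" format and the field names (remote_addr, status, body_bytes_sent) that appear in the queries below.

```python
import re

# nginx "combined" log format, assumed from the field names used in this post.
LINE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+)'
)

def parse(line):
    """Parse one access-log line into a dict of fields, or None on mismatch."""
    m = LINE.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.7 - - [12/Jan/2024:10:15:32 +0000] "GET /api/items HTTP/1.1" 502 512'
record = parse(sample)

# Rough equivalent of: SELECT status FROM logs WHERE status LIKE '5__'
is_server_error = record is not None and record["status"].startswith("5")
print(record["status"], is_server_error)  # 502 True
```

Parsing into named fields like this is what lets a single SQL query work identically across every node's local log files.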
2.2 Submitting the Job
Declarative Approach
To submit your job, run:
bacalhau job run ops-job.yaml --follow \
-V "query=SELECT status FROM logs WHERE status LIKE '5__'" \
-V "start-time=-5m"
-V or --template-vars sets values for your job template.
--follow streams the job's log output to stdout. Alternatively, you can fetch logs using bacalhau logs <job_id>.
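The relative time syntax (e.g. start-time=-5m) reads as "five minutes before now". As a hedged illustration of what such an offset resolves to (the processor's actual parsing rules may differ), here is a minimal sketch:

```python
from datetime import datetime, timedelta, timezone

# Unit suffixes assumed for illustration; the processor may accept others.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def resolve_relative(spec, now=None):
    """Turn a relative offset like '-5m' into an absolute UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    sign = -1 if spec.startswith("-") else 1
    body = spec[1:] if spec[0] in "+-" else spec
    value, unit = int(body[:-1]), body[-1]
    return now + sign * timedelta(**{UNITS[unit]: value})

now = datetime(2024, 1, 12, 10, 0, tzinfo=timezone.utc)
print(resolve_relative("-5m", now))  # 2024-01-12 09:55:00+00:00
```

Resolving the window on each node at execution time means every server queries the same recent slice of its own logs, with no clock coordination needed beyond NTP.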
Imperative Approach
Alternatively, submit an ops job like this:
bacalhau docker run -f \
--target all \
-i "file:///data/log-orchestration/logs:/logs" \
-s "service=WebService" \
expanso/nginx-access-log-processor:1.0.0 -- \
--query "SELECT status FROM logs WHERE status LIKE '5__'" \
--start-time "-5m"
Step 3 - Explore More Queries
Error Spike Analysis:
Query: SELECT COUNT(*), status FROM logs WHERE status LIKE '5__' GROUP BY status;
Purpose: Quickly identify servers experiencing a surge in 5xx errors, indicating potential issues.
Traffic Source Monitoring:
Query: SELECT remote_addr, COUNT(*) FROM logs GROUP BY remote_addr ORDER BY COUNT(*) DESC LIMIT 10;
Purpose: Identify the top IP addresses generating traffic, crucial for spotting potential DDoS attacks or popular content.
Resource Usage Analysis:
Query: SELECT request, SUM(body_bytes_sent) FROM logs GROUP BY request ORDER BY SUM(body_bytes_sent) DESC LIMIT 10;
Purpose: Understand which endpoints are consuming the most bandwidth to optimize server performance.
Security Checks:
Query: SELECT * FROM logs WHERE request LIKE '%sensitive_path%' OR status = '401';
Purpose: Detect access attempts to sensitive URLs or unauthorized access attempts.
Conclusion
Bacalhau's ops jobs are a powerful solution for real-time log analysis. They provide a level of depth and insights that aggregated logs alone can’t match, which is vital for quick incident response and effective troubleshooting. As we continue to explore Bacalhau's capabilities, its potential to transform real-time log management and incident investigation becomes increasingly clear.
Stay tuned for more insights and practical guides as we further unlock the power of Bacalhau in efficient and effective log orchestration.
How to Get Involved
We're looking for help in several areas. If you're interested in contributing, please reach out to us at any of the following locations.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!