Efficient Historic Log Analysis with Bacalhau Batch Jobs
Save $2.5M Per Year by Managing Logs the AWS Way (5 min)
Introduction
Welcome back to our Bacalhau blog series! We've been journeying through the world of efficient log management with Bacalhau, uncovering ways to make this crucial task both cost-effective and operationally streamlined. From slashing log management costs to deploying nimble logging agents, Bacalhau has consistently proven its worth. Today, we're diving into a vital piece of the puzzle: querying historic raw logs in S3 with Bacalhau's batch jobs.
The Challenge of Historic Log Analysis
Historic log analysis often treads a fine line between accessibility and cost-effectiveness. Typically, real-time metrics and aggregated data do the trick, but sometimes a deep dive into raw logs is necessary for comprehensive batch analysis or detailed troubleshooting. However, constantly streaming all raw logs to platforms like Splunk or ElasticSearch can be prohibitively expensive and operationally cumbersome.
Bacalhau's Approach: On-Demand, Scalable Batch Jobs
Enter Bacalhau's scalable, on-demand approach for querying historic raw logs in S3. This method significantly cuts down on operational overhead and costs, as we've highlighted in our previous post.
Key Features of Bacalhau Batch Jobs
On-Demand: Bacalhau's batch jobs kick into action as needed, optimizing resource utilization.
Reliable: Bacalhau ensures smooth job execution with efficient node selection, monitoring, and failover.
Remote Input Sources: Bacalhau can pull data from various sources, like S3, without additional coding requirements.
Flexible Routing: Bacalhau runs jobs on idle capacity or on a dedicated compute fleet. In our example, we use compute nodes labeled service=ComputeService.
Versatile Compute Engines: Bacalhau's flexibility shines with the various types of jobs it supports. In this tutorial, we use DuckDB for its large-scale data analysis capabilities.
Step 0 - Prerequisites
Prepare your log orchestration network and start generating nginx access logs, as detailed in our previous blog post, before proceeding.
Step 1 - Deploy Compute Fleet
In our previous setup, we deployed three web server instances with logging agents. While Bacalhau can process logs on S3 using any network node, including these web servers, we're deploying a dedicated compute fleet for this demo to highlight Bacalhau’s job routing flexibility.
1. Navigate to the right directory in the examples repo:
cd log-orchestration/cdk
2. Re-deploy CDK with three compute EC2 instances:
cdk deploy -c computeServiceInstanceCount=3
CDK Outputs
Post-deployment, keep an eye on these outputs:
OrchestratorPublicIp: Connect the Bacalhau CLI to this IP
AccessLogBucket: S3 bucket where raw logs are stored
ResultsBucket: S3 bucket to store query results
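If you need to look these values up again later without re-deploying, here's a quick sketch using the AWS CLI (the stack name below is a placeholder; substitute the name printed by cdk deploy):
# Print the stack outputs again (stack name is an assumption; use your own)
aws cloudformation describe-stacks \
  --stack-name <YourLogOrchestrationStack> \
  --query "Stacks[0].Outputs" \
  --output table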
Step 2 - Access Bacalhau Network
Let's ensure our network is ready and the orchestrator can see the new compute instances.
1. Configure your Bacalhau CLI for your private cluster:
export BACALHAU_NODE_CLIENTAPI_HOST=<OrchestratorPublicIp>
2. List all the nodes:
bacalhau node list
You should see the web server nodes and the new compute nodes.
Step 3 - Run Batch Job
Let's dive into the heart of the process: running a batch job with Bacalhau. We'll start by setting up a sample job specification.
3.0 Sample Job Specification
Type: batch
Count: 1
Constraints:
  - Key: service
    Operator: ==
    Values:
      - ComputeService
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: expanso/nginx-access-log-processor:1.0.0
        Parameters:
          - --query
          - SELECT status FROM logs WHERE status LIKE '5__'
    InputSources:
      - Target: /logs
        Source:
          Type: s3
          Params:
            Bucket: {{.AccessLogBucket}}
            Key: {{.AccessLogPrefix}}
            Filter: {{or (index . "AccessLogPattern") ".*"}}
            Region: {{.AWSRegion}}
Job Type: This batch job operates on a single node labeled service=ComputeService, showcasing Bacalhau's flexible deployment capabilities.
Input Source: The job fetches data from an s3 source, specifically from AccessLogBucket. You can fine-tune the data selection using AccessLogPrefix and AccessLogPattern to focus on specific log timeframes. Bacalhau leverages Go’s text/template for dynamic job specs, allowing CLI flags or environment variables as inputs. More details here.
Execution: The job executes within a Docker container, running a DuckDB query on logs retrieved from S3. For a deep dive into the code, check here.
3.1 Submitting the Job
Declarative Approach
To submit your job, run:
bacalhau job run sample_job.yaml --follow \
-V "AccessLogBucket=<VALUE>" \
-V "AWSRegion=<VALUE>" \
-V "AccessLogPrefix=2023-11-19-*" \
-V "AccessLogPattern=^[10-12].*"
-V or --template-vars sets values for your job template. Remember to update AccessLogBucket and AWSRegion based on your CDK outputs. This example targets logs from 2023-11-19 between 10 AM and 12 PM.
--follow streams the job’s log output to stdout. Alternatively, you can fetch logs using bacalhau logs <job_id>.
Note: You can also set these variables as environment variables and run your command using --template-envs ".*"
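For example, a minimal sketch of that environment-variable flow (the exported values are placeholders for your own CDK outputs):
# Export the template values once, then let Bacalhau read them from the environment
export AccessLogBucket=<VALUE>
export AWSRegion=<VALUE>
export AccessLogPrefix="2023-11-19-*"
export AccessLogPattern="^[10-12].*"
bacalhau job run sample_job.yaml --follow --template-envs ".*"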
Imperative Approach
Alternatively, submit a batch job like this:
bacalhau docker run -f \
-i "s3://$AccessLogBucket/2023-11-19-20/:/logs,opt=region=$AWSRegion" \
-s "service=ComputeService" \
expanso/nginx-access-log-processor:1.0.0 -- \
--query "SELECT status FROM logs WHERE status LIKE '5__'"
3.2 Accessing Results
Bacalhau Logs
To access the query results, which are published to stdout, use:
bacalhau logs <job_id>
Publish Results
Use Bacalhau publishers to store the result in ResultsBucket for later use or analysis by updating the job spec as follows:
...
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: expanso/nginx-access-log-processor:1.0.0
        Parameters:
          - --query
          - SELECT status FROM logs WHERE status LIKE '5__'
          - --json
    Publisher:
      Type: "s3"
      Params:
        Bucket: {{.ResultsBucket}}
        Region: {{.AWSRegion}}
        Key: "5xx_requests-{date}.tar.gz"
        Compress: true
...
Download these results anytime using:
bacalhau get <job_id>
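Since the publisher writes to your ResultsBucket, you can also pull the archive straight from S3; here is a minimal sketch (the bucket value comes from your CDK outputs, and the file name follows the Key pattern in the spec above):
# List and download the published result, then extract the compressed archive
aws s3 ls s3://<ResultsBucket>/
aws s3 cp s3://<ResultsBucket>/5xx_requests-<date>.tar.gz .
tar -xzf 5xx_requests-<date>.tar.gz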
Step 4 - Explore More Powerful Queries
In this section, we'll delve into various meaningful queries you can run on your raw Nginx access logs using Bacalhau batch jobs and DuckDB. Each query is designed to extract specific insights from the logs, ranging from basic analysis to more complex, deep-dive investigations. You can find all the job specifications on our GitHub repository.
Basic Queries
Security Threat Identification: This query helps in spotting IPs with unusually high request rates, which could signal potential security threats. It's a crucial step in proactive security management. View Job Spec
Error Analysis: This analysis focuses on identifying the most common 404 error URLs. It's particularly useful for pinpointing broken links or areas of your site that users are struggling to access. View Job Spec
Top Referring Sites: This query helps you understand which external sites are driving the most traffic. View Job Spec
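To give a flavor of what such a job looks like, here is a hedged sketch of the Top Referring Sites query submitted imperatively; the referer column name is an assumption, so adjust it to the schema the processor actually exposes, and substitute your own bucket, prefix, and region:
# NOTE: the column name 'referer' is assumed; check the processor's log schema
bacalhau docker run -f \
  -i "s3://$AccessLogBucket/2023-11-19-20/:/logs,opt=region=$AWSRegion" \
  -s "service=ComputeService" \
  expanso/nginx-access-log-processor:1.0.0 -- \
  --query "SELECT referer, COUNT(*) AS hits FROM logs WHERE referer <> '-' GROUP BY referer ORDER BY hits DESC LIMIT 10"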
Deep Dive Queries
Longest User Sessions: This query identifies user sessions with the longest duration by analyzing request start and end times from the same IP. It's valuable for understanding user engagement and behavior on your site. View Job Spec
Sequential Request Patterns: This query reveals common user navigation patterns by identifying sequential page requests. It uses window functions to track each user’s subsequent request, shedding light on their journey across your site. View Job Spec
Detailed Error Investigation: For errors like 404 or 500, this query delves into the sequence of actions leading to the error, offering deeper insight for troubleshooting and understanding context. View Job Spec
Behavior-Based User Segmentation: This query segments users by their specific action sequences, time spent on pages, and interaction patterns. It's a sophisticated approach to user behavior analysis. View Job Spec
Unusual Request Patterns: This query spots IPs with atypical request patterns, like high error rates or odd page access sequences, to identify potential system issues or user behavior anomalies. View Job Spec
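As one more illustration, here is a hedged sketch of an Unusual Request Patterns query that flags IPs whose requests are mostly errors; the ip column name is an assumption, while status matches the schema used earlier in this post:
# NOTE: the column name 'ip' is assumed; 'status' matches the earlier examples
bacalhau docker run -f \
  -i "s3://$AccessLogBucket/2023-11-19-20/:/logs,opt=region=$AWSRegion" \
  -s "service=ComputeService" \
  expanso/nginx-access-log-processor:1.0.0 -- \
  --query "SELECT ip, COUNT(*) AS requests, SUM(CASE WHEN status LIKE '4__' OR status LIKE '5__' THEN 1 ELSE 0 END) AS errors FROM logs GROUP BY ip HAVING SUM(CASE WHEN status LIKE '4__' OR status LIKE '5__' THEN 1 ELSE 0 END) > COUNT(*) / 2 ORDER BY errors DESC LIMIT 20"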
Conclusion
Bacalhau's batch jobs offer a powerful solution to historic log analysis. This approach not only saves costs but also boosts operational efficiency, enabling a level of analysis that was once difficult to attain. As we continue to explore Bacalhau's capabilities, its ability to revolutionize log management becomes ever more apparent.
Keep an eye out for more insights and tutorials as we further unlock the power of Bacalhau in log orchestration.
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us!
What’s Next?
Keep an eye out for part 4 of our log orchestration series. You can learn more about our Logstash pipeline configuration and aggregation implementation here.
And stay tuned - we've got new content rolling out regularly, including step-by-step tutorials and code snippets to jump-start your own log management adventure with Bacalhau. Get ready to level up; you can find more information here.
We are committed to delivering groundbreaking updates and improvements, and we're looking for help in several areas. If you're interested, there are several ways to contribute, and you can always reach out to us via Slack or Email.
Bacalhau is available as open-source software and you can download it for free here. The public GitHub repo can be found here.
If you like our software and content, please give us a star ⭐