Ship Stalwart logs to CloudWatch Logs#209

Open
ryanjjung wants to merge 12 commits into main from logdests

@ryanjjung
Contributor

@ryanjjung ryanjjung commented Apr 3, 2026

A few things are happening here. A brief list is below, but I'll leave some commentary inline as well.

Importantly, this cannot be merged until a new release of tb_pulumi is cut. Update: The new code has been released, and we can proceed with this rollout.

Briefly, we:

  • Build a new log destination for Stalwart logs with two log streams, one for each of our server functions. This helps us separate mail service logs from the logs that show management API activity, which is used programmatically by thunderbird-accounts.
  • Add a new function concept to handle the api/mail dichotomy. This new parameter in the node definition becomes a tag on the instance, which the second-phase bootstrapping process picks up and templates into the fluent-bit config.
  • Install fluent-bit as part of the bootstrapping process, along with a configuration that extracts only the MESSAGE portion of Thundermail systemd log events and sends it along to CloudWatch Logs.

@ryanjjung ryanjjung self-assigned this Apr 3, 2026
@ryanjjung ryanjjung added the enhancement New feature or request label Apr 3, 2026

[Service]
Type=simple
Environment="ENV={{ env }}"
Contributor Author

{{ this }} is a Jinja variable, replaced in the bootstrapping process. This is how ENV=stage or whatever gets into the service environment.

# do not exist, this service must have permission to create them.
- name: cloudwatch_logs
  match: cloudwatch.stalwart.mail
  log_group_name: /tb/${ENV}/stalwart
Contributor Author

${THIS} is not Jinja, but a syntax native to fluent-bit's configuration. fluent-bit substitutes it at runtime with the value of the named environment variable.
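
The two substitution stages described in these comments can be sketched in Python. This is only an illustration: the regex below is a stand-in for the Jinja pass (the real bootstrap uses Jinja2), `string.Template` is a stand-in for fluent-bit's own `${NAME}` expansion, and the helper names `render_bootstrap` and `expand_runtime` are hypothetical.

```python
import re
from string import Template


def render_bootstrap(template: str, values: dict[str, str]) -> str:
    """Stand-in for the bootstrap-time Jinja pass: fills {{ name }} placeholders."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)


def expand_runtime(config: str, environ: dict[str, str]) -> str:
    """Stand-in for fluent-bit expanding ${NAME} from its process environment."""
    return Template(config).safe_substitute(environ)


unit_template = 'Environment="ENV={{ env }}"'
config_template = "log_group_name: /tb/${ENV}/stalwart"

# Stage 1: bootstrapping bakes the environment name into the unit file.
unit = render_bootstrap(unit_template, {"env": "stage"})

# The bootstrap pass does not touch ${ENV}, so the config keeps its placeholder.
config_on_disk = render_bootstrap(config_template, {"env": "stage"})

# Stage 2: at runtime the service environment supplies ENV.
effective = expand_runtime(config_on_disk, {"ENV": "stage"})
```

Keeping the two syntaxes distinct means the same rendered config file works unchanged if the service environment is later edited.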

pipeline:
  inputs:
    - name: systemd
      tag: cloudwatch.stalwart.{{ function }}
Contributor Author

Tagging things with the function lets us catch them later and route them properly.
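
As a rough model of that routing, the sketch below pairs a tag built from the server function with the output whose `match` pattern accepts it. It uses Python's `fnmatchcase` as an approximation of tag matching, not fluent-bit's actual matcher. The `cloudwatch.stalwart.api` pattern and both destination names are assumptions for illustration; only the `mail` match appears in the diff.

```python
from fnmatch import fnmatchcase

# Hypothetical outputs keyed by the `match` pattern each one declares.
OUTPUTS = {
    "cloudwatch.stalwart.mail": "mail-stream",
    "cloudwatch.stalwart.api": "api-stream",
}


def tag_for(function: str) -> str:
    """Mirror the templated input tag: cloudwatch.stalwart.{{ function }}."""
    return f"cloudwatch.stalwart.{function}"


def route(tag: str) -> list[str]:
    """Return the destinations whose match pattern accepts the tag."""
    return [dest for pattern, dest in OUTPUTS.items() if fnmatchcase(tag, pattern)]
```

A record tagged for one function never reaches the other function's stream, which is the separation the PR description is after.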

# Map of template files to target files
TEMPLATE_MAP = {
    'fluent-bit.service.j2': '/usr/lib/systemd/system/fluent-bit.service',
    'fluent-bit.yaml.j2': '/etc/fluent-bit/fluent-bit.yaml',
Contributor Author

Add the new fluent-bit files to the list of things we template on a host.

# Map of template variable to EC2 tags
TEMPLATE_VALUE_TAG_MAP = {
    'env': 'environment',
    'function': 'postboot.stalwart.function',
Contributor Author

These are the new template variables based on instance tags.
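
A minimal sketch of how such a map could turn instance tags into a template context. The helper `template_values` and the sample tag values are hypothetical, and the real bootstrap script may read tags and handle missing ones differently.

```python
# The two entries visible in the diff hunk; the real map may hold more.
TEMPLATE_VALUE_TAG_MAP = {
    "env": "environment",
    "function": "postboot.stalwart.function",
}


def template_values(instance_tags: dict[str, str]) -> dict[str, str]:
    """Build the template context from EC2 tags (hypothetical helper)."""
    missing = [tag for tag in TEMPLATE_VALUE_TAG_MAP.values() if tag not in instance_tags]
    if missing:
        # Failing early beats rendering a config with an empty function name.
        raise KeyError(f"instance is missing required tags: {missing}")
    return {var: instance_tags[tag] for var, tag in TEMPLATE_VALUE_TAG_MAP.items()}


context = template_values({
    "environment": "stage",
    "postboot.stalwart.function": "mail",
    "Name": "example-node",  # unrelated tags are ignored
})
```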

self,
name: str,
project: tb_pulumi.ThunderbirdPulumiProject,
log_group_arn: str,
Contributor Author

This new variable gets passed down to the IAM setup so we can be sure the node profile includes write access to the Stalwart log group.

archive_files = [
    'bootstrap.py',
    'templates/fluent-bit.service.j2',
    'templates/fluent-bit.yaml.j2',
Contributor Author

These entries ensure that the new config templates wind up in the first-phase bootstrapping blob.

 private:
   - destination_cidr_block: 10.202.0.0/22 # observability-dev
-    vpc_peering_connection_id: pcx-0d2027442f0e54ca4
+    vpc_peering_connection_id: pcx-04d7e54008cd9326c
Contributor Author

The peering connection changed when I rebuilt dev for testing.

tb:cloudwatch:LogDestination:
  stalwart:
    log_group:
      retention_in_days: 3
Contributor Author

Shorter data retention in prod. This is actually the default value for the option, but I like to be explicit.

 Jinja2>=3.1,<4.0
 pulumi_cloudflare==6.6.0
-tb_pulumi @ git+https://github.com/thunderbird/pulumi.git@v0.0.16
+tb_pulumi @ git+https://github.com/thunderbird/pulumi.git@v0.0.18
Contributor Author

This is a future version with a bunch of little fixes to the core library that we need here, which is why we need the release to be done before merging this.

set -x
set -e

# Places data get stored
Contributor Author

@ryanjjung ryanjjung Apr 9, 2026

Fun fact: User data can't exceed 16KB. That's why we bzip2 the second phase blob - great compression ratio! This file, fully rendered, is sitting at 7,257 bytes as of this PR, so we're well within range still.
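
A quick way to sanity-check that budget, sketched with Python's standard `bz2` module. The payload here is made-up repetitive text, the helper names are hypothetical, and the check looks only at raw byte length; how the real blob is embedded in the user data script is not shown in this PR.

```python
import bz2

USER_DATA_LIMIT = 16 * 1024  # bytes; the 16KB cap mentioned above


def compress_blob(payload: bytes) -> bytes:
    """Compress a second-phase payload with bzip2 at the highest level."""
    return bz2.compress(payload, compresslevel=9)


def fits_in_user_data(rendered: bytes, limit: int = USER_DATA_LIMIT) -> bool:
    """True when the fully rendered user data stays within the cap."""
    return len(rendered) <= limit


# Illustrative payload only: repetitive text compresses very well.
payload = b"tag: cloudwatch.stalwart.mail\n" * 2000
blob = compress_blob(payload)
```

A check like `fits_in_user_data` could run in CI so that a growing template set fails loudly before an instance launch does.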

@ryanjjung ryanjjung marked this pull request as ready for review April 9, 2026 20:39
@ryanjjung ryanjjung requested review from aatchison and mzeier April 9, 2026 20:43
inputs:
  - name: systemd
    tag: cloudwatch.stalwart.{{ function }}
    systemd_filter: _SYSTEMD_UNIT=thundermail.service

Without a persisted cursor (for example DB) or a read_from_tail guard on this systemd input, Fluent Bit will read the existing thundermail.service journal on first boot and can replay entries again after service restarts. That seems likely to backfill or duplicate logs in CloudWatch for these long-lived nodes. Is that intentional?

Contributor Author

No, I'll revisit this.

Contributor Author

I opted for the db option since it seems less likely to miss messages.
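
The trade-off in this thread can be shown with a toy model of a persisted cursor. This is not how fluent-bit stores its position; `CursorReader` and the JSON state file are hypothetical and exist only to show why saved state avoids replaying old entries after a restart.

```python
import json
import tempfile
from pathlib import Path


class CursorReader:
    """Toy journal reader that remembers its position in a file on disk."""

    def __init__(self, state_path: Path):
        self.state_path = state_path

    def _load(self) -> int:
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())["cursor"]
        return 0  # no saved state: start from the oldest entry

    def read_new(self, journal: list[str]) -> list[str]:
        """Return entries past the saved cursor, then persist the new cursor."""
        start = self._load()
        fresh = journal[start:]
        self.state_path.write_text(json.dumps({"cursor": len(journal)}))
        return fresh


journal = ["msg-1", "msg-2"]
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "cursor.json"
    first = CursorReader(state).read_new(journal)
    journal.append("msg-3")
    # A fresh reader object models a service restart.
    after_restart = CursorReader(state).read_new(journal)
```

With the cursor on disk, the restarted reader picks up only `msg-3`. A tail-only reader would also avoid duplicates, but it would skip anything written while the service was down.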

return stalwart.StalwartCluster(
    f'{project.name_prefix}-stalwart',
    project=project,
    log_group_arn=logdests['stalwart'].resources['iam_policies']['write'].arn,

This wires the instances to the new LogDestination write policy, but that policy lives in the upcoming tb_pulumi release. From the implementation I reviewed, it looks like logs:CreateLogStream / logs:PutLogEvents may be scoped only to the log-group ARN rather than the log-stream ARNs. If that is still true in the cut release, Fluent Bit will bootstrap successfully but get AccessDenied when it tries to write events. Could we double-check the released policy shape before merging?

Contributor Author

No, this was a problem, but that was fixed by this PR, which is slated to go out with that release after it gets approved and merged.

Contributor Author

I just got that PR merged and have tested that code directly with success. When I go to prod with it, I'll double-check the policy that goes out there before shipping the fluent-bit configs to the live servers.
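
For reference while checking the deployed policy, here is a sketch of a write policy that covers both the log group and the streams inside it. It is not the policy tb_pulumi generates. The function name, account ID, and region are placeholders, and the code assumes the ARN passed in has no trailing `:*`.

```python
import json


def write_policy_document(log_group_arn: str) -> dict:
    """Build an IAM policy allowing stream creation and event writes.

    Assumes log_group_arn has no trailing ':*' suffix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": [
                    log_group_arn,
                    # Covers every stream inside the group.
                    f"{log_group_arn}:log-stream:*",
                ],
            }
        ],
    }


# Placeholder account ID and region; not values from this project.
example_arn = "arn:aws:logs:us-east-1:123456789012:log-group:/tb/stage/stalwart"
policy = write_policy_document(example_arn)
print(json.dumps(policy, indent=2))
```

Comparing the `Resource` list in the console against a shape like this makes the group-only scoping problem easy to spot.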

@ryanjjung ryanjjung requested a review from mzeier April 10, 2026 15:48
@ryanjjung
Contributor Author

I cut the new version of tb_pulumi at the end of last week, so we can proceed with this on v0.0.18. I will roll this to stage today, but will wait on review to get it into prod.

@ryanjjung
Contributor Author

Just deployed this to stage after setting the tb_pulumi version to v0.0.18 and rebuilding the virtual environment with it. The logs themselves are here, split by server function.

I also double-checked the read/write log policies, and they are correct.

Read:

[Image: Screenshot_20260413_093118]

Write:

[Image: Screenshot_20260413_093134]

@ryanjjung
Contributor Author

Let's hang on just a bit. I need to set the journald config to retain fewer logs as well.

@ryanjjung
Contributor Author

ryanjjung commented Apr 13, 2026

Ok, this is ready for review again. Vetted in stage.
