[CAI-753] New lambda chatbot monitor with its queue and dlq #2000

Open
uolter wants to merge 48 commits into main from CAI-753-lambda-monitor

Conversation

@uolter
Member

@uolter uolter commented Feb 9, 2026

List of Changes

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Chore (nothing changes by a user perspective)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@changeset-bot

changeset-bot bot commented Feb 9, 2026

🦋 Changeset detected

Latest commit: 076c822

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
infrastructure Major

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new “chatbot monitor” Lambda and supporting SQS queue/DLQ wiring, while also reshaping Langfuse infrastructure (module placement, networking access, and ClickHouse resource sizing) and updating the default chatbot generation model.

Changes:

  • Add a new chatbot monitor Lambda, plus a new monitor SQS FIFO queue and DLQ, and refactor SQS resources to a keyed for_each structure.
  • Move/encapsulate the Langfuse module under the chatbot module and expose a service-discovery endpoint output for internal access.
  • Update Langfuse ClickHouse EFS throughput mode and increase ClickHouse task CPU/memory; update default chatbot generation model ID.
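The keyed for_each refactor changes the Terraform addresses of the existing evaluate queue and DLQ, so without a hint Terraform would plan to destroy and recreate them. A minimal sketch of how the refactor and its moved blocks could fit together, assuming `local.chatbot_queues` is a set of queue names (its real definition is not shown on this page) and that the DLQ naming pattern is illustrative only. The old and new addresses are the ones visible in the diff hunks further down.

```hcl
locals {
  # Assumption: the real local may be a map carrying per-queue settings.
  chatbot_queues = toset(["evaluate", "monitor"])
}

resource "aws_sqs_queue" "chatbot_dlq" {
  for_each = local.chatbot_queues

  # Assumption: illustrative name; local.prefix comes from the existing module.
  # A FIFO source queue requires a FIFO dead-letter queue.
  name       = "${local.prefix}-${each.key}-dlq.fifo"
  fifo_queue = true
}

# Tell Terraform that the old single-instance resources now live under keys.
moved {
  from = aws_sqs_queue.chatbot_evaluate_queue
  to   = aws_sqs_queue.chatbot_queue["evaluate"]
}

moved {
  from = aws_sqs_queue.chatbot_evaluate_queue_dlq
  to   = aws_sqs_queue.chatbot_dlq["evaluate"]
}
```

With blocks like these, `terraform plan` should report the evaluate queue and DLQ as moved rather than replaced, provided their queue names stay the same; only the monitor entries would be created.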

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 13 comments.

Show a summary per file
| File | Description |
| --- | --- |
| apps/infrastructure/src/variables.tf | Update default chatbot generation model value. |
| apps/infrastructure/src/refacotr.tf | Add moved blocks for SQS refactor and Langfuse module relocation. |
| apps/infrastructure/src/README.md | Regenerate docs: remove top-level langfuse module entry; update model default shown. |
| apps/infrastructure/src/modules/langfuse/variables.tf | Add lambda SG input for Langfuse web ingress. |
| apps/infrastructure/src/modules/langfuse/security_group.tf | Change Langfuse web ingress handling to SG rules and attempt to include lambda SG. |
| apps/infrastructure/src/modules/langfuse/README.md | Regenerate module docs (new rule + new output). |
| apps/infrastructure/src/modules/langfuse/outputs.tf | Add service discovery endpoint output for langfuse-web. |
| apps/infrastructure/src/modules/langfuse/efs.tf | Switch ClickHouse EFS to throughput_mode = "elastic". |
| apps/infrastructure/src/modules/langfuse/ecs.tf | Increase ClickHouse task/container CPU & memory. |
| apps/infrastructure/src/modules/chatbot/variables.tf | Extend monitoring ECS config object; add hosted zone id input. |
| apps/infrastructure/src/modules/chatbot/sqs.tf | Refactor evaluate queue/DLQ into for_each and add monitor queue/DLQ. |
| apps/infrastructure/src/modules/chatbot/README.md | Regenerate module docs for new resources/inputs. |
| apps/infrastructure/src/modules/chatbot/langfuse.tf | Add nested Langfuse module instantiation from within chatbot. |
| apps/infrastructure/src/modules/chatbot/lambda_monitor.tf | Add the new monitor Lambda + IAM/logging/event source mapping. |
| apps/infrastructure/src/modules/chatbot/lambda_index.tf | Reuse shared lambda assume-role policy document. |
| apps/infrastructure/src/modules/chatbot/lambda_evaluate.tf | Update to new queue addresses; add env vars for monitor/evaluate queues. |
| apps/infrastructure/src/modules/chatbot/lambda_api.tf | Adjust env vars and IAM policy to send to the monitor queue. |
| apps/infrastructure/src/modules/chatbot/ecs.tf | Make Langfuse ECS desired/min/max capacities configurable. |
| apps/infrastructure/src/modules/chatbot/ecr.tf | Add a new ECR repo definition for the monitor lambda image. |
| apps/infrastructure/src/modules/chatbot/data.tf | Add shared lambda_assume_role policy document. |
| apps/infrastructure/src/main.tf | Pass hosted zone id into chatbot; remove top-level langfuse module block. |
| apps/infrastructure/src/env/dev/terraform.tfvars | Override monitoring ECS desired count and update model generation. |
| .changeset/small-sheep-like.md | Changeset entry for EFS throughput mode change. |
| .changeset/six-animals-hear.md | Changeset entry for ClickHouse CPU/RAM change. |
| .changeset/shy-hoops-trade.md | Changeset entry for chatbot monitor lambda addition. |
| .changeset/polite-months-design.md | Changeset entry for model id update. |
| .changeset/metal-knives-grow.md | Changeset entry for ClickHouse CPU update. |
| .changeset/dry-llamas-beam.md | Changeset entry for scaling Langfuse down in dev. |
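The file summary says lambda_monitor.tf adds an event source mapping for the new Lambda. As a rough sketch only, wiring the monitor queue to the function could look like the following. The function resource name `aws_lambda_function.chatbot_monitor` and the batch settings are hypothetical and not taken from the PR; only the queue address `aws_sqs_queue.chatbot_queue["monitor"]` appears in the diff.

```hcl
resource "aws_lambda_event_source_mapping" "chatbot_monitor" {
  event_source_arn = aws_sqs_queue.chatbot_queue["monitor"].arn

  # Hypothetical function resource name, used here for illustration.
  function_name = aws_lambda_function.chatbot_monitor.arn

  # For FIFO queues the batch size can be at most 10.
  batch_size = 10

  # Let the handler report which messages failed instead of failing the whole batch.
  function_response_types = ["ReportBatchItemFailures"]
}
```

A mapping like this needs the function's execution role to allow `sqs:ReceiveMessage`, `sqs:DeleteMessage` and `sqs:GetQueueAttributes` on the queue, which are the actions listed in the evaluate policy hunk further down. With `maxReceiveCount = 2` in the redrive policy, a message that keeps failing would move to the DLQ after its second receive.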

Comment on lines +101 to +112
# TODO: the for_each fails when aws_security_group.lb.id is a string known only after apply. How can this be solved?
resource "aws_security_group_rule" "langfuse_web_lambda_ingress" {
  for_each = { for id in [aws_security_group.lb.id, var.lambda_security_group_id] : id => id if id != null }

  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.langfuse_web.id
  source_security_group_id = each.value
  description              = "Allow lambda monitor access to langfuse-web"
}


# TODO: the for_each fails when aws_security_group.lb.id is a string known only after apply. How can this be solved?
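One common way around the error described in the TODO is to give for_each keys that are known at plan time and keep the security group IDs only as values. A minimal sketch, assuming a hypothetical boolean variable `enable_lambda_ingress` decides whether the Lambda rule is created, since filtering on the ID itself can fail when that ID is not known until apply:

```hcl
variable "enable_lambda_ingress" {
  # Hypothetical flag, not part of the PR.
  type    = bool
  default = false
}

locals {
  # Keys are static strings; only the id values may be unknown until apply.
  langfuse_web_ingress_candidates = {
    lb     = { id = aws_security_group.lb.id, enabled = true }
    lambda = { id = var.lambda_security_group_id, enabled = var.enable_lambda_ingress }
  }

  # The filter uses only the plan-time flag, never the (possibly unknown) id.
  langfuse_web_ingress_sources = {
    for name, src in local.langfuse_web_ingress_candidates : name => src.id if src.enabled
  }
}

resource "aws_security_group_rule" "langfuse_web_lambda_ingress" {
  for_each = local.langfuse_web_ingress_sources

  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.langfuse_web.id
  source_security_group_id = each.value
  description              = "Allow ${each.key} access to langfuse-web"
}
```

The design choice here is that the set of rule instances depends only on values known during planning, so the security group IDs can be resolved later. Two separate resources, one unconditional and one with `count`, would be an equally valid alternative.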
Comment on lines +14 to 27
resource "aws_sqs_queue" "chatbot_queue" {
  for_each = local.chatbot_queues

  name                        = "${local.prefix}-${each.key}-queue.fifo"
  fifo_queue                  = true
  content_based_deduplication = true
  deduplication_scope         = "messageGroup"
  fifo_throughput_limit       = "perMessageGroupId"
  visibility_timeout_seconds  = 120

  redrive_policy = jsonencode({
-   deadLetterTargetArn = aws_sqs_queue.chatbot_evaluate_queue_dlq.arn
+   deadLetterTargetArn = aws_sqs_queue.chatbot_dlq[each.key].arn
    maxReceiveCount     = 2
  })
@@ -75,15 +64,17 @@ resource "aws_iam_role_policy" "lambda_evaluate_policy" {
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:SendMessage"
"infrastructure": minor
---

Update Langfuse ClickHouse EFS from bursting to elastic.
Comment on lines +4 to +12
count = var.environment == "prod" ? 0 : 1

environment = var.environment
region = var.aws_region
vpc_id = var.vpc.id
private_subnet_ids = var.vpc.private_subnets
public_subnet_ids = var.vpc.public_subnets
custom_domain_id = var.hosted_zone_id
custom_domain_name = var.dns_domain_name
Comment on lines 325 to 336
resource "aws_iam_policy" "chatbot_monitor_queue" {
  name        = "lambda-sqs-send"
  description = "Allow Lambda to send messages to SQS queue"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:SendMessage", "sqs:GetQueueUrl"]
-       Resource = aws_sqs_queue.chatbot_evaluate_queue.arn
+       Resource = aws_sqs_queue.chatbot_queue["monitor"].arn
      }
    ]
Comment on lines 69 to 78
{
Effect = "Allow"
Action = [
"sqs:SendMessage",

"sqs:GetQueueUrl",
]
Resource = [
aws_sqs_queue.chatbot_queue["monitor"].arn,
]
Resource = aws_sqs_queue.chatbot_evaluate_queue_dlq.arn
},
Comment on lines +73 to +80
"sqs:SendMessage",
"sqs:GetQueueUrl"
]
Resource = aws_sqs_queue.chatbot_queue["evaluate"].arn
},
{
Effect = "Allow"
Action = [
uolter and others added 3 commits March 23, 2026 08:59
* lambda api and evaluate new env variable

* + change set

* Apply suggestion from @mdciri

Co-authored-by: Marco Domenico Cirillo <59966344+mdciri@users.noreply.github.com>

* lambda evaluate new env variables google service account

---------

Co-authored-by: Marco Domenico Cirillo <59966344+mdciri@users.noreply.github.com>
* refactor permissions lambda evaluate and index

* +changeset
@github-actions
Contributor

Branch is not up to date with base branch

@uolter it seems this Pull Request is not updated with base branch.
Please proceed with a merge or rebase to solve this.

@github-actions
Contributor

github-actions bot commented Mar 24, 2026

Jira Pull Request Link

This Pull Request refers to the following Jira issue CAI-753

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

This pull request is stale because it has been open for 14 days with no activity. If the pull request is still valid, please update it within 21 days to keep it open or merge it, otherwise it will be closed automatically.

2 participants