[CAI-753] New lambda chatbot monitor with its queue and dlq #2000
🦋 Changeset detected. Latest commit: 076c822. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
…o CAI-753-lambda-monitor
* Update clickhouse RAM and CPU * + changeset
* Update clickhouse RAM and CPU * + changeset * clickhouse task definition update
…veloper-portal into CAI-753-lambda-monitor
* update var chb_moded_id * chb-model set in lambda evaluate too. * terraform fmt
Pull request overview
This PR introduces a new “chatbot monitor” Lambda and supporting SQS queue/DLQ wiring, while also reshaping Langfuse infrastructure (module placement, networking access, and ClickHouse resource sizing) and updating the default chatbot generation model.
Changes:
- Add a new chatbot monitor Lambda, plus a new `monitor` SQS FIFO queue and DLQ, and refactor SQS resources to a keyed `for_each` structure.
- Move/encapsulate the Langfuse module under the chatbot module and expose a service-discovery endpoint output for internal access.
- Update Langfuse ClickHouse EFS throughput mode and increase ClickHouse task CPU/memory; update default chatbot generation model ID.
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| apps/infrastructure/src/variables.tf | Update default chatbot generation model value. |
| apps/infrastructure/src/refacotr.tf | Add moved blocks for SQS refactor and Langfuse module relocation. |
| apps/infrastructure/src/README.md | Regenerate docs: remove top-level langfuse module entry; update model default shown. |
| apps/infrastructure/src/modules/langfuse/variables.tf | Add lambda SG input for Langfuse web ingress. |
| apps/infrastructure/src/modules/langfuse/security_group.tf | Change Langfuse web ingress handling to SG rules and attempt to include lambda SG. |
| apps/infrastructure/src/modules/langfuse/README.md | Regenerate module docs (new rule + new output). |
| apps/infrastructure/src/modules/langfuse/outputs.tf | Add service discovery endpoint output for langfuse-web. |
| apps/infrastructure/src/modules/langfuse/efs.tf | Switch ClickHouse EFS to throughput_mode = "elastic". |
| apps/infrastructure/src/modules/langfuse/ecs.tf | Increase ClickHouse task/container CPU & memory. |
| apps/infrastructure/src/modules/chatbot/variables.tf | Extend monitoring ECS config object; add hosted zone id input. |
| apps/infrastructure/src/modules/chatbot/sqs.tf | Refactor evaluate queue/DLQ into for_each and add monitor queue/DLQ. |
| apps/infrastructure/src/modules/chatbot/README.md | Regenerate module docs for new resources/inputs. |
| apps/infrastructure/src/modules/chatbot/langfuse.tf | Add nested Langfuse module instantiation from within chatbot. |
| apps/infrastructure/src/modules/chatbot/lambda_monitor.tf | Add the new monitor Lambda + IAM/logging/event source mapping. |
| apps/infrastructure/src/modules/chatbot/lambda_index.tf | Reuse shared lambda assume-role policy document. |
| apps/infrastructure/src/modules/chatbot/lambda_evaluate.tf | Update to new queue addresses; add env vars for monitor/evaluate queues. |
| apps/infrastructure/src/modules/chatbot/lambda_api.tf | Adjust env vars and IAM policy to send to the monitor queue. |
| apps/infrastructure/src/modules/chatbot/ecs.tf | Make Langfuse ECS desired/min/max capacities configurable. |
| apps/infrastructure/src/modules/chatbot/ecr.tf | Add a new ECR repo definition for the monitor lambda image. |
| apps/infrastructure/src/modules/chatbot/data.tf | Add shared lambda_assume_role policy document. |
| apps/infrastructure/src/main.tf | Pass hosted zone id into chatbot; remove top-level langfuse module block. |
| apps/infrastructure/src/env/dev/terraform.tfvars | Override monitoring ECS desired count and update model generation. |
| .changeset/small-sheep-like.md | Changeset entry for EFS throughput mode change. |
| .changeset/six-animals-hear.md | Changeset entry for ClickHouse CPU/RAM change. |
| .changeset/shy-hoops-trade.md | Changeset entry for chatbot monitor lambda addition. |
| .changeset/polite-months-design.md | Changeset entry for model id update. |
| .changeset/metal-knives-grow.md | Changeset entry for ClickHouse CPU update. |
| .changeset/dry-llamas-beam.md | Changeset entry for scaling Langfuse down in dev. |
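
The `moved` blocks listed for `refacotr.tf` are not quoted in this review. As a rough sketch of what such blocks could look like for the SQS refactor, assuming the chatbot module is called as `module.chatbot` without `count` or `for_each` (an assumption; the resource names are the ones that appear in the hunks of this review):

```hcl
# Hypothetical sketch: re-address the existing queues in state so that
# Terraform does not plan a destroy/create for the renamed resources.
# "module.chatbot" is an assumed module call name.
moved {
  from = module.chatbot.aws_sqs_queue.chatbot_evaluate_queue
  to   = module.chatbot.aws_sqs_queue.chatbot_queue["evaluate"]
}

moved {
  from = module.chatbot.aws_sqs_queue.chatbot_evaluate_queue_dlq
  to   = module.chatbot.aws_sqs_queue.chatbot_dlq["evaluate"]
}
```

A `moved` block only changes the address in state. If the queue's `name` argument also changes as part of the refactor, the plan would still show a replacement, so it is worth checking the plan output before applying.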
    # TODO: the for_each fails when aws_security_group.lb.id is a string known only after apply. How can this be solved?
    resource "aws_security_group_rule" "langfuse_web_lambda_ingress" {
      for_each = { for id in [aws_security_group.lb.id, var.lambda_security_group_id] : id => id if id != null }

      type                     = "ingress"
      from_port                = 3000
      to_port                  = 3000
      protocol                 = "tcp"
      security_group_id        = aws_security_group.langfuse_web.id
      source_security_group_id = each.value
      description              = "Allow lambda monitor access to langfuse-web"
    }

    # TODO: the for_each fails when aws_security_group.lb.id is a string known only after apply. How can this be solved?
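
One possible way to address the TODO, offered as a sketch rather than the author's fix: `for_each` needs its keys to be known at plan time, while the values may stay unknown until apply. Keying the map by static strings instead of by the security group IDs avoids the error. The local name `langfuse_web_ingress_sources` is hypothetical, and the sketch assumes `var.lambda_security_group_id` is always set (the `null` filter is dropped, because filtering on a value that is unknown at plan time makes the key set unknown again).

```hcl
locals {
  # Static keys ("lb", "lambda") are known at plan time; the IDs on the
  # right-hand side may be "known after apply" without breaking for_each.
  # Assumption: var.lambda_security_group_id is never null.
  langfuse_web_ingress_sources = {
    lb     = aws_security_group.lb.id
    lambda = var.lambda_security_group_id
  }
}

resource "aws_security_group_rule" "langfuse_web_lambda_ingress" {
  for_each = local.langfuse_web_ingress_sources

  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.langfuse_web.id
  source_security_group_id = each.value
  description              = "Allow ${each.key} access to langfuse-web"
}
```

If the Lambda security group really is optional, a separate resource guarded by a boolean input known at plan time would be another option.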
     resource "aws_sqs_queue" "chatbot_queue" {
       for_each = local.chatbot_queues

       name                        = "${local.prefix}-${each.key}-queue.fifo"
       fifo_queue                  = true
       content_based_deduplication = true
       deduplication_scope         = "messageGroup"
       fifo_throughput_limit       = "perMessageGroupId"
       visibility_timeout_seconds  = 120

       redrive_policy = jsonencode({
    -    deadLetterTargetArn = aws_sqs_queue.chatbot_evaluate_queue_dlq.arn
    +    deadLetterTargetArn = aws_sqs_queue.chatbot_dlq[each.key].arn
         maxReceiveCount     = 2
       })
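
The hunk above references `local.chatbot_queues` and `aws_sqs_queue.chatbot_dlq`, neither of which is quoted in this review. A minimal sketch of how the two might fit together, assuming the local is a map keyed by queue name (its real shape and the DLQ naming pattern are assumptions):

```hcl
locals {
  # Assumed shape: one entry per queue. The keys become the for_each keys
  # used by both the main queues and their dead-letter queues.
  chatbot_queues = {
    evaluate = {}
    monitor  = {}
  }
}

resource "aws_sqs_queue" "chatbot_dlq" {
  for_each = local.chatbot_queues

  # The dead-letter queue of a FIFO queue must itself be a FIFO queue,
  # and FIFO queue names must end in ".fifo".
  name       = "${local.prefix}-${each.key}-dlq.fifo" # hypothetical name pattern
  fifo_queue = true
}
```

Because both resources iterate over the same map, `aws_sqs_queue.chatbot_dlq[each.key]` in the redrive policy always resolves to the DLQ that belongs to the queue being created.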
    @@ -75,15 +64,17 @@ resource "aws_iam_role_policy" "lambda_evaluate_policy" {
          "sqs:DeleteMessage",
          "sqs:GetQueueAttributes",
          "sqs:ReceiveMessage",
          "sqs:DeleteMessage",
          "sqs:GetQueueAttributes",
          "sqs:SendMessage"
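
For context, `sqs:ReceiveMessage`, `sqs:DeleteMessage` and `sqs:GetQueueAttributes` are the permissions a Lambda execution role needs for an SQS event source mapping. The mapping for the new monitor Lambda is not quoted in this review; a sketch of what it could look like, where `aws_lambda_function.chatbot_monitor` is a hypothetical resource name:

```hcl
resource "aws_lambda_event_source_mapping" "chatbot_monitor" {
  # Queue address taken from the refactored for_each resource.
  event_source_arn = aws_sqs_queue.chatbot_queue["monitor"].arn

  # Hypothetical function resource; the real name lives in lambda_monitor.tf.
  function_name = aws_lambda_function.chatbot_monitor.arn

  # Assumed value: process one message per invocation.
  batch_size = 1
}
```

AWS guidance is to set the queue's visibility timeout to at least six times the function timeout, so the 120 second visibility timeout on the queue is worth comparing against the timeout chosen for the monitor function.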
    "infrastructure": minor
    ---

    Update Langfuse ClickHouse EFS from bursting to elastic.
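
The `efs.tf` change itself is not quoted here. As an illustration only, switching an EFS file system to elastic throughput is a one-argument change on `aws_efs_file_system`; the resource name and the `encrypted` setting below are assumptions:

```hcl
resource "aws_efs_file_system" "clickhouse" { # hypothetical resource name
  encrypted = true # assumed setting, not taken from the PR

  # Previously "bursting"; "elastic" scales throughput with the workload.
  # provisioned_throughput_in_mibps applies only to "provisioned" mode,
  # so it is left unset here.
  throughput_mode = "elastic"
}
```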
      count = var.environment == "prod" ? 0 : 1

      environment        = var.environment
      region             = var.aws_region
      vpc_id             = var.vpc.id
      private_subnet_ids = var.vpc.private_subnets
      public_subnet_ids  = var.vpc.public_subnets
      custom_domain_id   = var.hosted_zone_id
      custom_domain_name = var.dns_domain_name
     resource "aws_iam_policy" "chatbot_monitor_queue" {
       name        = "lambda-sqs-send"
       description = "Allow Lambda to send messages to SQS queue"
       policy = jsonencode({
         Version = "2012-10-17"
         Statement = [
           {
             Effect   = "Allow"
             Action   = ["sqs:SendMessage", "sqs:GetQueueUrl"]
    -        Resource = aws_sqs_queue.chatbot_evaluate_queue.arn
    +        Resource = aws_sqs_queue.chatbot_queue["monitor"].arn
           }
         ]
    {
      Effect = "Allow"
      Action = [
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
      ]
      Resource = [
        aws_sqs_queue.chatbot_queue["monitor"].arn,
      ]
      Resource = aws_sqs_queue.chatbot_evaluate_queue_dlq.arn
    },
        "sqs:SendMessage",
        "sqs:GetQueueUrl"
      ]
      Resource = aws_sqs_queue.chatbot_queue["evaluate"].arn
    },
    {
      Effect = "Allow"
      Action = [
* lambda api and evaluate new env variable * + change set * Apply suggestion from @mdciri Co-authored-by: Marco Domenico Cirillo <59966344+mdciri@users.noreply.github.com> * lambda evaluate new env variables google service account --------- Co-authored-by: Marco Domenico Cirillo <59966344+mdciri@users.noreply.github.com>
* refactor permissions lambda evaluate and index * +changeset
Branch is not up to date with base branch: @uolter it seems this Pull Request is not updated with base branch.
Jira Pull Request Link: This Pull Request refers to the following Jira issue CAI-753.
This pull request is stale because it has been open for 14 days with no activity. If the pull request is still valid, please update it within 21 days to keep it open or merge it, otherwise it will be closed automatically.
List of Changes
Motivation and Context
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: