16 changes: 16 additions & 0 deletions AGENTS.md
@@ -198,6 +198,22 @@ Use `agent-browser` for visual verification of docs site changes. Environment-sp
- **Always use `--session <name>`** — isolates browser instances; close with `agent-browser --session <name> close` when done
- **Never use `--headed`** — no display server available; headless (default) works correctly

### Agent Provider Eval Concurrency

When running evals against agent provider targets (claude, claude-sdk, codex, copilot, copilot-sdk, pi, pi-cli), **limit concurrency to 3 targets at a time**. Each agent provider spawns heavyweight subprocesses (CLI binaries, SDK sessions) that consume significant memory and CPU. Running more than 3 in parallel can exhaust system resources.

```bash
# Good: batch targets in groups of 2-3
bun apps/cli/src/cli.ts eval my.EVAL.yaml --target claude &
bun apps/cli/src/cli.ts eval my.EVAL.yaml --target codex &
wait
bun apps/cli/src/cli.ts eval my.EVAL.yaml --target copilot &
bun apps/cli/src/cli.ts eval my.EVAL.yaml --target pi &
wait
```

This limit does not apply to lightweight LLM-only targets (azure, openai, gemini, openrouter), which can run with higher concurrency.
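
A fixed `&`/`wait` batch holds back the next group until the slowest job in the current group finishes. As a minimal sketch of an alternative, assuming an `xargs` that supports `-P` (GNU and BSD versions do), the helper below keeps at most three evals running and starts the next target as soon as a slot frees up. `run_targets`, `EVAL_CMD`, and `MAX_PARALLEL` are illustrative names, not part of the CLI.

```bash
#!/usr/bin/env bash
# Sketch only: run one eval per target, never more than MAX_PARALLEL at once.
# EVAL_CMD and MAX_PARALLEL are hypothetical knobs introduced for this example.
MAX_PARALLEL="${MAX_PARALLEL:-3}"
EVAL_CMD="${EVAL_CMD:-bun apps/cli/src/cli.ts eval my.EVAL.yaml --target}"

run_targets() {
  # Feed one target name per line to xargs.
  # -P caps the number of parallel jobs; -n 1 appends one target per invocation.
  # $EVAL_CMD is left unquoted on purpose so it splits into command + arguments.
  printf '%s\n' "$@" | xargs -P "$MAX_PARALLEL" -n 1 $EVAL_CMD
}
```

Calling `run_targets claude codex copilot pi` would then start three evals immediately and launch the fourth when any of them exits. Output from parallel jobs may interleave, so per-target log files are still the easier thing to read afterwards.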

### Verifying Evaluator Changes

Unit tests alone are insufficient for evaluator changes. After implementing or modifying evaluators:
160 changes: 155 additions & 5 deletions bun.lock

Large diffs are not rendered by default.

33 changes: 29 additions & 4 deletions examples/features/.agentv/targets.yaml
@@ -48,7 +48,7 @@ targets:
     api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
     model: ${{ GEMINI_MODEL_NAME }}
 
-  # Pi Coding Agent - autonomous coding agent from pi-mono
+  # Pi Coding Agent - autonomous coding agent from pi-mono (SDK)
   - name: pi
     provider: pi-coding-agent
     subprovider: openrouter
@@ -59,17 +59,42 @@ targets:
     log_format: json # 'summary' (default) or 'json' for raw event logs
     # system_prompt: optional override (default instructs agent to include code in response)
 
-  # GitHub Copilot (uses copilot CLI)
+  # Pi CLI - subprocess-based Pi agent
+  - name: pi-cli
+    provider: pi-cli
+    subprovider: openrouter
+    model: openai/gpt-5.4
+    api_key: ${{ OPENROUTER_API_KEY }}
+    grader_target: gemini-llm
+    log_format: json
+
+  # GitHub Copilot - CLI subprocess
   - name: copilot
-    provider: copilot
+    provider: copilot-cli
     model: gpt-5-mini
     grader_target: gemini-llm
     log_format: json
 
-  # Claude - Anthropic's Claude Agent SDK
+  # GitHub Copilot - SDK
+  # Note: copilot-sdk discovers skills via grep (keyword search) rather than
+  # reading skill files directly. The skill-trigger evaluator only checks tool
+  # inputs for the skill name, so copilot-sdk may fail positive trigger cases.
+  - name: copilot-sdk
+    provider: copilot-sdk
+    model: gpt-5-mini
+    grader_target: gemini-llm
+    log_format: json
+
+  # Claude - CLI subprocess
   - name: claude
     provider: claude
     grader_target: gemini-llm
     # model: claude-sonnet-4-20250514 # Optional: override model
     log_format: json # 'summary' (default) or 'json' for raw event logs
     # system_prompt: optional override (default instructs agent to include code in response)
+
+  # Claude SDK - direct SDK invocation (requires @anthropic-ai/claude-agent-sdk)
+  - name: claude-sdk
+    provider: claude-sdk
+    grader_target: gemini-llm
+    log_format: json
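
The note on `copilot-sdk` in the hunk above can be made concrete with a rough sketch of a name-based trigger check. This is not the evaluator's actual code, which is not part of this diff. It assumes, hypothetically, that each tool call's input is available as plain text, and `skill_triggered` is an illustrative name.

```bash
#!/usr/bin/env bash
# Sketch only: a trigger check that looks for the skill name in tool inputs.
# Hypothetical helper; the real skill-trigger evaluator is not shown in this diff.
#   $1 = skill name, $2 = text of the tool inputs (one call per line)
skill_triggered() {
  # -F matches the skill name literally; -q reports only via exit status.
  printf '%s\n' "$2" | grep -qF -- "$1"
}
```

Under that assumption, a tool call that reads `.claude/skills/acme-deploy/SKILL.md` counts as a trigger. A keyword search such as `grep -ri deploy .` never mentions `acme-deploy` in its input, so it would not count, which is consistent with the positive cases failing for `copilot-sdk`.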
Original file line number Diff line number Diff line change
@@ -2,8 +2,13 @@
 #
 # Uses an isolated workspace template (workspace/) so no global skill installation
 # is needed. The workspace contains:
-# .claude/skills/csv-analyzer/ — for claude/copilot providers
-# .agents/skills/csv-analyzer/ — for codex provider
+# .claude/skills/acme-deploy/ — for claude/copilot providers
+# .agents/skills/acme-deploy/ — for codex/pi providers
+# .codex/skills/acme-deploy/ — for codex provider (fallback)
+#
+# The skill contains proprietary deployment procedures (internal CLI, service
+# registry, approval rules) that no agent can answer from built-in knowledge.
+# This ensures agents must invoke/read the skill to respond correctly.
 #
 # The same EVAL.yaml works with any provider — just change --target:
 #
@@ -21,17 +26,17 @@ tests:
   # === Positive cases: skill should trigger ===
 
   - id: should-trigger-direct-request
-    input: "Analyze this CSV file and tell me the top 5 months by revenue"
+    input: "How do I deploy payments-api to production?"
     assertions:
       - type: skill-trigger
-        skill: csv-analyzer
+        skill: acme-deploy
         should_trigger: true
 
   - id: should-trigger-casual-phrasing
-    input: "hey can you look at sales-data.csv and figure out which quarter had the most growth"
+    input: "I need to roll back user-service in staging, what's the Acme deploy procedure for that?"
     assertions:
       - type: skill-trigger
-        skill: csv-analyzer
+        skill: acme-deploy
         should_trigger: true
 
   # === Negative cases: skill should NOT trigger ===
@@ -40,12 +45,12 @@ tests:
     input: "What time is it in Tokyo right now?"
     assertions:
       - type: skill-trigger
-        skill: csv-analyzer
+        skill: acme-deploy
         should_trigger: false
 
   - id: should-not-trigger-near-miss
-    input: "Help me write a Python script that reads CSV files using pandas"
+    input: "Write a Python function that parses JSON logs and extracts error messages"
     assertions:
       - type: skill-trigger
-        skill: csv-analyzer
+        skill: acme-deploy
         should_trigger: false
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
---
name: acme-deploy
description: Use when the user asks about deploying services, checking deployment status, rollback procedures, or release management at Acme Corp
---

# Acme Corp Deployment Procedures

## Overview

Internal deployment runbook for Acme Corp services. All deployments follow the Trident release pipeline.

## Deployment Commands

### Deploy to staging
```bash
trident push --env staging --service <service-name> --tag <git-sha>
```

### Promote to production
```bash
trident promote --from staging --to prod --service <service-name> --approval-ticket <JIRA-ID>
```
Production deploys require a JIRA approval ticket (prefix: DEPLOY-).

### Rollback
```bash
trident rollback --env <env> --service <service-name> --to-version <previous-tag>
```
Rollbacks auto-notify #ops-alerts in Slack.

### Check deployment status
```bash
trident status --env <env> --service <service-name>
```

## Service Registry

| Service | Owner Team | Staging URL | Prod URL |
|---------|-----------|-------------|----------|
| payments-api | Platform | payments.staging.acme.internal | payments.acme.internal |
| user-service | Identity | users.staging.acme.internal | users.acme.internal |
| notifications | Engagement | notify.staging.acme.internal | notify.acme.internal |

## Rules

- All prod deploys require a DEPLOY- JIRA ticket
- Staging deploys are auto-approved during business hours (9am-5pm PT)
- Rollbacks bypass approval but require post-mortem within 48h
- Deploy freezes are announced in #engineering-announcements

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
---
name: acme-deploy
description: Use when the user asks about deploying services, checking deployment status, rollback procedures, or release management at Acme Corp
---

# Acme Corp Deployment Procedures

## Overview

Internal deployment runbook for Acme Corp services. All deployments follow the Trident release pipeline.

## Deployment Commands

### Deploy to staging
```bash
trident push --env staging --service <service-name> --tag <git-sha>
```

### Promote to production
```bash
trident promote --from staging --to prod --service <service-name> --approval-ticket <JIRA-ID>
```
Production deploys require a JIRA approval ticket (prefix: DEPLOY-).

### Rollback
```bash
trident rollback --env <env> --service <service-name> --to-version <previous-tag>
```
Rollbacks auto-notify #ops-alerts in Slack.

### Check deployment status
```bash
trident status --env <env> --service <service-name>
```

## Service Registry

| Service | Owner Team | Staging URL | Prod URL |
|---------|-----------|-------------|----------|
| payments-api | Platform | payments.staging.acme.internal | payments.acme.internal |
| user-service | Identity | users.staging.acme.internal | users.acme.internal |
| notifications | Engagement | notify.staging.acme.internal | notify.acme.internal |

## Rules

- All prod deploys require a DEPLOY- JIRA ticket
- Staging deploys are auto-approved during business hours (9am-5pm PT)
- Rollbacks bypass approval but require post-mortem within 48h
- Deploy freezes are announced in #engineering-announcements

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
---
name: acme-deploy
description: Use when the user asks about deploying services, checking deployment status, rollback procedures, or release management at Acme Corp
---

# Acme Corp Deployment Procedures

## Overview

Internal deployment runbook for Acme Corp services. All deployments follow the Trident release pipeline.

## Deployment Commands

### Deploy to staging
```bash
trident push --env staging --service <service-name> --tag <git-sha>
```

### Promote to production
```bash
trident promote --from staging --to prod --service <service-name> --approval-ticket <JIRA-ID>
```
Production deploys require a JIRA approval ticket (prefix: DEPLOY-).

### Rollback
```bash
trident rollback --env <env> --service <service-name> --to-version <previous-tag>
```
Rollbacks auto-notify #ops-alerts in Slack.

### Check deployment status
```bash
trident status --env <env> --service <service-name>
```

## Service Registry

| Service | Owner Team | Staging URL | Prod URL |
|---------|-----------|-------------|----------|
| payments-api | Platform | payments.staging.acme.internal | payments.acme.internal |
| user-service | Identity | users.staging.acme.internal | users.acme.internal |
| notifications | Engagement | notify.staging.acme.internal | notify.acme.internal |

## Rules

- All prod deploys require a DEPLOY- JIRA ticket
- Staging deploys are auto-approved during business hours (9am-5pm PT)
- Rollbacks bypass approval but require post-mortem within 48h
- Deploy freezes are announced in #engineering-announcements

This file was deleted.

5 changes: 3 additions & 2 deletions examples/features/agent-skills-evals/workspace/AGENTS.md
@@ -7,5 +7,6 @@ Check for a relevant skill before responding to any task.
 
 Available skills:
 
-- **csv-analyzer** — use when the user asks to analyze, summarize, or extract
-  insights from CSV data or files. Skill file: `.agents/skills/csv-analyzer/SKILL.md`
+- **acme-deploy** — use when the user asks about deploying services, checking
+  deployment status, rollback procedures, or release management at Acme Corp.
+  Skill file: `.agents/skills/acme-deploy/SKILL.md`
13 changes: 0 additions & 13 deletions examples/features/agent-skills-evals/workspace/sales.csv

This file was deleted.

2 changes: 1 addition & 1 deletion packages/core/package.json
@@ -46,7 +46,7 @@
     "@ai-sdk/azure": "^3.0.0",
     "@ai-sdk/google": "^3.0.0",
     "@ai-sdk/openai": "^3.0.0",
-    "@anthropic-ai/claude-agent-sdk": "^0.2.49",
+    "@anthropic-ai/claude-agent-sdk": "^0.2.88",
     "@github/copilot-sdk": "^0.1.25",
     "@openai/codex-sdk": "^0.104.0",
     "@openrouter/ai-sdk-provider": "^2.3.1",