42 commits
53d4a06
test: add granularity marker taxonomy infrastructure (#727)
planetf1 Mar 25, 2026
4ea0c50
test: add audit-markers skill for test classification (#728)
planetf1 Mar 25, 2026
4f248dc
chore: add CLAUDE.md and agent skills infrastructure
planetf1 Mar 25, 2026
9c82f82
test: improve audit-markers skill quality and add resource predicates
planetf1 Mar 25, 2026
4f1db52
chore: remove issue references from audit-markers skill
planetf1 Mar 25, 2026
fc72f3f
docs: align MARKERS_GUIDE.md with predicate factory pattern
planetf1 Mar 25, 2026
845c6ad
fix: validate_skill.py schema mismatch and brittle YAML parsing
planetf1 Mar 25, 2026
62d4bbd
fix: migrate deprecated llm markers to e2e, add backend registry, upd…
planetf1 Mar 25, 2026
c3a5651
feat: add estimate-vram skill and fix MPS VRAM detection
planetf1 Mar 25, 2026
6a3d6f3
refactor: fold estimate-vram into audit-markers skill
planetf1 Mar 25, 2026
3288541
docs: drop isolation refs and fix RAM guidance in markers docs
planetf1 Mar 25, 2026
914502d
docs: add legacy marker guidance for example files in audit-markers s…
planetf1 Mar 25, 2026
ab8ad75
refactor: remove require_ollama() predicate — redundant with backend …
planetf1 Mar 25, 2026
be39488
refactor: replace requires_heavy_ram gate with huggingface backend ma…
planetf1 Mar 25, 2026
ab8a20f
refactor: replace ad-hoc bedrock skipif with require_api_key predicate
planetf1 Mar 25, 2026
c0c004e
refactor: migrate legacy resource markers to predicates
planetf1 Mar 26, 2026
01fdc1e
test: skip collection gracefully when optional backend deps are missing
planetf1 Mar 26, 2026
c6d565e
test: refine integration marker definition and apply audit fixes
planetf1 Mar 26, 2026
7ccf182
test: add importorskip guards and optional-dep skip logic for examples
planetf1 Mar 26, 2026
32f1f9b
fix: convert example import errors to skips; add cpex importorskip gu…
planetf1 Mar 26, 2026
3e6ec88
test: skip OTel-dependent tests when opentelemetry not installed
planetf1 Mar 26, 2026
c6fbfb6
fix: use conservative heuristic for Apple Silicon GPU memory detection
planetf1 Mar 26, 2026
8cee781
test: add training memory signals to audit-markers skill; bump alora …
planetf1 Mar 26, 2026
28808ff
fix: cache system capabilities result in examples conftest
planetf1 Mar 26, 2026
7f05eb8
fix: cache get_system_capabilities() result in test/conftest.py
planetf1 Mar 26, 2026
66d35f0
fix: flush MPS memory pool in intrinsic test fixture teardown
planetf1 Mar 27, 2026
355154f
fix: load LocalHFBackend model in config dtype to prevent float32 upc…
planetf1 Mar 27, 2026
601162c
test: remove --isolate-heavy process isolation and bump intrinsic VRA…
planetf1 Mar 27, 2026
58d2692
test: migrate legacy markers in test_intrinsics_formatters.py
planetf1 Mar 27, 2026
8ec3756
test: add integration marker to test_dependency_isolation.py
planetf1 Mar 27, 2026
f6f49fc
docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unorde…
planetf1 Mar 27, 2026
7119f78
fix: suppress mypy name-defined for torch.Tensor after importorskip c…
planetf1 Mar 27, 2026
dbb5f11
fix: ruff format huggingface.py from_pretrained args
planetf1 Mar 27, 2026
9dfea0d
fix: ruff format test_watsonx.py and test_huggingface_tools.py
planetf1 Mar 27, 2026
d445f0c
refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isola…
planetf1 Mar 27, 2026
6148d8d
refactor: remove --ignore-*-check override flags from conftest
planetf1 Mar 27, 2026
22b29bb
refactor: remove requires_api_key marker; fix api backend group to ma…
planetf1 Mar 27, 2026
e1d79fb
fix: address review
ajbozarth Mar 27, 2026
b772cc4
test: mark test_image_block_in_instruction as qualitative
planetf1 Mar 28, 2026
c9b996d
chore: commit .claude/settings.json with skillLocations for skill dis…
planetf1 Mar 28, 2026
3d80a81
docs: broaden audit-markers skill description to cover diagnostic use…
planetf1 Mar 28, 2026
ec0254d
docs: add diagnostic mode to audit-markers skill for troubleshooting …
planetf1 Mar 28, 2026
902 changes: 902 additions & 0 deletions .agents/skills/audit-markers/SKILL.md

Large diffs are not rendered by default.

159 changes: 159 additions & 0 deletions .agents/skills/skill-author/SKILL.md
Contributor


I'm on the fence about whether this skill belongs in mellea proper or in a "useful skills" repo instead. I'm ok adding it here for now though

@@ -0,0 +1,159 @@
---
name: skill-author
description: >
  Draft, validate, and install new agent skills. Use when asked to create a new
  skill, automate a workflow, or add a capability. Produces cross-compatible
  SKILL.md files that work in both Claude Code and IBM Bob.
argument-hint: "[skill-name]"
compatibility: "Claude Code, IBM Bob"
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file]
---

# Skill Authoring Meta-Skill

Create new agent skills that work across Claude Code (CLI/IDE) and IBM Bob.

## Skill Location

Skills live under `.agents/skills/<name>/SKILL.md`.

Discovery configuration varies by tool:
- **Claude Code:** Add `"skillLocations": [".agents/skills"]` to `.claude/settings.json`.
Without this, Claude Code looks in `.claude/skills/` by default.
- **IBM Bob:** Discovers `.agents/skills/` natively per agentskills.io convention.

Both tools read the same `SKILL.md` format. Use the frontmatter schema below
to maximise compatibility.

## Workflow

1. **Name the skill** — kebab-case, max 64 chars (e.g. `api-tester`, `audit-markers`).

2. **Scaffold the directory:**
```
.agents/skills/<name>/
├── SKILL.md # Required — frontmatter + instructions
├── scripts/ # Optional — helper scripts
└── templates/ # Optional — output templates
```

3. **Write SKILL.md** — YAML frontmatter + markdown body (see schema below).

4. **Dry-run review** — mentally execute the skill against a realistic scenario
before finalising. Walk through the procedure on a concrete example (a real
file in the repo, not a hypothetical) and check for:
- **Scaling gaps:** Does the procedure work for 1 file AND 100 files? If the
skill accepts a directory or glob, it needs a triage strategy (e.g., "grep
first to find candidates, then deep-read only files with issues") — not
just "read every file fully."
- **Boundary ambiguity:** If the skill defines categories or classifications,
test the boundaries between adjacent categories with a real example. The
edges are where agents will disagree or ask the user. Sharpen definitions
until two agents reading the same test would classify it the same way.
- **Stale references:** If the skill describes project state ("this hook needs
to be added", "this marker is not yet registered"), verify those statements
are still true. Embed checks ("read conftest.py to confirm") rather than
assertions that rot.
- **Output format at scale:** Run the report template mentally against the
largest expected input. A per-function report for 5 files is fine; for 165
files it's unusable. Design output for the largest scope — summary table
first, per-item detail only where issues exist.
- **Format coverage:** If the skill operates on multiple input formats (e.g.,
`pytestmark` lists AND `# pytest:` comments), verify each format is
explicitly addressed in the procedure. Implicit coverage causes agents to
skip or guess.
- **Rigid rules:** If you wrote "always X" or "never Y", find the edge case
where the rule is wrong. Add the escape hatch. E.g., "per-function only"
should say "module-level is acceptable when every function qualifies."

5. **Validate:**
- Check the skill is discoverable: list files in `.agents/skills/`.
- Confirm no frontmatter warnings from the IDE.
- Verify the skill does not conflict with existing skills or `AGENTS.md`.

## SKILL.md Frontmatter Schema

Use only fields from the **cross-compatible** set to avoid IDE warnings.

### Cross-compatible fields (use these)

| Field | Type | Purpose |
|-------|------|---------|
| `name` | string | Kebab-case identifier. Becomes the `/slash-command`. Max 64 chars. |
| `description` | string | What the skill does and when to trigger it. Be specific — agents use this to decide whether to invoke the skill automatically. |
| `argument-hint` | string | Autocomplete hint. E.g. `"[file] [--dry-run]"`, `"[issue-number]"`. |
| `compatibility` | string | Which tools support this skill. E.g. `"Claude Code, IBM Bob"`. |
| `disable-model-invocation` | boolean | `true` = manual `/name` only, no auto-invocation. |
| `user-invocable` | boolean | `false` = hidden from `/` menu. Use for background knowledge skills. |
| `license` | string | SPDX identifier if publishing. E.g. `"Apache-2.0"`. |
| `metadata` | object | Free-form key-value pairs for tool-specific or custom fields. |

### Tool-specific fields (put under `metadata`)

These are useful but not universally supported — nest them under `metadata`:

```yaml
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file]  # Bob/agentskills.io
```

Claude Code's `allowed-tools` and `context`/`agent` fields are recognised by
Claude Code but may trigger warnings in Bob's validator. If needed, add them
to `metadata` or accept the warnings.

### Example frontmatter

```yaml
---
name: my-skill
description: >
  Does X when Y. Use when asked to Z.
argument-hint: "[target] [--flag]"
compatibility: "Claude Code, IBM Bob"
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file]
---
```

## SKILL.md Body Structure

After frontmatter, write clear markdown instructions the agent follows:

1. **Context section** — what the skill operates on, key reference files.
2. **Procedure** — numbered steps the agent follows. Be explicit about decisions and edge cases.
3. **Rules / constraints** — hard rules the agent must not break.
4. **Output format** — what the agent should produce (report, edits, summary).

### Guidelines

- **Be specific.** Vague instructions produce inconsistent results across models.
"Check if markers are correct" is worse than "Compare the test's assertions
to the qualitative decision rule in section 3."
- **Reference project files.** Point to docs, configs, and examples by relative
path so the agent can read them. E.g. "See `test/MARKERS_GUIDE.md` for the
full marker taxonomy."
- **Declare scope boundaries.** State what the skill does NOT do. E.g. "This
skill does not modify conftest.py — flag infrastructure issues as notes."
- **Use `$ARGUMENTS`** for user input. `$ARGUMENTS` is the full argument string;
`$1`, `$2` etc. are positional.
- **Keep SKILL.md under 500 lines.** Use supporting files for large reference
material (link to them from the body).
- **Portability:** use relative paths from the repo root, never absolute paths.
- **Formatting:** use YYYY-MM-DD for dates, 24-hour clock for times, metric units.
- **Design for variable scope.** If the skill can operate on a single file or an
entire directory, provide a triage strategy for the large case. Agents given
"audit everything" with no prioritisation will either read every file (slow)
or skip files (incomplete).
- **Sharpen category boundaries.** When defining classifications, the boundary
between adjacent categories causes the most disagreement. Add a "key
distinction from X" sentence for each pair of adjacent tiers.
- **Avoid temporal assertions.** Don't write "this conftest hook needs to be
added" — write "check whether conftest.py already has the hook." State that
goes stale silently is worse than no guidance at all.
- **Qualify absolutes.** "Always X" and "never Y" rules need escape hatches for
the common exception. E.g., "per-function only — unless every function in the
file qualifies, in which case module-level is acceptable."
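As a sketch of the `$ARGUMENTS` convention described above (the skill purpose, paths, and flags here are hypothetical placeholders, not real skills in this repo), a skill body might begin:

```markdown
Audit the test markers in $1 (default to `test/` if no argument is given).
If `--dry-run` appears anywhere in $ARGUMENTS, report proposed changes
without editing any files.
```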
52 changes: 52 additions & 0 deletions .agents/skills/skill-author/scripts/validate_skill.py
@@ -0,0 +1,52 @@
"""Validate SKILL.md frontmatter for agent skills."""

import json
import os
import sys

import yaml


def validate_skill(skill_path: str) -> dict:
    """Check that a skill directory has valid SKILL.md with required frontmatter keys."""
    skill_file = os.path.join(skill_path, "SKILL.md")

    if not os.path.exists(skill_file):
        return {"status": "error", "message": "Missing SKILL.md"}

    try:
        with open(skill_file) as f:
            # safe_load_all handles the --- delimiters correctly and won't
            # break on markdown horizontal rules later in the file.
            frontmatter = next(yaml.safe_load_all(f))

        if not isinstance(frontmatter, dict):
            return {"status": "error", "message": "Frontmatter is not a YAML mapping"}

        # Root-level required keys
        for key in ("name", "description"):
            if key not in frontmatter:
                return {"status": "error", "message": f"Missing root key: {key}"}

        # version lives under metadata (per skill-author guide)
        meta = frontmatter.get("metadata")
        if not isinstance(meta, dict) or "version" not in meta:
            return {
                "status": "error",
                "message": "Missing nested key: metadata.version",
            }

        return {"status": "success", "data": frontmatter}

    except yaml.YAMLError as e:
        return {"status": "error", "message": f"Invalid YAML: {e}"}
    except StopIteration:
        return {"status": "error", "message": "No YAML frontmatter found"}


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 validate_skill.py <skill-directory>", file=sys.stderr)
        sys.exit(1)
    result = validate_skill(sys.argv[1])
    print(json.dumps(result))
3 changes: 3 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,3 @@
{
  "skillLocations": [".agents/skills"]
}
3 changes: 2 additions & 1 deletion .gitignore
@@ -451,7 +451,8 @@ pyrightconfig.json

# AI agent configs
.bob/
.claude/
.claude/*
!.claude/settings.json

# Generated API documentation (built by tooling/docs-autogen/)
docs/docs/api/
77 changes: 36 additions & 41 deletions AGENTS.md
@@ -25,7 +25,6 @@ uv run pytest # Default: qualitative tests, skip slow te
uv run pytest -m "not qualitative" # Fast tests only (~2 min)
uv run pytest -m slow # Run only slow tests (>5 min)
uv run pytest --co -q # Run ALL tests including slow (bypass config)
uv run pytest --isolate-heavy # Enable GPU process isolation (opt-in)
uv run ruff format . # Format code
uv run ruff check . # Lint code
uv run mypy . # Type check
@@ -44,49 +43,44 @@ uv run mypy . # Type check
| `cli/` | CLI commands (`m serve`, `m alora`, `m decompose`, `m eval`) |
| `test/` | All tests (run from repo root) |
| `docs/examples/` | Example code (run as tests via pytest) |
| `.agents/skills/` | Agent skills ([agentskills.io](https://agentskills.io) standard) |
| `scratchpad/` | Experiments (git-ignored) |

## 3. Test Markers
All tests and examples use markers to indicate requirements. The test infrastructure automatically skips tests based on system capabilities.

**Backend Markers:**
- `@pytest.mark.ollama` — Requires Ollama running (local, lightweight)
- `@pytest.mark.huggingface` — Requires HuggingFace backend (local, heavy)
- `@pytest.mark.vllm` — Requires vLLM backend (local, GPU required)
- `@pytest.mark.openai` — Requires OpenAI API (requires API key)
- `@pytest.mark.watsonx` — Requires Watsonx API (requires API key)
- `@pytest.mark.litellm` — Requires LiteLLM backend

**Capability Markers:**
- `@pytest.mark.requires_gpu` — Requires GPU
- `@pytest.mark.requires_heavy_ram` — Requires 48GB+ RAM
- `@pytest.mark.requires_api_key` — Requires external API keys
- `@pytest.mark.qualitative` — LLM output quality tests (skipped in CI via `CICD=1`)
- `@pytest.mark.llm` — Makes LLM calls (needs at least Ollama)
- `@pytest.mark.slow` — Tests taking >5 minutes (skipped via `SKIP_SLOW=1`)

**Execution Strategy Markers:**
- `@pytest.mark.requires_gpu_isolation` — Requires OS-level process isolation to clear CUDA memory (use with `--isolate-heavy` or `CICD=1`)

**Examples in `docs/examples/`** use comment-based markers for clean code:
Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. The `unit` marker is auto-applied by conftest — never write it explicitly. The `llm` marker is deprecated; use `e2e` instead.

See **[test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md)** for the full marker reference (tier definitions, backend markers, resource gates, auto-skip logic, common patterns).

**Examples in `docs/examples/`** use comment-based markers:
```python
# pytest: ollama, llm, requires_heavy_ram
# pytest: e2e, ollama, qualitative
"""Example description..."""

# Your clean example code here
```

Tests/examples automatically skip if system lacks required resources. Heavy examples (e.g., HuggingFace) are skipped during collection to prevent memory issues.
⚠️ Don't add `qualitative` to trivial tests — keep the fast loop fast.
⚠️ Mark tests taking >1 minute with `slow`.

## 4. Agent Skills

Skills live in `.agents/skills/` following the [agentskills.io](https://agentskills.io) open standard. Each skill is a directory with a `SKILL.md` file (YAML frontmatter + markdown instructions).

**Tool discovery:**

**Default behavior:**
- `uv run pytest` skips slow tests (>5 min) but runs qualitative tests
- Use `pytest -m "not qualitative"` for fast tests only (~2 min)
- Use `pytest -m slow` or `pytest` (without config) to include slow tests
| Tool | Project skills | Global skills | Config needed |
| ----------------- | ----------------- | ------------------- | ------------------------------------------------------------------ |
| Claude Code | `.agents/skills/` | `~/.claude/skills/` | `"skillLocations": [".agents/skills"]` in `.claude/settings.json` |
| IBM Bob | `.bob/skills/` | `~/.bob/skills/` | Symlink: `.bob/skills` → `.agents/skills` |
| VS Code / Copilot | `.agents/skills/` | — | None (auto-discovered) |

⚠️ Don't add `qualitative` to trivial tests—keep the fast loop fast.
⚠️ Mark tests taking >5 minutes with `slow` (e.g., dataset loading, extensive evaluations).
**Bob users:** create the symlink once per clone:

## 4. Coding Standards
```bash
mkdir -p .bob && ln -s ../.agents/skills .bob/skills
```

**Available skills:** `/audit-markers`, `/skill-author`

## 5. Coding Standards
- **Types required** on all core functions
- **Docstrings are prompts** — be specific, the LLM reads them
- **Google-style docstrings** — `Args:` on the **class docstring only**; `__init__` gets a single summary sentence. Add `Attributes:` only when a stored value differs in type/behaviour from its constructor input (type transforms, computed values, class constants). See CONTRIBUTING.md for a full example.
Expand All @@ -96,37 +90,38 @@ Tests/examples automatically skip if system lacks required resources. Heavy exam
- **Friendly Dependency Errors**: Wraps optional backend imports in `try/except ImportError` with a helpful message (e.g., "Please pip install mellea[hf]"). See `mellea/stdlib/session.py` for examples.
- **Backend telemetry fields**: All backends must populate `mot.usage` (dict with `prompt_tokens`, `completion_tokens`, `total_tokens`), `mot.model` (str), and `mot.provider` (str) in their `post_processing()` method. Metrics are automatically recorded by `TokenMetricsPlugin` — don't add manual `record_token_usage_metrics()` calls.

## 5. Commits & Hooks
## 6. Commits & Hooks
[Angular format](https://github.com/angular/angular/blob/main/CONTRIBUTING.md#commit): `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `release:`

Pre-commit runs: ruff, mypy, uv-lock, codespell

## 6. Timing
## 7. Timing
> **Don't cancel**: `pytest` (full) and `pre-commit --all-files` may take minutes. Canceling mid-run can corrupt state.

## 7. Common Issues
## 8. Common Issues
| Problem | Fix |
|---------|-----|
| `ComponentParseError` | Add examples to docstring |
| `uv.lock` out of sync | Run `uv sync` |
| Ollama refused | Run `ollama serve` |
| Telemetry import errors | Run `uv sync` to install OpenTelemetry deps |

## 8. Self-Review (before notifying user)
## 9. Self-Review (before notifying user)
1. `uv run pytest test/ -m "not qualitative"` passes?
2. `ruff format` and `ruff check` clean?
3. New functions typed with concise docstrings?
4. Unit tests added for new functionality?
5. Avoided over-engineering?

## 9. Writing Tests
## 10. Writing Tests

- Place tests in `test/` mirroring source structure
- Name files `test_*.py` (required for pydocstyle)
- Use `gh_run` fixture for CI-aware tests (see `test/conftest.py`)
- Mark tests checking LLM output quality with `@pytest.mark.qualitative`
- If a test fails, fix the **code**, not the test (unless the test was wrong)

## 10. Writing Docs
## 11. Writing Docs

If you are modifying or creating pages under `docs/docs/`, follow the writing
conventions in [`docs/docs/guide/CONTRIBUTING.md`](docs/docs/guide/CONTRIBUTING.md).
@@ -144,7 +139,7 @@ Key rules that differ from typical Markdown habits:
mellea source; mark forward-looking content with `> **Coming soon:**`
- **No visible TODOs** — if content is missing, open a GitHub issue instead

## 11. Feedback Loop
## 12. Feedback Loop

Found a bug, workaround, or pattern? Update the docs:

5 changes: 5 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,5 @@
# Claude Code Directives
@AGENTS.md

## Execution
- If instructed to create a new capability, always invoke the `skill-author` meta-skill to ensure cross-compatibility.