test: agent skills infrastructure and marker taxonomy audit (#727, #728) #742
Open: planetf1 wants to merge 42 commits into generative-computing:main from planetf1:test/audit-markers-727-728
Commits
53d4a06 test: add granularity marker taxonomy infrastructure (#727) (planetf1)
4ea0c50 test: add audit-markers skill for test classification (#728) (planetf1)
4f248dc chore: add CLAUDE.md and agent skills infrastructure (planetf1)
9c82f82 test: improve audit-markers skill quality and add resource predicates (planetf1)
4f1db52 chore: remove issue references from audit-markers skill (planetf1)
fc72f3f docs: align MARKERS_GUIDE.md with predicate factory pattern (planetf1)
845c6ad fix: validate_skill.py schema mismatch and brittle YAML parsing (planetf1)
62d4bbd fix: migrate deprecated llm markers to e2e, add backend registry, upd… (planetf1)
c3a5651 feat: add estimate-vram skill and fix MPS VRAM detection (planetf1)
6a3d6f3 refactor: fold estimate-vram into audit-markers skill (planetf1)
3288541 docs: drop isolation refs and fix RAM guidance in markers docs (planetf1)
914502d docs: add legacy marker guidance for example files in audit-markers s… (planetf1)
ab8ad75 refactor: remove require_ollama() predicate — redundant with backend … (planetf1)
be39488 refactor: replace requires_heavy_ram gate with huggingface backend ma… (planetf1)
ab8a20f refactor: replace ad-hoc bedrock skipif with require_api_key predicate (planetf1)
c0c004e refactor: migrate legacy resource markers to predicates (planetf1)
01fdc1e test: skip collection gracefully when optional backend deps are missing (planetf1)
c6d565e test: refine integration marker definition and apply audit fixes (planetf1)
7ccf182 test: add importorskip guards and optional-dep skip logic for examples (planetf1)
32f1f9b fix: convert example import errors to skips; add cpex importorskip gu… (planetf1)
3e6ec88 test: skip OTel-dependent tests when opentelemetry not installed (planetf1)
c6fbfb6 fix: use conservative heuristic for Apple Silicon GPU memory detection (planetf1)
8cee781 test: add training memory signals to audit-markers skill; bump alora … (planetf1)
28808ff fix: cache system capabilities result in examples conftest (planetf1)
7f05eb8 fix: cache get_system_capabilities() result in test/conftest.py (planetf1)
66d35f0 fix: flush MPS memory pool in intrinsic test fixture teardown (planetf1)
355154f fix: load LocalHFBackend model in config dtype to prevent float32 upc… (planetf1)
601162c test: remove --isolate-heavy process isolation and bump intrinsic VRA… (planetf1)
58d2692 test: migrate legacy markers in test_intrinsics_formatters.py (planetf1)
8ec3756 test: add integration marker to test_dependency_isolation.py (planetf1)
f6f49fc docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unorde… (planetf1)
7119f78 fix: suppress mypy name-defined for torch.Tensor after importorskip c… (planetf1)
dbb5f11 fix: ruff format huggingface.py from_pretrained args (planetf1)
9dfea0d fix: ruff format test_watsonx.py and test_huggingface_tools.py (planetf1)
d445f0c refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isola… (planetf1)
6148d8d refactor: remove --ignore-*-check override flags from conftest (planetf1)
22b29bb refactor: remove requires_api_key marker; fix api backend group to ma… (planetf1)
e1d79fb fix: address review (ajbozarth)
b772cc4 test: mark test_image_block_in_instruction as qualitative (planetf1)
c9b996d chore: commit .claude/settings.json with skillLocations for skill dis… (planetf1)
3d80a81 docs: broaden audit-markers skill description to cover diagnostic use… (planetf1)
ec0254d docs: add diagnostic mode to audit-markers skill for troubleshooting … (planetf1)
@@ -0,0 +1,159 @@
+---
+name: skill-author
+description: >
+  Draft, validate, and install new agent skills. Use when asked to create a new
+  skill, automate a workflow, or add a capability. Produces cross-compatible
+  SKILL.md files that work in both Claude Code and IBM Bob.
+argument-hint: "[skill-name]"
+compatibility: "Claude Code, IBM Bob"
+metadata:
+  version: "2026-03-25"
+  capabilities: [bash, read_file, write_file]
+---
+
+# Skill Authoring Meta-Skill
+
+Create new agent skills that work across Claude Code (CLI/IDE) and IBM Bob.
+
+## Skill Location
+
+Skills live under `.agents/skills/<name>/SKILL.md`.
+
+Discovery configuration varies by tool:
+- **Claude Code:** Add `"skillLocations": [".agents/skills"]` to `.claude/settings.json`.
+  Without this, Claude Code looks in `.claude/skills/` by default.
+- **IBM Bob:** Discovers `.agents/skills/` natively per agentskills.io convention.
+
+Both tools read the same `SKILL.md` format. Use the frontmatter schema below
+to maximise compatibility.
+
+## Workflow
+
+1. **Name the skill** — kebab-case, max 64 chars (e.g. `api-tester`, `audit-markers`).
+
+2. **Scaffold the directory:**
+   ```
+   .agents/skills/<name>/
+   ├── SKILL.md       # Required — frontmatter + instructions
+   ├── scripts/       # Optional — helper scripts
+   └── templates/     # Optional — output templates
+   ```
+
+3. **Write SKILL.md** — YAML frontmatter + markdown body (see schema below).
+
+4. **Dry-run review** — mentally execute the skill against a realistic scenario
+   before finalising. Walk through the procedure on a concrete example (a real
+   file in the repo, not a hypothetical) and check for:
+   - **Scaling gaps:** Does the procedure work for 1 file AND 100 files? If the
+     skill accepts a directory or glob, it needs a triage strategy (e.g., "grep
+     first to find candidates, then deep-read only files with issues") — not
+     just "read every file fully."
+   - **Boundary ambiguity:** If the skill defines categories or classifications,
+     test the boundaries between adjacent categories with a real example. The
+     edges are where agents will disagree or ask the user. Sharpen definitions
+     until two agents reading the same test would classify it the same way.
+   - **Stale references:** If the skill describes project state ("this hook needs
+     to be added", "this marker is not yet registered"), verify those statements
+     are still true. Embed checks ("read conftest.py to confirm") rather than
+     assertions that rot.
+   - **Output format at scale:** Run the report template mentally against the
+     largest expected input. A per-function report for 5 files is fine; for 165
+     files it's unusable. Design output for the largest scope — summary table
+     first, per-item detail only where issues exist.
+   - **Format coverage:** If the skill operates on multiple input formats (e.g.,
+     `pytestmark` lists AND `# pytest:` comments), verify each format is
+     explicitly addressed in the procedure. Implicit coverage causes agents to
+     skip or guess.
+   - **Rigid rules:** If you wrote "always X" or "never Y", find the edge case
+     where the rule is wrong. Add the escape hatch. E.g., "per-function only"
+     should say "module-level is acceptable when every function qualifies."
+
+5. **Validate:**
+   - Check the skill is discoverable: list files in `.agents/skills/`.
+   - Confirm no frontmatter warnings from the IDE.
+   - Verify the skill does not conflict with existing skills or `AGENTS.md`.
+
+## SKILL.md Frontmatter Schema
+
+Use only fields from the **cross-compatible** set to avoid IDE warnings.
+
+### Cross-compatible fields (use these)
+
+| Field | Type | Purpose |
+|-------|------|---------|
+| `name` | string | Kebab-case identifier. Becomes the `/slash-command`. Max 64 chars. |
+| `description` | string | What the skill does and when to trigger it. Be specific — agents use this to decide whether to invoke the skill automatically. |
+| `argument-hint` | string | Autocomplete hint. E.g. `"[file] [--dry-run]"`, `"[issue-number]"`. |
+| `compatibility` | string | Which tools support this skill. E.g. `"Claude Code, IBM Bob"`. |
+| `disable-model-invocation` | boolean | `true` = manual `/name` only, no auto-invocation. |
+| `user-invocable` | boolean | `false` = hidden from `/` menu. Use for background knowledge skills. |
+| `license` | string | SPDX identifier if publishing. E.g. `"Apache-2.0"`. |
+| `metadata` | object | Free-form key-value pairs for tool-specific or custom fields. |
+
+### Tool-specific fields (put under `metadata`)
+
+These are useful but not universally supported — nest them under `metadata`:
+
+```yaml
+metadata:
+  version: "2026-03-25"
+  capabilities: [bash, read_file, write_file]  # Bob/agentskills.io
+```
+
+Claude Code's `allowed-tools` and `context`/`agent` fields are recognised by
+Claude Code but may trigger warnings in Bob's validator. If needed, add them
+to `metadata` or accept the warnings.
+
+### Example frontmatter
+
+```yaml
+---
+name: my-skill
+description: >
+  Does X when Y. Use when asked to Z.
+argument-hint: "[target] [--flag]"
+compatibility: "Claude Code, IBM Bob"
+metadata:
+  version: "2026-03-25"
+  capabilities: [bash, read_file, write_file]
+---
+```
+
+## SKILL.md Body Structure
+
+After frontmatter, write clear markdown instructions the agent follows:
+
+1. **Context section** — what the skill operates on, key reference files.
+2. **Procedure** — numbered steps the agent follows. Be explicit about decisions and edge cases.
+3. **Rules / constraints** — hard rules the agent must not break.
+4. **Output format** — what the agent should produce (report, edits, summary).
+
+### Guidelines
+
+- **Be specific.** Vague instructions produce inconsistent results across models.
+  "Check if markers are correct" is worse than "Compare the test's assertions
+  to the qualitative decision rule in section 3."
+- **Reference project files.** Point to docs, configs, and examples by relative
+  path so the agent can read them. E.g. "See `test/MARKERS_GUIDE.md` for the
+  full marker taxonomy."
+- **Declare scope boundaries.** State what the skill does NOT do. E.g. "This
+  skill does not modify conftest.py — flag infrastructure issues as notes."
+- **Use `$ARGUMENTS`** for user input. `$ARGUMENTS` is the full argument string;
+  `$1`, `$2` etc. are positional.
+- **Keep SKILL.md under 500 lines.** Use supporting files for large reference
+  material (link to them from the body).
+- **Portability:** use relative paths from the repo root, never absolute paths.
+- **Formatting:** use YYYY-MM-DD for dates, 24-hour clock for times, metric units.
+- **Design for variable scope.** If the skill can operate on a single file or an
+  entire directory, provide a triage strategy for the large case. Agents given
+  "audit everything" with no prioritisation will either read every file (slow)
+  or skip files (incomplete).
+- **Sharpen category boundaries.** When defining classifications, the boundary
+  between adjacent categories causes the most disagreement. Add a "key
+  distinction from X" sentence for each pair of adjacent tiers.
+- **Avoid temporal assertions.** Don't write "this conftest hook needs to be
+  added" — write "check whether conftest.py already has the hook." State that
+  goes stale silently is worse than no guidance at all.
+- **Qualify absolutes.** "Always X" and "never Y" rules need escape hatches for
+  the common exception. E.g., "per-function only — unless every function in the
+  file qualifies, in which case module-level is acceptable."
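The naming rule and the cross-compatible field table in the guide above lend themselves to a mechanical check. The following is a minimal sketch, not part of this PR: `check_frontmatter`, `KEBAB_CASE`, and the exact reading of "kebab-case" (lowercase alphanumeric words joined by single hyphens) are assumptions made for illustration. It takes frontmatter that has already been parsed into a dict, so it needs only the standard library.

```python
import re

# Field names copied from the "Cross-compatible fields" table in the guide.
CROSS_COMPATIBLE_FIELDS = {
    "name",
    "description",
    "argument-hint",
    "compatibility",
    "disable-model-invocation",
    "user-invocable",
    "license",
    "metadata",
}

# Assumption: kebab-case means lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_NAME_LENGTH = 64  # "max 64 chars" from the guide


def check_frontmatter(frontmatter: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    name = frontmatter.get("name")
    if not isinstance(name, str) or not KEBAB_CASE.fullmatch(name):
        problems.append(f"name is not kebab-case: {name!r}")
    elif len(name) > MAX_NAME_LENGTH:
        problems.append(f"name is longer than {MAX_NAME_LENGTH} characters")
    for key in frontmatter:
        # Top-level keys outside the table may trigger validator warnings.
        if key not in CROSS_COMPATIBLE_FIELDS:
            problems.append(f"field outside the cross-compatible set: {key}")
    return problems


if __name__ == "__main__":
    example = {
        "name": "my-skill",
        "description": "Does X when Y. Use when asked to Z.",
        "allowed-tools": ["bash"],
    }
    for problem in check_frontmatter(example):
        print(problem)
```

Run as a script, the example reports `allowed-tools` as outside the cross-compatible set, which matches the guide's advice to nest such fields under `metadata` or accept the warnings.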
@@ -0,0 +1,52 @@
+"""Validate SKILL.md frontmatter for agent skills."""
+
+import json
+import os
+import sys
+
+import yaml
+
+
+def validate_skill(skill_path: str) -> dict:
+    """Check that a skill directory has valid SKILL.md with required frontmatter keys."""
+    skill_file = os.path.join(skill_path, "SKILL.md")
+
+    if not os.path.exists(skill_file):
+        return {"status": "error", "message": "Missing SKILL.md"}
+
+    try:
+        with open(skill_file) as f:
+            # safe_load_all handles the --- delimiters correctly and won't
+            # break on markdown horizontal rules later in the file.
+            frontmatter = next(yaml.safe_load_all(f))
+
+        if not isinstance(frontmatter, dict):
+            return {"status": "error", "message": "Frontmatter is not a YAML mapping"}
+
+        # Root-level required keys
+        for key in ("name", "description"):
+            if key not in frontmatter:
+                return {"status": "error", "message": f"Missing root key: {key}"}
+
+        # version lives under metadata (per skill-author guide)
+        meta = frontmatter.get("metadata")
+        if not isinstance(meta, dict) or "version" not in meta:
+            return {
+                "status": "error",
+                "message": "Missing nested key: metadata.version",
+            }
+
+        return {"status": "success", "data": frontmatter}
+
+    except yaml.YAMLError as e:
+        return {"status": "error", "message": f"Invalid YAML: {e}"}
+    except StopIteration:
+        return {"status": "error", "message": "No YAML frontmatter found"}
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage: python3 validate_skill.py <skill-directory>", file=sys.stderr)
+        sys.exit(1)
+    result = validate_skill(sys.argv[1])
+    print(json.dumps(result))
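As written, the script prints one JSON object per run and only exits non-zero when the argument is missing, so a caller that wants a pass/fail signal has to inspect the `status` field itself. The sketch below shows one way a wrapper might do that. `summarise` and the sample directory names are hypothetical and not part of this PR; only the `status` and `message` keys come from the script above.

```python
import json


def summarise(results: dict[str, dict]) -> tuple[list[str], int]:
    """Turn {skill_dir: result} into report lines and a process exit code."""
    lines = []
    exit_code = 0
    for skill_dir, result in sorted(results.items()):
        if result.get("status") == "success":
            lines.append(f"ok    {skill_dir}")
        else:
            # Anything other than "success" is treated as a failure.
            exit_code = 1
            lines.append(f"FAIL  {skill_dir}: {result.get('message', 'no message')}")
    return lines, exit_code


if __name__ == "__main__":
    # Hand-written sample output; the directory names are hypothetical.
    raw = {
        ".agents/skills/good-skill": '{"status": "success", "data": {"name": "good-skill"}}',
        ".agents/skills/bad-skill": '{"status": "error", "message": "Missing SKILL.md"}',
    }
    lines, code = summarise({path: json.loads(text) for path, text in raw.items()})
    print("\n".join(lines))
    print("exit code:", code)
```

Keeping the summary separate from the validator leaves `validate_skill` free to stay a pure function that returns a dict, which is how the script above is structured.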
@@ -0,0 +1,3 @@
+{
+  "skillLocations": [".agents/skills"]
+}
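The skill-author guide's validation step asks for a discoverability check by listing the files under `.agents/skills/`. A rough sketch of that check, driven by the `skillLocations` key in the settings file above, might look like the following. `list_skills` is a hypothetical helper; it only mirrors the `<location>/<name>/SKILL.md` layout described in the guide and makes no claim about how Claude Code or IBM Bob resolve skills internally.

```python
import json
from pathlib import Path


def list_skills(repo_root: Path, settings_file: str = ".claude/settings.json") -> list[str]:
    """Return repo-relative paths of every SKILL.md under the configured locations."""
    settings_path = repo_root / settings_file
    if not settings_path.exists():
        return []
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    found = []
    for location in settings.get("skillLocations", []):
        # Layout from the guide: <location>/<name>/SKILL.md
        for skill_file in sorted((repo_root / location).glob("*/SKILL.md")):
            found.append(skill_file.relative_to(repo_root).as_posix())
    return found


if __name__ == "__main__":
    for path in list_skills(Path(".")):
        print(path)
```

Run from the repository root, it prints one line per discovered skill. An empty result suggests either a missing settings file or a skill directory without a `SKILL.md`.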
@@ -0,0 +1,5 @@
+# Claude Code Directives
+@AGENTS.md
+
+## Execution
+- If instructed to create a new capability, strictly trigger the `skill-author` meta-skill to ensure cross-compatibility.
I'm on the fence about whether this skill belongs in mellea proper rather than in a separate "useful skills" repo. I'm OK with adding it here for now, though.