Update CLAUDE.md Generator to reflect research findings from Gloaguen et al. (2026) #86

@jwm4

Description

Summary

The paper Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev — ICML 2026) is the first rigorous empirical evaluation of context files (CLAUDE.md / AGENTS.md) on real-world coding tasks. The findings have direct implications for our claude-md-generator workflow and should be incorporated to ensure the files we help users create are evidence-backed, not just opinion-driven.

Key findings from the paper

| Finding | Detail |
| --- | --- |
| LLM-generated context files hurt performance | Across 4 agents and 2 benchmarks, auto-generated context files reduced task success rates by 0.5–2% on average |
| Developer-written files help only marginally | +4% average improvement, and only when manually authored |
| All context files increase cost | 20–23% higher inference cost due to more steps, more reasoning tokens, and broader exploration |
| Codebase overviews are ineffective | Despite being the most common recommendation, directory/structure overviews did not help agents find relevant files any faster |
| Context files are redundant with existing docs | Context files only helped when existing docs (README, docs/) were removed, meaning they mostly duplicate what's already discoverable |
| Instructions are followed but make tasks harder | Agents obey context-file instructions, but the extra constraints increase reasoning tokens by 14–22% |
| Only specific tooling info consistently helps | e.g., "use uv for deps", "run pytest": concrete, repository-specific tooling that the agent wouldn't guess on its own |
| Stronger models don't generate better context files | Using GPT-5.2 to generate context files didn't consistently outperform using the default model |

What the current workflow already gets right

The claude-md-generator already reflects several of these principles:

  • "Onboard, don't configure" — aligns with the "minimal requirements" finding
  • "Less is more" / under 300 lines, ideally under 60 — aligns with findings that bloat hurts
  • "Don't auto-generate it" / skip /init — directly supported by the data
  • "Don't use it as a linter" — unnecessary requirements make tasks harder
  • "Only universally applicable instructions" — aligns with minimal requirements
  • "Prefer pointers to copies" — reduces redundancy

Proposed changes

1. Add research backing to BEST_PRACTICES_CLAUDE.md

Add a "Research" or "Evidence" section citing the paper and summarizing the key numbers. This gives the advice authority beyond "best practice" — it's now empirically validated.

2. De-emphasize or reframe the "Structure" section in project template

The paper found that codebase overviews (listing directories and their purposes) do not help agents find relevant files faster. The current interview asks "Key directories and their purposes? (3-5 max)" and the project template includes a ## Structure section.

Options:

  • a) Remove the Structure section entirely and rely on progressive disclosure (BOOKMARKS.md)
  • b) Keep it but reframe: make it optional, shorter (2-3 dirs max), and explicitly warn it's for human readers, not agent navigation
  • c) Replace with a "Key entry points" section that points to 2-3 files an agent should start from (more useful than a directory tree)
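For illustration, option (c) could look something like the sketch below. The file paths are purely hypothetical placeholders, not taken from any actual repository:

```markdown
## Key entry points

- `src/cli.py`: CLI entry point; most tasks start here
- `src/core/engine.py`: main processing logic
- `tests/conftest.py`: shared test fixtures and helpers
```

A short list of starting files gives an agent an actionable first step, whereas a directory tree only describes layout it would discover on its own anyway.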

3. Strengthen emphasis on concrete tooling commands

The paper shows tooling-specific info (e.g., "use uv", "use pytest", repo-specific CLI tools) is the most consistently useful content. The Commands section already does this, but we should:

  • Make it the primary focus of the interview
  • Add a question about repo-specific tooling (custom CLIs, Makefiles, task runners)
  • Emphasize that this is the highest-signal content in the file
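A high-signal Commands section along these lines, built around the uv/pytest examples the paper cites (the lint line is an added assumption for illustration), might look like:

```markdown
## Commands

- Install deps: `uv sync`
- Run tests: `uv run pytest`
- Lint: `uv run ruff check .`
```

Every line here is concrete, copy-pasteable, and something the agent could not reliably guess, which is exactly the content category the paper found consistently useful.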

4. Add redundancy awareness

If a project already has a good README.md and docs/, the CLAUDE.md should be even shorter — potentially just commands and tooling. Add a question to the interview: "Does this repo already have a README/docs?" and adjust output length accordingly.
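The redundancy check could also be automated rather than asked. A minimal sketch, assuming the workflow can see the repo root (the function name and heuristics are illustrative, not part of the current workflow):

```python
from pathlib import Path


def has_existing_docs(repo_root: str) -> bool:
    """Heuristic: does this repo already have discoverable docs?

    If so, the generated CLAUDE.md should be shorter, since the
    paper found context files mostly duplicate README/docs content.
    """
    root = Path(repo_root)
    # Any README variant (README.md, README.rst, ...) counts.
    readmes = list(root.glob("README*"))
    # A docs/ directory counts as existing documentation.
    docs_dir = root / "docs"
    return bool(readmes) or docs_dir.is_dir()
```

When this returns true, the interview could skip overview-style questions entirely and generate only the commands/tooling section.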

5. Add cost awareness messaging

A CLAUDE.md has a measurable cost: roughly 20–23% higher inference spend per task. The workflow should communicate this to users — "an unnecessary context file costs you ~20% more per task" is a stronger motivator than "keep it short."
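To make that messaging concrete, the workflow could surface a rough dollar estimate. A tiny hypothetical helper using the paper's ~20% lower-bound figure (the function and its defaults are assumptions, not existing workflow code):

```python
def estimated_context_file_overhead(
    tasks_per_month: int,
    cost_per_task: float,
    overhead: float = 0.20,  # lower bound of the 20-23% range from the paper
) -> float:
    """Rough extra monthly spend attributable to a context file."""
    return tasks_per_month * cost_per_task * overhead
```

For example, at 1,000 agent tasks per month averaging $0.50 each, an unnecessary context file adds on the order of $100/month.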

6. Update the "Don't auto-generate" advice

Currently: "Skip /init and follow this guide."
Improved: "Skip /init. Research shows auto-generated context files reduce task success by up to 2% while increasing cost by 20%+. Human-authored minimal files outperform LLM-generated ones."

7. Consider adding a "lint" or audit checklist

Post-generation, offer a quick audit:

  • Is every line relevant to every task? (not just some tasks)
  • Does this duplicate information already in README.md or docs/?
  • Are commands concrete and copy-pasteable?
  • Is the Structure section actually needed? (agents explore on their own)
  • Is the total under 60 lines?
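The mechanical parts of this checklist could be scripted. A minimal sketch, assuming the audit runs on the generated file's text (function name, warning strings, and the 60-line threshold from the guide are illustrative choices):

```python
def audit_claude_md(text: str, max_lines: int = 60) -> list[str]:
    """Return audit warnings for a generated CLAUDE.md.

    Covers only the mechanically checkable items: total length
    and the presence of a Structure section. Relevance and
    redundancy with README/docs still need human review.
    """
    lines = text.splitlines()
    warnings = []
    if len(lines) > max_lines:
        warnings.append(
            f"File is {len(lines)} lines; aim for under {max_lines}."
        )
    if any(l.strip().lower().startswith("## structure") for l in lines):
        warnings.append(
            "Structure section present; agents explore on their own."
        )
    return warnings
```

The subjective checks (is every line relevant to every task? are commands copy-pasteable?) stay in the human-facing checklist; the script only flags what it can verify.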

Files likely affected

  • workflows/claude-md-generator/.ambient/ambient.json (systemPrompt, description)
  • workflows/claude-md-generator/BEST_PRACTICES_CLAUDE.md
  • workflows/claude-md-generator/.claude/templates/project-template.md
  • workflows/claude-md-generator/README.md

Acceptance criteria

  • BEST_PRACTICES_CLAUDE.md cites the paper and incorporates findings
  • systemPrompt interview flow reflects research (tooling-first, structure-optional)
  • Project template updated to de-emphasize overview, emphasize tooling
  • Redundancy awareness added to interview
  • Cost awareness messaging added
  • README updated to reflect changes
  • No regressions to the personal CLAUDE.md flow (paper focused on project/repo context files)
