As an aside, 4o-mini came out months before agent skills were released… I’m curious how it does at choosing to load skills in the first place?
> Activation: When a task matches a skill’s description, the agent reads the full SKILL.md instructions into context.[1]
> Full instructions load only when a task calls for them, so agents can keep many skills on hand with only a small context footprint.
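In other words, the mechanism being described is roughly this (a minimal sketch; `Skill`, `build_skill_index`, the frontmatter parsing, and the `matches` callback are illustrative assumptions, not the actual implementation):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Skill:
    name: str
    description: str   # short blurb, always kept in context
    path: Path         # full SKILL.md, loaded lazily

def build_skill_index(skills_dir: Path) -> list[Skill]:
    """Only name + description go into the prompt up front, so the
    per-skill context footprint stays small."""
    index = []
    for skill_md in skills_dir.glob("*/SKILL.md"):
        text = skill_md.read_text()
        # assume a simple 'description:' line in the frontmatter
        desc = next((line.split(":", 1)[1].strip()
                     for line in text.splitlines()
                     if line.startswith("description:")), "")
        index.append(Skill(skill_md.parent.name, desc, skill_md))
    return index

def activate(task: str, skills: list[Skill], matches) -> str:
    """When a skill's description matches the task, the full SKILL.md
    body is read into context; in practice the model itself judges the match."""
    for skill in skills:
        if matches(task, skill.description):
            return skill.path.read_text()
    return ""
```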
For Gemini it seems to always pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models.
Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.
[0] https://developers.googleblog.com/closing-the-knowledge-gap-...
That training on existing models is what brings out various other things about other models; then there are models that are just like snowballs, where you build one iteration, give it its identity, then train on that with the same synthetic generation.
So a model's generations could at some point include its own name.
Synthetic data is generated by other models, and yes this is often where identity propagates.
I think with the snowballing you mean something like iterative self-distillation? That’s definitely not done unsupervised, because of the risk of model collapse; the data is typically heavily curated and/or mixed with real data.
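A toy sketch of one such curated, mixed round (purely illustrative; `generate`, `passes_filter`, and the mixing ratio are assumptions, not anyone's published recipe):

```python
import random

def distillation_round(model, real_corpus, generate, passes_filter,
                       n_synthetic=10_000, synthetic_fraction=0.3):
    """One 'snowball' iteration: sample from the current model, curate the
    output, then mix it with real data so the next model isn't trained
    purely on its own generations (the model-collapse risk)."""
    synthetic = [s for s in (generate(model) for _ in range(n_synthetic))
                 if passes_filter(s)]                      # heavy curation step
    n_real = int(len(synthetic) * (1 - synthetic_fraction) / synthetic_fraction)
    mixed = synthetic + random.sample(real_corpus, min(n_real, len(real_corpus)))
    random.shuffle(mixed)
    return mixed  # training set for the next iteration
```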
The report leaves out a lot of detail. Several changes I found useful were:
- pairing the with/without runs on the same screen as left/right for easier viewing
- token count consumed by the skill, tokens used per run, time, pass rate, estimated cost
- detailed aggregate stats
- a parsed version of the conversation log (capturing the jsonl with each run; sometimes reading the log is the only way to find out why it's screwing up)
- work output logging (in my case screenshots and outputted script code)
- better formatting (syntax highlighting, log formatting)
Finally, I think the most useful thing was adding a self-reflection pass. After an eval is done, another agent looks at everything from that eval and tries to work out what went wrong along the way and what should be added to the skill, and conversely, from the without-skill run, what was in the skill that didn't need to be there. It produces a skill-change recommendation file for each eval. A further summary agent aggregates all those recommendations in a way I can feed back to an agent.
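Conceptually the reflection pass is just this (a rough sketch; `run_agent` and the file layout are my own stand-ins, not part of agent-skills-eval):

```python
import json
from pathlib import Path

REFLECT_PROMPT = """You are reviewing one eval run of a skill.
Given the with-skill and without-skill transcripts and the pass/fail result,
list what should be added to the skill and what in the skill was unnecessary.
Return a short markdown recommendation."""

def reflect_on_eval(run_dir: Path, run_agent) -> Path:
    """Per-eval reflection: feed both transcripts to a reviewing agent and
    write a skill-change recommendation file next to the run artifacts."""
    with_log = (run_dir / "with_skill.jsonl").read_text()
    without_log = (run_dir / "without_skill.jsonl").read_text()
    result = json.loads((run_dir / "result.json").read_text())
    rec = run_agent(REFLECT_PROMPT, context={
        "with_skill": with_log, "without_skill": without_log, "result": result})
    out = run_dir / "skill_recommendation.md"
    out.write_text(rec)
    return out

def aggregate_recommendations(eval_root: Path, run_agent) -> str:
    """Summary agent: merge all per-eval recommendations into one set of
    proposed skill edits that can be fed back to an editing agent."""
    recs = "\n\n---\n\n".join(
        p.read_text() for p in sorted(eval_root.glob("*/skill_recommendation.md")))
    return run_agent("Merge these per-eval skill recommendations into one "
                     "deduplicated list of proposed skill edits.",
                     context={"recs": recs})
```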
> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")
even though I have in CLAUDE.md:
> For database queries, use tidewave first.
I then prompted:
> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that
> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.
If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefit other markdown files could bring. I don't trust Opus's own explanation, but it could point to the fact that the Bash tool's description in the system prompt is much longer than the CLAUDE.md line about tidewave.
While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval, given that they're the only objective measure of compliance.
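A tool-call assertion boils down to a check over the run's jsonl transcript, roughly like this (the event field names are assumptions; agent-skills-eval's actual log schema may differ):

```python
import json
from pathlib import Path

def tool_calls(transcript: Path) -> list[str]:
    """Extract tool names from a jsonl run log (assumed schema:
    one JSON object per line with 'type' and 'tool' fields)."""
    names = []
    for line in transcript.read_text().splitlines():
        event = json.loads(line)
        if event.get("type") == "tool_call":
            names.append(event.get("tool", ""))
    return names

def assert_tool_used(transcript: Path, expected: str, forbidden: str | None = None):
    """Objective compliance check: if the skill says 'use tidewave first',
    fail the run when tidewave never appears, or when a forbidden tool does."""
    calls = tool_calls(transcript)
    assert expected in calls, f"expected a {expected} call, got {calls}"
    if forbidden is not None:
        assert forbidden not in calls, f"run used forbidden tool {forbidden}"
```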
I think it's better to have a repo-level skill instead, titled something like "connecting_to_db.md" and demonstrate exactly how to connect. Codex has been pretty good at referring to skills but it depends on context at the end of the day.
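Something like this usually does it (a minimal sketch; the frontmatter follows the common SKILL.md convention, and the tidewave invocation is a placeholder rather than the real tool syntax):

```markdown
---
name: connecting-to-db
description: How to run database queries in this repo (use tidewave, not psql)
---

# Connecting to the database

- Always use tidewave for queries; do NOT shell out to psql or grep .env.
- Example (placeholder invocation, adjust to the actual tidewave tool):
  ask tidewave to run: SELECT * FROM users WHERE id = 42;
- If tidewave is unavailable, stop and ask rather than falling back to psql.
```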
So, at least heuristically, it should know _why_ it ignored whatever and hopefully pull the correct anti-matter context. It took about two repetitions of this to get it to use pg-promise instead of psql for queries. I assume the longer the context goes on, the less likely any of this priming works.
Use a different harness
The same approach is useful for everything: model, params, prompts, sub-agents, skills, RAG, etc.?
This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not super convinced skills are that great yet. I'm trying to get better at developing them in my workflow, but still get a lot of results where they are ignored even after spending time trying to tighten them up.