What's in it:
- task sizing: auto-classifies each task from XS to XL based on its description, then runs PERT on that tier
- human-equivalent comparison: a per-task-type multiplier, so you can see the speedup
- METR p80 thresholds: warns when an estimate exceeds a model's reliability horizon
- wave planning: schedules independent tasks in parallel across a multi-agent fleet
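
To make the PERT and wave-planning items concrete, here is a minimal sketch of both ideas. It is not agent-estimate's actual code or API: the `Task` class, `plan_waves`, `wall_clock`, and the example numbers are all hypothetical, and it assumes the classic three-point PERT formula and enough agents to run every task in a wave at once.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical structure for illustration; not the tool's real data model.
    name: str
    optimistic: float   # minutes, best case
    likely: float       # minutes, most likely case
    pessimistic: float  # minutes, worst case
    deps: set = field(default_factory=set)  # names of tasks that must finish first

    @property
    def expected(self) -> float:
        # Classic PERT three-point mean: (O + 4M + P) / 6
        return (self.optimistic + 4 * self.likely + self.pessimistic) / 6

    @property
    def stddev(self) -> float:
        # Classic PERT spread: (P - O) / 6
        return (self.pessimistic - self.optimistic) / 6


def plan_waves(tasks):
    """Group tasks into waves; each task's deps all sit in earlier waves."""
    remaining = {t.name: t for t in tasks}
    done = set()
    waves = []
    while remaining:
        ready = [t for t in remaining.values() if t.deps <= done]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for t in ready:
            done.add(t.name)
            del remaining[t.name]
    return waves


def wall_clock(waves):
    # Assumes one agent per task, so a wave lasts as long as its slowest task.
    return sum(max(t.expected for t in wave) for wave in waves)


tasks = [
    Task("api", 10, 20, 60),
    Task("ui", 5, 15, 25),
    Task("docs", 5, 10, 15, deps={"api", "ui"}),
]
waves = plan_waves(tasks)
serial = sum(t.expected for t in tasks)
parallel = wall_clock(waves)
```

In this made-up example, `api` and `ui` land in the first wave and `docs` in the second, so the parallel plan is bounded by the slowest task in each wave rather than by the sum of all three.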
The estimation data comes from my daily coding tasks over the past few weeks:
- per-runtime calibration: Opus 4.7, GPT-5.5; different models have different reliability horizons and costs
- per-task-type priors: backend, frontend, app development, docs, and brainstorm
- PR review: I usually let Codex and Claude Code review each other's code, and the tool takes that into account
- a calibration loop that keeps me honest: dispatch data is validated at the end of each day by my coordinator agent
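
One way to picture the calibration loop is as a correction factor per (runtime, task type) bucket, learned from validated dispatches. The sketch below is an assumption about how such a loop could work, not a description of the tool's internals: the function names, the tuple layout, the bucket labels, and the choice of a median ratio are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def calibration_factors(dispatches):
    """dispatches: iterable of (runtime, task_type, estimated_min, actual_min)."""
    ratios = defaultdict(list)
    for runtime, task_type, estimated, actual in dispatches:
        if estimated > 0:  # skip rows that would divide by zero
            ratios[(runtime, task_type)].append(actual / estimated)
    # Median is an illustrative choice: one runaway dispatch should not
    # drag the whole bucket's correction factor with it.
    return {key: median(vals) for key, vals in ratios.items()}


def calibrated(estimate, runtime, task_type, factors):
    # Fall back to 1.0 (no correction) when a bucket has no history yet.
    return estimate * factors.get((runtime, task_type), 1.0)


# Made-up log entries with placeholder labels, purely for illustration.
log = [
    ("runtime-a", "backend", 20, 10),
    ("runtime-a", "backend", 30, 15),
    ("runtime-a", "backend", 10, 8),
]
factors = calibration_factors(log)
adjusted = calibrated(40, "runtime-a", "backend", factors)
untouched = calibrated(40, "runtime-b", "docs", factors)
```

With these invented numbers the backend bucket has consistently finished in about half the estimated time, so a new 40-minute backend estimate is scaled down, while a bucket with no history is left unchanged.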
Try it: pip install agent-estimate, read the code at https://github.com/kiloloop/agent-estimate/ or the writeup at https://kiloloop.com/agent-estimate/
If you ask an AI agent how long a task will take, it answers in human time, so the gut number is "that's an afternoon" when the real answer is 20 minutes. I have been running this for a few months now, across a few hundred real dispatches. With better estimation, my morning plan and my end-of-day reality mostly line up, and when they don't, it's usually because something finished faster than I guessed, not slower.