In many cases, even if you can see what happened, it’s hard to reproduce the exact execution — especially when prompts, external tools, or timing are involved.
So you end up analyzing behavior rather than actually stepping through the same failure.
Curious how you think about reproducibility vs observability for AI debugging?