Semble is our solution for this. It combines static Model2Vec embeddings (using our latest static model: potion-code-16M) with BM25, fused via RRF and reranked with code-aware signals. Everything runs on CPU since there are no transformers involved. On our benchmark of ~1250 query/document pairs across 63 repos and 19 languages, it uses 98% fewer tokens than grep+read and reaches 99% of the retrieval quality of a 137M-parameter code-trained transformer, while being ~200x faster.
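For readers unfamiliar with RRF (Reciprocal Rank Fusion), here is a minimal sketch of the general technique. It is not taken from Semble's code: the constant `k=60` is the value commonly used in the RRF literature, and the file names and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from an embedding search and a BM25 search.
dense = ["auth.py", "session.py", "utils.py"]
bm25 = ["auth.py", "db.py", "session.py"]
fused = rrf_fuse([dense, bm25])
print(fused)
```

Documents that appear near the top of both lists end up first, which is why RRF needs no score normalization between the two retrievers.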
Main features:
- Token-efficient: 98% fewer tokens than grep+read
- Fast: ~250ms to index a typical repo on our benchmark, ~1.5ms per query on CPU (very large repos may take longer)
- Accurate: 0.854 NDCG@10, 99% of the best transformer setup we tested
- MCP server: drop-in for Claude Code, Cursor, Codex, OpenCode
- Zero config: no API keys, no GPU, no external services
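For context on the NDCG@10 number in the list above, here is a small sketch of how the metric is usually computed. It assumes linear gain and that the input list holds the relevance labels of all judged documents in ranked order; the benchmark's exact variant may differ.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with one relevant document, returned at rank 2.
print(ndcg_at_k([0, 1, 0]))
```

A single relevant document at rank 1 scores 1.0, and the score decays logarithmically the further down it is returned.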
Install in Claude Code with: claude mcp add semble -s user -- uvx --from "semble[mcp]" semble
Or check our README for other installation instructions, benchmarks, and methodology:
Semble: https://github.com/MinishLab/semble
Benchmarks: https://github.com/MinishLab/semble/tree/main/benchmarks
Model: https://huggingface.co/minishlab/potion-code-16M
Let us know if you have any feedback or questions!
We’re interested in measuring it end to end and also optimizing, e.g. the prompt and tools, for this, but we just haven’t gotten around to it.
1) How do you compare accuracy? By checking if the answer is in any of the returned grep/bm25/semble snippets?
2) How do you measure token use without the agent, prompt, and tools?
e.g. agents often run `grep -m 5 "QUERY"` with different queries, instead of one big grep for all items.
I guess the point we’re trying to make is that you need fewer semble queries to achieve the same outcome, compared to grep+readfile calls.
For example, I have explored RTK and various LSP implementations and find that the models are so heavily RL'd with grep that they do not trust results in other forms and will continually retry or reread, and all token savings are lost because the model does not trust the results of the other tools.
Perhaps anecdotally: we do use this tool ourselves of course, and it's been working pretty well so far. Anthropic models call it and seem to trust the results.
One thing that irks me is that when it doesn't support e.g. a CLI flag of find, it gives an error message rather than sending the full output of the command instead. Then the agent wastes tokens retrying, or worse, doesn't even try because the prompting may make them afraid to run commands without rtk
I'm not likely to install it again in my latest configuration, instead applying some specific tricks to things like `make test` so they emit no output unless the exit code signals failure, that sort of thing. Anecdotally, I see GPT-5.5 often automatically applying context limiting flags to the bash it writes :shrug:
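A rough sketch of that kind of trick, assuming a small wrapper script is acceptable: run the command, stay silent on success, and print only the tail of the output on failure. The name `run_quiet` and the 40-line tail are illustrative choices, not part of any tool mentioned here.

```python
import subprocess
import sys


def run_quiet(cmd, tail_lines=40):
    """Run cmd; print nothing on success, the last tail_lines lines on failure.

    tail_lines is assumed to be a positive integer.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    if proc.returncode != 0:
        lines = proc.stdout.splitlines()
        print("\n".join(lines[-tail_lines:]))
    return proc.returncode


# Example with a command that succeeds, so nothing is printed.
status = run_quiet([sys.executable, "-c", "print('all tests passed')"])
```

In practice you would call it as `run_quiet(["make", "test"])` from a script the agent is told to use, so a passing run costs almost no context.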
And you should disable the savings reporting feature since it’s worse than useless—it breaks sandboxing and always reports ~100% savings for me because rtk obviously doesn’t know about the head/tail the agent pipes into.
I also added the plugins directly to Claude Code:
ty Plugin · claude-code-lsps · enabled
vscode-langservers Plugin · claude-code-lsps · enabled
vtsls Plugin · claude-code-lsps · enabled
~ cat ~/.claude/CLAUDE.md
# Python Environment
- ALWAYS use uv — never use pip, pip install, python, or python3 directly
- Activate venv: `source .venv/bin/activate`
- Install deps: `uv sync`
- Add a dep: `uv add <package>`
- Run scripts: `uv run <script.py>`
- Run tools: `uvx <tool>`

# Long-running scripts
- Any script, command, migration, data job, or test run that may take more than 2-3 seconds should emit regular status updates while it runs.
- Prefer progress that is useful for diagnosing where time is going: current phase, item counts, batch numbers, elapsed time, retry/backoff state, or the external service being waited on.
- For loops or batch jobs, log progress periodically rather than only at start/end; keep the cadence readable and avoid flooding output.

# Code Intelligence
- LSP servers available: ty (Python), vtsls (JS/TS), vscode-langservers (HTML/CSS/JSON)
- Use LSP for:
  - findReferences before any refactor
  - goToDefinition when navigating unfamiliar code
  - diagnostics after edits to catch type errors
- grep/search is fine for simple lookups in small files
- NEVER refactor without findReferences impact analysis first
- After every edit, check LSP diagnostics before moving on

# Documentation
- Context7 is available for up-to-date library docs
- Use `ctx7 docs <libraryId> <query>` to fetch current documentation
- Use `ctx7 library <name> <query>` to find a library ID first

# Working with unfamiliar data or systems
- Prefer experimenting on real data over reasoning about it in the abstract. Your outputs are noticeably better when grounded in a concrete sample than when derived from minutes of speculation.
- When a task involves parsing/processing/integrating with some external artifact (a report, an API response, a file format, a third-party tool's output), the FIRST step is to fetch or generate a real example and inspect it. Do not write code against an imagined shape.
- Experiments must be non-destructive: read-only fetches, copies into a scratch dir, dry-run flags. Never mutate the user's real data to learn about it.
- Before assuming you lack credentials, check the current working directory's `.env` file (and `.env.example` for hints about which keys exist): API keys, tokens, and connection strings for the relevant service are very often already there.
- If you cannot obtain real data on your own (auth genuinely missing, lives on another machine, behind a paywall, etc.), STOP and ask the user to provide a sample rather than guessing.
- Example: asked to process an Amazon sales report, the first action is to fetch (or have the user paste) one actual report and look at its columns, not to draft a parser based on what such a report "probably" contains.
At least codex listens to me telling it to use rg instead of grep, cause grep is often so slow. But when adding rtk it uses grep through rtk which is kind of annoying.
So the model trusts the output because it is grep :D
Could you add fff to the benchmarks?
I will try that! It makes sense and I'm curious to see results, for this or any similar projects mentioned in the thread
Still semble is a few orders of magnitude faster and gave better results against ck --sem. I am running both on rust-lang/rust and CK is going to take hours at least, extrapolating from current stats probably 3 days? Semble: 26 seconds without any caching. The thing doesn’t have a cache and it’s still massively faster. I added caching support and watchman integration and got it down to 1.4 seconds. 3 days is basically not good enough for this use case. It’s slow enough that indexing is going to lag your code changes. Semble is fast enough that it’s not going to be behind.
Tried against an 84K LOC C project. ck took at least 5 minutes to index, but replies are indeed fast. semble indexing (if any) took no noticeable time (except for the first download of HF model, which took a couple seconds), and replied in a couple of seconds.
Unrelated but ck was a pain to install / compile (install instructions do not say you have to lock the build / you have to have latest libc).
I forget the exact tests I used (a couple of the standard agent evals that people use, one python and one typescript because those are what I use).
I don't claim it was an exhaustive test, or even a good one. It's possible I could have spent a day or so tuning my AGENTS.md and the pi system prompt/tool instructions and gotten better results, because if there's one thing running evals taught me it's that subtle differences there can change the results a lot.
However, I got clearly better results with both off, enough to convince me to stop the tests immediately after 3 rounds.
The problem was that while context use did go down (sometimes), the number of turns to complete went up so the overall cost of the conversation was higher.
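To illustrate how that can happen, here is a toy model with made-up numbers. It assumes every turn re-sends the whole conversation as input, and it ignores prompt caching and output tokens, both of which change real bills.

```python
def total_input_tokens(base_context, tokens_added_per_turn):
    """Sum the input tokens across turns when each turn re-sends the history.

    base_context: tokens present before the first tool call (system prompt etc.).
    tokens_added_per_turn: tokens each tool result appends to the conversation.
    """
    total = 0
    context = base_context
    for added in tokens_added_per_turn:
        context += added   # the tool result is appended to the history
        total += context   # the model is then called with the full history
    return total


# Hypothetical: 4 large grep+read results vs 8 small, compact search results.
few_big_turns = total_input_tokens(5_000, [3_000] * 4)
many_small_turns = total_input_tokens(5_000, [1_000] * 8)
print(few_big_turns, many_small_turns)
```

In this toy case the compact tool ends with a smaller context (13k vs 17k tokens) but the conversation as a whole processes more input tokens (76k vs 50k), because the history is paid for again on every extra turn.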
It's made me very aware of one thing: so many people are sharing these kinds of tools, but either with zero evals (or suspiciously hard to reproduce), or in the case of this one, extensive benchmarks testing the wrong thing.
I'm sure this tool does use fewer tokens than grep, and the benchmarks prove it, but that's not what matters here. What matters is, does an agent using it get the same quality of work done more quickly and for lower cost?
We didn't generate this project, we wrote it, a lot of it manually, and trained custom models. We'd been working in the real-time retrieval space for a while, and we thought coding was a good fit for this specific technology.
But I still think you're missing the harder but more important proof which is agent evals. Have you done any of that?
I would personally love to find tools in this space which can make agents more efficient and I do believe there's a scope for massive improvements compared to default workflows. But my evals with RTK and Headroom have made me wary that a tool can look like it should work, conceptually make sense, pass non-agentic benchmarks, and still make an actual agentic workflow worse.
I agree with your point about the evals and how you can get discontinuities: good search can be worse than bad search when agents can do many searches. We’re working on it
is that an issue? the tiny model might not surface something important
I found a nice workaround which is that you can just dump the whole directory into context, as a startup hook. So then Claude skips the "fumble around blindly in the dark" portion of every task. (I've also seen a great project that worked on bigger repos where it'll give the model an outline with stubs, though I forget what it was called.)
It just does what I need and no more: load code into context, append my question or instruction, call LLM, apply patches. Repeat.
I haven't used Aider itself though, maybe it does that too.
My harness was nice relative to Claude Code and Codex because, it doesn't need to poke around the filesystem (cause I have small repos and dump the whole thing), and it makes all edits simultaneously (doesn't need to edit one file at a time).
It reads all files and edits all necessary files in a single round trip.
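A minimal sketch of the "dump the whole directory into context" step, assuming a small repo. The function name, the suffix filter, the `=== path ===` header format, and the byte budget are all illustrative choices, not the commenter's actual harness.

```python
from pathlib import Path


def dump_repo(root, suffixes=(".py", ".md"), max_bytes=200_000):
    """Concatenate matching files under root into one prompt-ready string."""
    root = Path(root)
    parts = []
    used = 0
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if any(part.startswith(".") for part in rel.parts):
            continue  # skip hidden directories and files such as .git or .venv
        text = path.read_text(encoding="utf-8", errors="replace")
        block = f"=== {rel} ===\n{text}\n"
        size = len(block.encode("utf-8"))
        if used + size > max_bytes:
            break  # stop at the first file that would exceed the budget
        parts.append(block)
        used += size
    return "".join(parts)
```

The result can be prepended to the question or instruction before the single LLM call; the byte budget is a crude guard against accidentally dumping a large repo.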
The really nice thing is that when you're making many small fine grained changes like that, you can use a much smaller, faster, cheaper model.
And if it's fast enough, it actually becomes a real-time activity. It's not "prompt, wait..." but "prompt, immediately get the result." It's interactive. You stay active and engaged. It's great.
Although for small codebases it also holds that whatever you would like to find is easy to find, so search still might help you with cost
"Answer this question by only using the `semble` CLI (docs below):
> What tools does Browsercode provide to the agent other than the base OpenCode tools? Provide the exact schema for tool input and tool output and briefly summarize what they do and how they work
---
[the AGENTS.md snippet provided from https://github.com/MinishLab/semble#bash-integration]"
And the equivalent for the non-Semble test:
"Answer this question by only using the `rg` and `fd` CLIs:
> What tools does Browsercode provide to the agent other than the base OpenCode tools? Provide the exact schema for tool input and tool output and briefly summarize what they do and how they work"
In both cases, I used Pi with gpt-5.4 medium and a very minimal setup otherwise. (And yes, I did verify that either instance only used rg & fd, or only used semble.)
Without Semble, it used 10.9% of the model context and used $0.144 of API credits (or, at least, that's what Pi reported - I used this with a Codex sub so cannot be sure). With Semble, it used 9.8% of the model context and $0.172 of API credits. The resulting responses were also about the same. Very close!
I tried one more test in the OpenCode repo. The question was:
> Trace the path from 1) the OPENCODE_EXPERIMENTAL_EXA env var being set to 1 to 2) the resulting effects in the system prompt or tool provided to the OpenCode agent.
And I included the same instructions/docs as above. The non-Semble version was a bit more detailed -- it went into whether the tool call path invoked Exa based on whether Exa or Parallel was enabled for the web search provider -- but w.r.t. actually answering the question, both versions were accurate. The Semble version used 14.7% context / $0.282 API cost, while the non-Semble version used 19.0% / $0.352. Clearly a win for Semble for context efficiency, but note that the non-Semble version finished about twice as fast as the Semble version.
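Putting the two tests side by side as relative changes, using only the figures reported above (the dictionary labels are shorthand for the two questions):

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (context %, API cost in USD) as reported by Pi for each run.
runs = {
    "Browsercode tools question": {"rg+fd": (10.9, 0.144), "semble": (9.8, 0.172)},
    "OPENCODE_EXPERIMENTAL_EXA trace": {"rg+fd": (19.0, 0.352), "semble": (14.7, 0.282)},
}

for name, r in runs.items():
    ctx = pct_change(r["rg+fd"][0], r["semble"][0])
    cost = pct_change(r["rg+fd"][1], r["semble"][1])
    print(f"{name}: context {ctx:+.1f}%, cost {cost:+.1f}%")
```

That works out to roughly 10% less context but 19% more cost on the first question, and roughly 23% less context and 20% less cost on the second. With one run each, these are single data points rather than a trend.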
Of course this is just me messing around. ymmv.
I will have to add this as a comparison to https://github.com/boyter/cs and see what my LLMs prefer for the sort of questions I ask. It too ships with MCP, but does NOT build an index for its search. I am very curious to see how it would rank seeing as it does not do basic BM25 but a code semantic variant of it.
This seems to work better for the "how does auth work" style of queries, while cs does "authenticate --only-declarations" and then weighs results based on content of the files, IE where matches are, in code, comments and the overall complexity of the file.
Have starred and will be watching.
These agentic AIs are already smart enough to figure out a highly optimized path to code exploration or search. But with these tools they just go very aggressive, partly because the search results from these tools, in almost 100% of cases, do not furnish full details, just pointers.
To confirm this behaviour, I did a small test run. This is in no way conclusive, but the results do align with what I have been observing:
---
Task: trace full ingestion and search paths in some okayish complex project. Harness is Pi.
1. With "codebase-memory-mcp": 85k/4.4k (input/output tokens).
2. With my own regular setup: 67k/3.2k.
3. Without any of these: 80k/3.2k.
As we see, such a tool made it worse (not by much, but, still). The outputs were same in quality and informational content.
---
Now, what is my "regular setup" mentioned above?
Just one line in AGENTS.md and CLAUDE.md: "Start by reading PROJECT.md".
And PROJECT.md contains just the following: a 2-3 line description of the project, all relevant files and their one-line descriptions, any nuances, and finally it ends with this line:
## To LLM
Update this file if the changes you have done are worth updating here. The intent of this file is to give you a rough idea of the project, from where you can explore further, if needed.
Anyways, I made it work by making it generate relevant doc (using semble init), and then copying this into AGENTS.md, and then prompting it with this line:
""" Start by reading AGENTS.md in current folder. Now, the task::: `Explore the ingestion and search paths. Do not read README.md at all`. Prefer to use `semble` search for code search. Do not do new installation. semble is already available at `/Users/nitinbansal/.local/bin/semble` . """
The results are much better. Even better than my own setup, but, vary a lot. I did 4 runs:
95k/2.9k
25k/2.7k
71k/2.9k
37k/4.0k
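A quick summary of the four runs above, using only the reported numbers (in thousands of tokens). The single-run baselines are the earlier figures from this test, so the comparison is rough at best.

```python
from statistics import mean, stdev

# Input/output tokens (thousands) for the four semble runs listed above.
semble_input_k = [95, 25, 71, 37]
semble_output_k = [2.9, 2.7, 2.9, 4.0]

# Earlier single-run input totals (thousands) for comparison.
baselines_k = {"codebase-memory-mcp": 85, "PROJECT.md setup": 67, "no tools": 80}

avg_in = mean(semble_input_k)
spread_in = stdev(semble_input_k)  # sample standard deviation
avg_out = mean(semble_output_k)

print(f"semble input: mean {avg_in:.1f}k, stdev {spread_in:.1f}k")
print(f"semble output: mean {avg_out:.2f}k")
for name, tokens in baselines_k.items():
    print(f"{name}: {tokens}k input (single run)")
```

The mean input is 57k tokens with a sample standard deviation of about 32k, so the run-to-run spread is larger than the gap to any of the single-run baselines.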
Hasn't been my experience. We used to use Augment Code at work which has a thing called Context Engine - basically an MCP that can answer natural language queries about pre-indexed code. Then we switched to Claude Code, which for some reason prefers to use sed to read from files using line ranges from its own memory (this despite having a range-capable read tool). I don't know, does that really mean that sed is the highly optimized path?
Also, it'll run a formatter, read, edit to undo auto formatting and then continue on its merry way. What is the point of that??? Lol
> And PROJECT.md contains...
…Why not just use that PROJECT.md as the AGENTS/CLAUDE.md?
Also, I don't want to keep my project's details in those files, but keep them separate.
With the current setup, a single line in both satisfies all constraints and requirements.
> Our tool uses 99x fewer tokens and delivers 88x better results.
Okay, great, but...
1) It's VERY difficult to quantify that something is better.
2) They almost never post how they measured how much better it is and what the margin of error might be.
3) I assume they are incompetent and don't even try the tool.
Like you pointed out, the odds these things make agents worse is FAR higher than they make them better.
Not saying it's impossible, but if it was possible on the scales they are claiming, it probably would already be done, or put into the next release of the agents...
It's unfortunately a nearly impossible task, as the models change regularly (without letting you know), so you have a moving (invisible) target that's 1) hard to test exhaustively, and 2) very expensive to test with any low margin of error.
This is why no one does it and just makes broad sweeping unverified claims instead.
If you figure out how to do it... You should probably just get a job at Anthropic or OpenAI and make $2M+ per year...
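To put a rough number on "very expensive to test with any low margin of error": a sketch using the standard normal-approximation sample size for a proportion. It assumes each agent run is an independent pass/fail trial, which is a simplification; `p=0.5` is the worst case and `z=1.96` corresponds to about 95% confidence.

```python
import math


def runs_needed(margin, p=0.5, z=1.96):
    """Runs needed so the confidence interval on a pass rate has half-width margin.

    Uses n = z^2 * p * (1 - p) / margin^2, rounded up.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


print(runs_needed(0.10))  # +/- 10 percentage points
print(runs_needed(0.05))  # +/- 5 percentage points
```

Under these assumptions that is 97 agent runs for a 10-point margin and 385 for a 5-point margin, per configuration and per model version, which is where the cost comes from.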
```
- For planning, prefer using morph-mcp `codebase_search` - subagent that takes in a search string and tries to find relevant context. Best practice is to use it at the beginning of codebase explorations to fast track finding relevant files/lines. Do not use it to pin point keywords, but use it for broader semantic queries. "Find the XYZ flow", "How does XYZ work", "Where is XYZ handled?", "Where is <error message> coming from?"
```
(see also https://news.ycombinator.com/item?id=48205911; having higher quality results at the beginning of a thread seems to improve the output vs. having faster search later on).
Also curious what the authors think about Claude team explicitly trying out indexing and deciding against it.
The tool itself is fully local though, so there's no real security risks there, there are no outbound network calls or anything like that.
My observation is that greps and the processing of grep outputs account for only a small portion of overall consumption; I haven't measured this scientifically though.
Nice!
For example, an AI would already use linux commands like tree to traverse the code base. And again it already has good training in this.
The other problem is that it is easy to cook up examples which demonstrate the efficacy of tools like these, but much harder to actually prove that the cognitive deficit such tools result in is outweighed by their efficacy in long horizon runs. My first contact instinct is that this will result in a net negative 'deployable intelligence' over long horizon runs, i.e. make the agent perform worse than using existing tools.
Proving the opposite is a non-trivial problem - but maybe it might be something you want to take up.
So are we supposed to believe that grep is so wasteful that models are reading 98% useless garbage every time they call it? Either this claim is not representative, or you're missing something else when you throw away the vast majority of context for the model.
I suspect this comparison is against reading the whole codebase though compared to just getting the bits you need.
Depends on the size of the project and specific files. I have definitely seen agents make smart use of pi's "read" tool, which can take an offset and line limit (or defaults to a max 2000 lines/50KiB if the model doesn't specify). The bash tool also has the same max output, so if a model decides to cat instead of using the read tool it still won't blow out its context window with a single large file read.
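For illustration, a bounded read along those lines might look like the sketch below. This is not pi's implementation; the function name and the 0-based offset are assumptions, and only the 2000-line/50KiB defaults come from the comment above.

```python
def read_window(path, offset=0, limit=2000, max_bytes=50 * 1024):
    """Return up to limit lines starting at 0-based line offset, capped at max_bytes."""
    out = []
    used = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f):
            if i < offset:
                continue  # skip lines before the requested window
            if len(out) >= limit:
                break  # line cap reached
            size = len(line.encode("utf-8"))
            if used + size > max_bytes:
                break  # byte cap reached; stop before the line that overflows
            out.append(line)
            used += size
    return "".join(out)
```

Whatever the model asks for, a single call can never return more than the two caps allow, which is what keeps one large file from filling the context window.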
But this sort of thing is going to vary with harness, model, project, and whatever the RNG delivers for the day.
codex-cli hangs when calling this through the MCP. The semble process even sticks around as a zombie, forever stalled out. No idea why, logs have nothing.
When called through a skill via CLI style calling, GPT 5.5 loves to give a ton of search terms like it is used to doing with ripgrep. Not sure how effective this is; the short docs on GitHub and the instructions the agent has aren't clear on what is optimal.
Lastly, I got some errors with external connections to GitHub when I was installing it for bash use. Maybe it's related to the hanging? No idea.
edit: My agent also loves to follow-on with ripgrep, which seems redundant. Acts like it has trust issues. I think a more extensive agent skill description could guide the agent into proper use.
Before you had faster implementation times, something would take six weeks to implement. Feedback from the client about how far off target you were came through in the same amount of time: a help desk ticket, a post-call check-in, a quarter end review. The price you paid for being off target was proportional to how long it took to figure out.
Now, when you can ship features in an afternoon, the customer feedback loop remains the same speed. Surveys, help desk tickets, and churn analysis come back days, even weeks later, by which point you've shipped five new features going the same way.
You can fix the internal bottlenecks easily enough: write better specs, have faster test cycles, deploy continuously. The customer feedback loop bottleneck is built into the system. It won't get any faster just because implementation did.
Today most organizations are busy fixing the internal bottleneck, but not the external one.
When grep does not find a file of interest, the agent does not fail; it will continue working on an incomplete context. For a monolingual code base, the miss rate is okay. In case of polylingual code (Python backend code and TypeScript frontend code), the problems emerge when it comes to querying for cross-file dependencies. Grep will return a route from the backend API. However, there is an interface in TypeScript that needs to be matched. Agent generates a response that does not fit the type. Correction cycle is one; two if the type conflict is ambiguous.
Combining grep with the understanding of semantic relations between files is a solution. Number of tokens saved is real but underestimates the actual benefit since fewer correction cycles are more valuable than tokens themselves.
Burntsushi (author of ripgrep), please chime in!
> The clearest result was that faster search alone only modestly helps, while better-ranked results improve first-query retrieval and help agents find the right code sooner.
Their tool "pgr" is a research preview only, so it'd be interesting to see semble vs pgr.
I'm also collecting other tools that are similar; the most notable is probably Morph's WarpGrep (has a free tier too). Apart from that, there is codemogger (https://github.com/glommer/codemogger) and cs (the author also commented in this HN post).
In the similar area, but not fully related, the author of fff is also pretty involved in any thread that goes into that direction (see e.g. https://x.com/neogoose_btw/status/2052161471296225710). Similar to colGREP is also mgrep (by mixedbread) and osgrep (but they seem to predate colGREP). I also found codedb on X (https://codegraff.com/blog/codedb-code-intelligence), the post reads well, but haven't tried.