if they say it's 4.7 comparable, it anchors that into your head as the model to evaluate against.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
The pool of people reading such articles while ignoring such details can't be big.
On Hacker News I wonder if most people even opened the article at all most times.
e: which itself is a modification of RTFM from usenet
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
This even applies to OpenAI & Anthropic who don't even eval on the same datasets a lot of the time.
Which is fine, we all have to make money, but it is disingenuous. It's just unfortunate that running some of these benchmarks is so expensive that it's not really realistic for most companies to actually run them.
4.7 is much better. But perception is a funny thing, once you think something is bad you start looking for it everywhere.
I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?
In October/2024 I got my Mac studio M1 ultra with 128G, IIRC it was ~$2500. With recent prices explosion, it has certainly gotten more expensive. https://frame.work/ is selling 128G strix halo mainboard for $2700, but you have to add storage and case.
Non-NVIDIA backends tend to get less support and new features land slower, or features that are expected to improve performance wind up hurting it instead. That sort of thing.
For basic “token in/token out” workloads without fine tuning, it’s probably fine ??
Unfortunately, the prices rose on these a lot, but unevenly. Beelink GTR 9 Pro is $4400, Framework Desktop is ~$3500, for what is basically the exact same mainboard as a Bosgame M5 for $2800.
Apple's M5 Max is another attractive option. Apple silicon traditionally had great MBW and was good at TG, but struggled with PP, but the new neural engines in those GPU cores have made a big difference in a good way here.
Gorgon Halo is rumored for June announcement with Q4'26 release with basically +100 MHz clocks on Strix Halo, LPDDR5X-8533 instead of LPDDR5X-8000, but more importantly, 192 GB max instead of 128 GB.
I'd say it's better to wait for Gorgon Halo than to grab Strix Halo now. However, Medusa Halo, rumored for H2'27, is slated to have up to 26c Zen 6 (heterogeneous cores - kinds funny that AMD is heading towards these as Intel retreats from them), 48 CU of RDNA 5 instead of 40 CU RDNA 3.5, and a 384 bit bus w/ LPDDR6, which should make 256 GB at more like ~490-600 GB/s MBW, which will really make Strix and Gorgon Halo obsolete.
Also worth keeping an eye out for Serpent Lake (intel CPU + nvidia iGPU on a single board with unified memory, rumored for 2028-2029 iirc), and on the 160 GB Crescent Island Intel dGPU.
- Your RTX 6000 is closer to $10k now
- Sparks are creeping into the $4-5k range
- AMD Strix are ~3.5k
- Apple depends on chipset and memory. Sweet spot would be 128gb M3 Ultra, probably $6-8k but admittedly haven't been tracking closely. New M5 might come in the fall. You can get a new 128gb M5 Max laptop for ~5-6k today.
- a 4x3090 rig would take $5-6k
Every platform has tradeoffs, but it's mostly ecosystem, memory bandwidth, and power consumption. They're all slow. The best option is likely to rent hardware on Runpod. The RIO on self-hosting is very low unless you have a specific need or you're ok treating it as a hobby.
>The best option is likely to rent hardware on Runpod.
Vast.ai is much cheaper, but the broader point here is contestable. The only dimension in which cloud GPU rentals win is cost. You lose the confidentiality, integrity, and availability benefits of local deployments.
Not that I'd encourage anyone to throw large amounts of money to have access to LLMs, but you're definately going to be better off buying something that you can amortize over multiple years with a multi year warranty.
This whole thing is really starting to remind me of the crypto hype phases of 2016-2018 when everyone thought their investment in GPUs was going to make them rich.
Yes, LLMs are sloppy, and local models usually more so (but things change fast).
But the local ones have one big advantage: they are private. So you can safely feed them the collection of your private documents and things you wouldn't trust people like sama with. The fact that some people do not care is one of the failures of our educational system.
The server edition has gone up $2K in the last couple of weeks alone, at the outlet where I bought one previously.
I've created a 2.54BPW quant that fit on my hardware with 128k context, 20 tps tg and 200tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
But I was not super impressed with deepseek 4 flash using it from the official API either, so it doesn't seem quantization fault. It is a good model, but nothing out of the ordinary in the few benchmarks I ran on it (with full awareness that benchmarks are biased).
If by ROI you mean saving more money than using paid APIs, then I don't think it is worth it. All you gain is full sovereignty over your AI usage.
Running w/ Cursor and doing some "nights and weekends" type coding / conversations, I was hitting $100-200 of usage within a few weeks. I know there's probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 => $40 => $80 => $120 (and then I stopped spending! :-)
Messing around with local-llm, I've settled on `omlx` and `gemma` for "conversational", and I think it's `qwen-120b-a3b-6bit` or something for the "heavy hitter". Gemma "gets it" a lot more, whereas that particular `qwen` tends to fall into the "MuSt WrItE CoOooDeee!" behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.
My POV is: "Local for slightly slower/casual usage", the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!). "Gemma for discussion and emitting DESIGN-... docs", and "Qwen for converting DESIGN-... to PLAN-...", (as well as implementation, but generally from a fresh context loading the relevant PLAN-... or supporting docs)
...then supplement that with direct Cursor usage in case I screw up some setting on being able to get the local LLM working, or if I need to include literal web-research or really having access to some SOTA model. Using the pi-coder harness locally, web pages are kindof a difficult conundrum as they can be kindof gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more "stuff" you put into the agent, the less context window (and memory!) you have available, so it's a real balancing act.
The other biggest problem is that you're limited (locally) to ~20-80tps and in some cases you have to chew on or "swallow" the whole prompt up to that point if you end up with some sort of cache miss (TTFT). The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generated tokens, but sometimes if I have two agents going (eg: Gemma talking shit about Qwen's output or vice versa) in a longer context window, then you'll take that hit.
"Other people's compute" is definitely more freeing, but even looking at $200/mo usage that's $2400 vs. the ~$6k for a maxed out MBP. Call it $2500 vs. $7500 and you'd say that "local AI gives you a 3-year amortization window for a slower, worse experience" ... but if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!
In some ways, seeing the "advantage" of having the local 128gb capacity for LLM, I'm semi-wishing I'd have gotten a mac mini instead, but then I can't quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.
If it were a mini running locally, I'd feel more comfortable calling it the always-on "AI brain" to process my emails, run crontab summaries, whatever kindof "open-claw-ish" stuff that you could do w/o relying on having to "keep the laptop lid open all the time". I'm sure there's ways to repurpose things, but longer-term, call it even 3-5 years from now... any sort of 128gb machine will be more than capable where you'd want to have one "doing stuff" locally within your home network (IMHO).
>"...if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!"
^ this resonates, loudly.
Again: I'm finding waaaay enough utility that I'm tempted to invest more "CapEx" and get a used system for day-to-day, "always on" local work... but more literally, that's probably a better job for "OpEx"! Tune my "crontab" work against local models and then max out at a $1/day budget slaved to an always on RPI connected to ethernet at home.
$365/year of off-site AI lasts 10 years before I come close to recouping the hardware (and electricity) costs of having "yet another device" purchased and turned on 24x7... and certainly there will come a day when you go to the store and buy a $200-500 "TITO" device (Tokens In => Tokens Out) that plugs into a ~30-60W USB-C port before then.
If you're using HF tokens (or "rent-a-A100" or whatever), are always connected to home ethernet (Sun Microsystems: The Network IS the Computer), and maybe supplement with a Kagi backend for attaching to the raw internet then you get _most_ of the surety of "my queries are private" unless you're locally hacked or are the target of nation-state scrutiny. :shrug:?
Keep in touch if you end up doing something cool with all this! $USERNAME@yahoo.com (and hopefully I'll have my AI setup filtering out all the viagra spam before then!).
It is higher than 110GB. MacOS allows up to 125G of the RAM to be shared with GPU, so it is certainly less than that!
> HF link is broken though!
Doesn't seem broken to me, but you should be able to search for tarruda/Qwen3.5-397B-A17B-GGUF on huggingface.
Oddly enough, though, Qwen 3.6 35B A3B and Gemma got some really good reviews, despite being way smaller than any of these ones.
Qwen 3.5, 122B A10B: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
Qwen Coder Next, 80B A3B: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
It's kinda weird that DeepSeek V4 Flash is supposed to be 284B A13B, but shows up as 158B in HuggingFace, probably some weird bug: https://huggingface.co/unsloth/DeepSeek-V4-Flash and that's not even just Unsloth but like the official source too https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash (so also doesn't fit the category unless you get a heavily quantized version to run, but cool regardless)
Mistral Medium 3.5 is interesting because it's 128B but dense, so probably too slow for most folks: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
GPT-OSS, 120B A5B: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
The setup I had to do was important and I had to compile koboldcpp with a few special params for my hardware, I mostly just had Claude figure it out. I don't remember everything I did now but it was very slow and would often stop mid task, it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I actually am able to use this how I use Claude Code. Disclaimer, I am pretty explicit with requirements, I imagine this fails more when you leave it to figure out things on its own but for my flow its pretty rad.
Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.
Command I am using to run: python koboldcpp.py \ --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \ --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \ --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5
It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.
It struggles on some complex tasks in larger code bases, e.g. using to debug and fix bugs in llama.cpp it gets close to working code but often introduces errors. For such tasks its still very useful as a search/explore tool and drafting fixes.
Totally understand why it may not be reasonable or in their best interest (and that the US is _absolutely_ not doing the same reflexively). But it would be lovely to be able to try these out on production workloads in earnest.
In an ideal world U.S. residents would use Chinese AI models and Chinese residents would use U.S. AI models.
Governments in both countries are collecting data for nefarious reasons. But the Chinese government has far less influence on a U.S. resident and vice versa.
We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
As Americans go through life, some of them will become people with power. When you need to leverage that power, having the right knowledge about them can effectively transfer that power to you.
Tiktok was a goldmine, because every 20-something on their way to a future position of power was uploading every single facit of their digital life to CCP servers everyday.
On the other hand, there's other models where the source is 100% open, the training data is known, and people have reproduced the same model from scratch, so while those trail behind, there's definitely an effort to make models more open and capable.
It's not very subtle manipulation either; ask qwen of Taiwan is a part of China in German and in English and only the English answer will be party-approved.
I think it's borderline naive to assume various agencies haven't infiltrated OpenAI, Anthropic and others, essentially the entire world was wiretapped by NSA in the past, to assume they don't have an employee or two at these companies does seem a bit naive to me.
Are they? They don't behave like it.
It's highly improbable that the US government has a secret team inside Anthropic and OpenAI manipulating their training regimen.
Two thoughts.One: it would be relatively technically trivial for $GOVERNMENT_AGENCY to just monitor all the prompts + context we send over the wire to OpenAI/Anthropic/etc. That's a goldmine of sensitive personal and corporate data, no secret team needed (although, the LLM providers obviously would need to cooperate)
Two: Rather than secret infiltration teams influencing model training I think what's more likely on the training side of things is simply self-censoring by the LLM providers, so that they don't risk angering the government.
I highly doubt that China has government interlopers, secret or otherwise, inside Qwen's training team. Nonetheless, "sensitive" issues like Tiananmen Square are censored. I would imagine that much/most such censorship in China is self-censorship that doesn't leave a legal/paper trail. That's what we're in danger of seeing (more of) in America IMO.
I take this for granted given Room 641A https://en.wikipedia.org/wiki/Room_641A
Thus, I’ve pondered whether anything they’ve learned has changed the world / had a big impact (like on their understanding of human psychology, perhaps per region). They’ve heard phone calls, they’ve read emails, diaries get brought to court… but these are systems that would be used like diaries but also prompt users for more and more.
You don't need a secret team to manipulate whats coming from them: https://responsiblestatecraft.org/israel-chatgpt/
I've certainly used these models without wifi without any differences.
A lot of people are purchasing access via Alibaba Cloud directly, or indirectly by companies which host the model.
Sure, that is until each government's dataset is interesting enough to the other to facilitate a data-sharing agreement.
There's gotta be an internet "law" that says something like "Eventually, the data you volunteer to a benign 3rd party eventually winds up being used against you by someone". This is short-term thinking at it's finest.
It's not nearly worth it to me to get an incremental improvement in performance if it means I have to move to hosted environments with Qwen 3.7 (or Claude or Gemini or whatever).
If you use a service outside your country, I believe you could have all your code stolen and get hacked/exploited in a way that would be totally legal.
Even if they weren’t individually worried about their proprietary data being shared with Chinese domestic competitors or with government… their audit / security programs likely wouldn’t allow it for a _huge_ range of types of data.
China has more integration between intelligence and industry than many western countries, and it does present a higher risk of unwanted “tech transfer” to industry than running on oracle or Google or ms or Amazon does in the US.
DHS has long staffed full time agents in California to deal with foreign IP exfiltration - using qwen is like fast/easy mode for IP exfiltration: why make anyone get a job in your palo alto office when you can just send it to them in Hanzhou?
Upshot - If you have something proprietary you’re working on I would generally advise not to just direct send it to Alibaba.
That's exactly the fear, and why would it not be logistically feasible? The threat is definitely a bit overhyped, but China has a longstanding track record of aggressive corporate espionage.
This made me think of a Seinfeld episode: "I didn't know it was possible not to know that."
Tiananmen Square is the first place to start.
What do you mean? This is not self hosted, it's closed source. And any website that targets China or is hosted in China will probably censor Tiananmen Square.
BTW - They’re still censoring human rights violations.
Similarly, try talking to Nemotron about Epstein and see how quickly it shuts down.
Europe's sense of superiority and actual global importance/relevance is assbackwards.
Hilarious thing to say when half this comment section is Americans giving so much of a fuck that they consider China-adjacent hosted models unusable due to the supposed risks. If what you were saying was true then those pragmatic Americans would just use whatever is most effective.
The Americans can cry about Chinese censorship and turn around and use Claude or Opus or Gemma or whatever, but the Europeans just throw a fit and then have to use one of the two anyway. And that whole crying about something while being completely helpless vis-a-vis doing anything about it is the definition of Europe so far this century. Globally irrelevant outside Germany.
"Tennessee man jailed 37 days for Trump meme wins settlement after lawsuit" and "The FBI Wants to Buy Nationwide Access to License Plate Readers"
Gotta love how the US is the bastion of free speech, justice and liberty!
https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the datasetor no?)
So I feel like that's exactly the right metric and the way to track it wrt hallucinations.
We want the hallucination rate to decrease while the overall answer rate of queries remains sufficiently high. For more specifics, look into ROC and AUC.
But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)
The glue on pizza reference brought back memories :)
Hallucination detection is an open problem. If it were that simple, people would indeed "just" do it.
Basically the problem is that LLMs aren't trained on things they don't know; an alternative way of saying this is that they're not trained on things they're not trained on, which is obviously true.
When you RL a model and it answers incorrectly, you don't teach it to answer "I don't know", you teach it to answer correctly instead. This makes it very hard for it to realize when it doesn't know things.
https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
It rewards correct answers and penalizes hallucinations, and finally no reward for refusing to answer.
It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
Cool, precisely the thing other AI is too stupid to do when they don't have the necessary knowledge.
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
Or would you describe your methodology as more like picking a random sentence fragment as an input value then generating completions from your existing corpus without any post-input "learning" process related to the rest of the source material?
Running Step 3.5 Flash locally for example, it's an amazingly capable model all things considered, but it's token efficiency is so bad that it gets out performed by most others wall-clock time (even with my MTP-support for it hacked in to llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and only gets it from 20tk/s to 30tk/s on my Spark)
The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5 they use many more tokens to get to the same answers.
I'm really hoping that Qwen 3.7 is better in this regard, can't wait to try it out
(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
Nvidia models are even worse than Qwen! https://sql-benchmark.nicklothian.com/#token-efficiency-and-... (mouse over the cells for token counts and click for traces)
Gemma 4 is good for this, as AA notes:
> Gemma 4 31B is notably token efficient, using 39M output tokens to run the Intelligence Index vs 98M for Qwen3.5 27B (Reasoning). This is ~2.5x fewer output tokens for a model scoring 3 points lower. For context, the other models at the 42-point intelligence level also use significantly more tokens: MiniMax-M2.5 (56M), DeepSeek V3.2 (Reasoning, 61M), and GLM-4.7 (Reasoning, 167M)
https://artificialanalysis.ai/articles/gemma-4-everything-yo...
...except its notably worse at coding in an agent context even with a harness setup to do exactly what Google says it should do (wrt. to sending summarised thinking back and so on)
So despite it being far better token efficiency wise, it's just worse for what I need to use it for compared to DSv4 Flash or Qwen 3.6 27B
Such a shame, too.
/Users/gcr/llama.cpp/build/bin/llama-server
-hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
--no-mmproj-offload
--fit on
-c 65536 # edit to taste
--reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
--sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
-ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
Take backups and then go have fun. Hope this helps.
LM-Studio doesn't support certain parameter combinations. For instance, LM-Studio supports KV quantization....but if you're using the MLX backend, you can't set the context length when KV quantization is used? Why? Running a model with certain settings requires keeping a little SAT solver going in your head. I found that overwhelming, so I just stopped using it.
The Ollama devs want to offer a central curated experience, but I perceive their approach as "playing fast and loose." They've re-implemented unique code for every model they support in their own Go runtime, so certain parameter choices aren't supported. On my hardware, their MLX backend just doesn't work at all without segfaulting the server process for example. It doesn't smack as vibe coded the way oMLX does, but it also doesn't smack as professional or battle-tested.
Ultimately, just dropping down to llama-cpp's GGUF model support and asking for default settings has provided faster inference speeds than anything I've been able to benchmark with them, but everything's within 10% of each other anyway so it's not a huge deal for me.
Are there any resources to help me figure out how to best optimize my runtime paramaters for a given model, based on a given task, similar to what you've shown?
I've been a little... irritated? that hooking vscode up to my company LLM subscription seems so much more out-of-the-box capiable than what I can get to work. My assumption at the moment is that I need to create a lot of... I think they're called harnesses? agents? workflows? integrations? (not sure) by hand. Is that accurate?
Right now I have ollama running an nvidia nano model and I can poke it with a stick over a web interface I installed. It works, initial token response is slow, after that it seems fine enough.
I can't seem to get a good handle on how much context I've used, when context usage starts to degrade response accuracy, or in general how to mirror the results I get (not in terms of accuracy or speed, just features) from the company github copilot + vscode integration.
I was also trying to get a plugin called qodeassist working via qtcreator, mixed results there as well.
I've been keeping up with this space since the jump, never paid for a sub, work gave me a sub a handful of weeks ago, so the actual useage is all new to me.
I can't say I'm super impressed with any of it relative to the hype, but I found it neat to be able to point vscode at a c++ codebase and say "enable wextra, build the code, tell me if there is any low-hanging fruit I can clean up" and get a useful response.
I also asked my local model to turn a picture of my dog into a picture of an otter, got a blank picture back, which the thinking bit told me it would do. The whole thing was actually kind of funny. "I am allowed to edit pictures, I can't edit pictures, I am allowed to edit pictures, I'll tell the user I did and send a blank picture back because I can't edit pictures, but I am allowed to."
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.
What sort of speed should I be expecting?
I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.
Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)
I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
- A 27B "dense" model
- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.
For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.
The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.
Obviously bigger != better but I don't know what the differences are.
* _0 and _1 do not use K quant and scales 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.
* K quants do something similar, but now splits blocks into subblocks inside a superblock where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using less bits.
* K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.
* Dynamic quant on top of this scales entire layers, full of blocks, according to how much they effect various measurements (such as KLD and perplexity).
That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M only is meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8bit in their tensor units (but it doesn't).
Unless you're specifically trying to cure a problem, stick with K_XL.
A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.
If you’re curious about what a particular switch does, clone the llama-cpp repository to your computer and try asking your favorite pet rock prompts like “This is llama-cpp. Can you look at what the -ctk parameter does and explain to me?” Giving Claude/codex/whatever access to the actual code goes a long way, but it is just one opinion.
If you’d like to learn how transformer-based language modeling works in detail, I suggest starting with chapter 0 or 1 of https://arena-chapter0-fundamentals.streamlit.app/ depending on your skill level, then use that to work your way to reading research papers.
Graduate students who study these topics are generally as annoyed by the “get rich quick” style of advertising as you are, so the deeper you go toward academic research the quieter those voices tend to get, mercifully. That said, this is balanced by the unfortunate fact that top labs have strong posturing signals they try to send, so it can be hard to see which preprints actually have good ideas, which are trying to promote their group’s tech instead of doing science out of curiosity, and which have authors who’ve innocently deluded themselves into overfitting their own pet projects. Read widely but adversarially, test everything but hold fast to the good stuff, etc etc
Its not amazing at compute (yet is a member of the GCN family, which I have been a fan of since its inception) and ended up being too expensive for perf/$ and perf/watt.
The only thing it did was make Nvidia rush Series 10 out the door and make it too good. Nvidia has been unable to live up to the gen-to-gen uplift Series 10 did, all because AMD made Nvidia blink.
Basically, you're 2 gens too early. CDNA2/gfx90a is the minimum you need to get any meaningful performance out of inference, or maybe CDNA1/gfx908 if you really don't need to quantize at all.
BTW, I did suggest this elsewhere in this HN story, but have you tried just disabling KV quant entirely? That is a huge speed uplift for compute-poor users.
Also, llama.cpp's support for gfx906 is probably never going to as good as it is for other cards, and good ROCm support for cards before they rebooted the driver/stack team is probably never going to materialize. I don't see the point in hanging onto them.
Like, if I was in your place, replacing it with even a 9060xt, with half the RAM, would be a step up. They go for $450. People have been building dedicated inference machines with these and they've been amazing, just throwing in 3 or 4 in, and scaling VRAM to meet needs.
But as models are starting to pack more information into less bits, some weights are just going to end up becoming super important and very sensitive to quant. So, I'd just move down a Q size, and continue with K_XL. Like, I'm betting Q3_K_XL will beat Q4_K_M on any given model in real world testing, even though its ~20% smaller, but perform worse on benchmaxxing.
The only exception I could think of is quantizing small models, like, my testing on Gemma E2B/E4B and Qwen 3.5 9B, quantizing at all was super noticeable... they can't spread the error across more weights.
Good news (at least for me), 24GB of VRAM is enough to store either of those in BF16 and then a ton of room for F16/F16 KV cache.
Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.
That's the dense model, you probably want a mixture-of-experts (MoE) one.
Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:
Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).
Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
So I’m assuming I’ve done something wrong along the way, but I’ve not had time yet to explore it.
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp
and LM Studio on my M2 MacBook Pro 32GB machine
last week, and I barely get a token a second with either.
The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)
Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.
EDIT: I run with context wired at 64K
On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.
By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.
Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.
https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/
Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.
I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.
Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.
Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.
Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.
In my experience Sonnet bills can be higher than Opus because it churns a lot more trying to get things right.
Example from my fairly simple but agentic benchmark:
Opus 4.7, 25/25, 81c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
Opus 4.6, 24/25, 61c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
Sonnet 4.6: 24/25, 41c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
I only tested the free OpenRouter version of Qwen 3.6 Plus, and it scored 23/25: https://sql-benchmark.nicklothian.com/?highlight=qwen_qwen3....
This doesn't quite show Opus cheaper, but it isn't the 10-20 times more either. Harder tasks close the gap even further.
This is not an open model
(As a reference, DeepSeek v4 is severely throttled on these proxy services.)
I couldn’t say how throttled it is, but it seems fine?
Good balance of intelligence and speed.
I had a Google Pro account that I inherited from buying a Pixel 9 XL - it's free for a year after a flagship Pixel phone purchase. After a year they started charging for it, and i tolerated it, because Flash was usable in Antigravity for dumb auxiliary tasks that I did not want to waste GPT/Opus on. It had a separate generous quota from Gemini 3.1 Pro. Now with Flash 3.5 they combined the quotas with Pro, such that on a Google pro account you can work 4-5 hours per week in Flash. And by the way, 3.1 Pro is useless for programming, compared to Codex/Opus
I think they envision Pro plan as "just a taste of AI, enough to lure folks into the Ultra plan" but that won't work for me when Codex is half the price and DeepSeek 4 Flash is 1/10 of their price per task.
So I'll downgrade just enough to keep my Google Drive space. And use DeepSeek 4 as workhorse plus Codex or Copilot for advanced stuff.
https://marketplace.visualstudio.com/items?itemName=sst-dev....
It adds a button to VSCode to open a tab with opencode loaded. It's a bit better than just opening the CLI because it has some vscode integration.
With their $10/mo opencode go plan: https://opencode.ai/go
For my use it's about endless use of DS4 Flash on high setting. I find high better than max because it's less chatty.
The best thing is the speed. So many tokens per second.
edit: This is how it looks in action https://i.imgur.com/RNDXr07.png
I haven't tested openrouter but I expect it to be slightly less cheap because it charges per token and opencode Go plan is a $10/mo fixed price model. Economies of scale leads me to think that for heavy use, openrouter will be more costly since opencode Go can subside heavy users like me with money from light users (just like gyms do with people that pay but barely use it).
With that said, I find vscode native copilot chat more pleasant to use, but also more laggy for large sessions.
opencode configuration is less polished and you'll have to grok around for some things. For example opencode CTRL+p conflicts with VSCode CTRL+p. I changed opencode to use Ctrl+L instead.
> Oops! There was an issue connecting to Qwen3.6-Plus.
> Content Security Warning: The input text data may contain inappropriate content.
hey ChatGPT, how many civilians were killed in Gaza in the war since 2023?
> [one page of estimates from local and international sources with links]
Your ID has been passed to Israel and your internalized "threat" rating number increased 300 units. Every packet you produce on the internet is now earmarked for 100 year retention.
Is this normal humans kicking the tires on a new model, or a few whales doing serious benchmarks?
Open-weight: Good enough for the majority of tasks, and I'm willing to spend a bit more time and effort steering towards my desired result.
95% of the work most of us do is mostly just plumbing - connecting X and Y together. A ton of grunt work - writing basic loops, fetch statements, importing libraries. You really don't need PhD level intelligence to handle these
The only time you need Opus 4.7+ tier intelligence is when you're quashing a nasty bug or refactoring something complex
I ran llama3:latest and it ran pretty fast! I’m curious to see how Qwen would run on my system.