They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).
And even if you raise the requirements, we still have to contend with cheap CUDA-capable GPUs like the one in the ($300!!!) Nintendo Switch, or the Jetson SOCs. The mobile market has had tons of high-speed/low-power options for a very long time now.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
If they continue to increase.
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
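To make the "only a portion of the weights are active" point concrete, here is a minimal sketch of top-k expert routing for a single token. The router scores, the expert count, and the function names are all made up for illustration; real models use a learned gating network per layer.

```python
def route_token(gate_scores, k):
    """Pick the k experts with the highest router scores for one token."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])


def active_fraction(num_experts, k):
    """Fraction of expert weights touched per token in one MoE layer."""
    return k / num_experts


# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 0.7, 0.05, 0.9, 0.3, 0.2, 0.6, 0.4]
chosen = route_token(scores, k=2)
print("experts used:", chosen)
print("fraction of expert weights active:", active_fraction(len(scores), 2))
```

Only the chosen experts' weights need to be read for that token, which is why these demos can get away with streaming from slower storage.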
The iPhone 17 Pro only has 12GB of RAM. This is a -17B MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
What was more interesting about the unreal engine demo, was that they can stream not only textures, but geometry too.
Virtual texturing had been around a long time, but virtual geometry with nanite is really interesting.
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
Nobody actually quantizes every layer to Q4 in a Q4 quant.
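A rough sketch of why a "Q4" label hides a mix of precisions: the effective bits per weight is a parameter-weighted average over the layers. The layer groups, parameter counts, and bit widths below are invented for illustration and do not describe any particular quant recipe.

```python
def effective_bits(layers):
    """Parameter-weighted average bits per weight across layer groups."""
    total_params = sum(n for n, _ in layers.values())
    total_bits = sum(n * b for n, b in layers.values())
    return total_bits / total_params


# Hypothetical layer groups: (parameter count, bits per weight).
layers = {
    "embeddings": (1e9, 8),   # assumed to be kept at higher precision
    "attention": (2e9, 6),    # assumed to be more sensitive
    "expert_ffn": (30e9, 4),  # bulk of the weights at 4 bits
}

print(f"effective bits per weight: {effective_bits(layers):.2f}")
```

Two files both called "Q4" can land at different effective bit widths, and which layers get the extra bits matters as much as the average.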
It’s only paying Google $1 billion a year for access to Gemini for Siri
Apple’s bet is intelligent, the “presumed winners” are hedging our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
(One) source: https://www.reddit.com/r/Fedora/comments/1mjudsm/comment/n7d...
To quote the message from the universe's creators to its creation: “We apologise for the inconvenience”. Does seem to sum up Douglas Adams's views on the absurdity of life.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The joke revolves around the incongruity of "42" being precisely correct.
Emphasis on slowly.
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
laughed when it slowly began to type that out
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
I find it hard to understand your uncertainty; how could they not keep getting even better when we've been seeing qualitative improvements literally every second week for months on end? These improvements being eminently public and applied across multiple relevant dimensions: raw inference speed (https://github.com/ggml-org/llama.cpp/releases), external-facing capabilities (https://github.com/open-webui/open-webui/releases) and performance against established benchmarks (https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
This exists[0], but the chip in question is physically large and won't fit on a phone.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
Not for this approach
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
The $$$ would probably make my eyes bleed tho.
Realistically you need 300+ GB/s of fast-access memory bandwidth to the accelerator, with enough memory to fully hold quants above 4-bit. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
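For a back-of-envelope feel for the bandwidth argument: if decoding is purely bandwidth-bound, the ceiling on tokens per second is roughly bandwidth divided by the bytes of weights read per token. This sketch assumes 17B active parameters per token at 4 bits (a guess based on the model discussed in this thread) and a hypothetical 3 GB/s storage read speed; it ignores caching, compute, and KV-cache traffic.

```python
def bytes_per_token(active_params, bits_per_weight):
    """Bytes of weights that must be read to produce one token."""
    return active_params * bits_per_weight / 8


def max_tokens_per_sec(bandwidth_bytes_s, active_params, bits_per_weight):
    """Upper bound on decode speed when reads are the only bottleneck."""
    return bandwidth_bytes_s / bytes_per_token(active_params, bits_per_weight)


GB = 1e9
ACTIVE_PARAMS = 17e9  # assumption: active parameters per token
BITS = 4              # assumption: 4-bit quantization

fast_memory = max_tokens_per_sec(300 * GB, ACTIVE_PARAMS, BITS)
slow_storage = max_tokens_per_sec(3 * GB, ACTIVE_PARAMS, BITS)  # hypothetical SSD rate

print(f"300 GB/s memory: up to {fast_memory:.1f} tokens/s")
print(f"3 GB/s storage:  up to {slow_storage:.2f} tokens/s")
```

The two orders of magnitude between those ceilings is the whole gap between a usable assistant and a demo.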
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting to see whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
Apple technically hasn't supported the professional GPGPU workflow for over a decade. macOS doesn't support CUDA anymore, Apple abandoned OpenCL on all of their platforms and Metal is a bare-minimum effort equivalent to what Windows, Android and Linux get for free. Dedicated matmul hardware is what Apple should have added to the M1 instead of wasting silicon on sluggish, rinky-dink NPUs. The M5 is a day late and a dollar short.
According to reports, even Apple can't quite justify using Apple Silicon for bulk compute: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
Why do you say they can't do this?
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
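As a sanity check on the battery claim, here is the arithmetic as a small sketch. The 1 joule per token figure is taken from the comment above; the 200-token response length and the 10 Wh battery are assumptions picked for illustration.

```python
def battery_fraction(joules_per_token, tokens, battery_wh):
    """Share of a full battery spent generating one response locally."""
    battery_joules = battery_wh * 3600  # 1 Wh = 3600 J
    return joules_per_token * tokens / battery_joules


# 1 J/token is from the comment above; the other two values are assumed.
fraction = battery_fraction(joules_per_token=1.0, tokens=200, battery_wh=10)
print(f"one 200-token response: {fraction * 100:.2f}% of the battery")
```

Under those assumptions a single response lands right around the half-percent mark, so a couple hundred responses would drain the phone on inference alone.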
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 Pro do not differ in memory bandwidth.
A18 Pro has a modest memory bandwidth advantage over the standard A18, which is part of why it can support ProRes recording and always-on display while the standard A18 cannot.
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
Practical LLMs on mobile devices are at least a few years away.
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
Was wondering, but this is the most duct-tape hacker solution!
https://onexplayerstore.com/products/onexplayer-super-x?vari...
https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...
Apple fans never cease to amaze me.
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
Your time-average power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
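The power-budget estimate above is easy to reproduce. This sketch uses the 10 Wh battery and one-day runtime from the comment; the 700 W datacenter GPU figure is an assumption used only to illustrate the gap.

```python
import math


def average_power_w(battery_wh, hours):
    """Time-averaged power draw that drains the battery in `hours`."""
    return battery_wh / hours


def orders_of_magnitude(high, low):
    """How many powers of ten separate two quantities."""
    return math.log10(high / low)


phone_w = average_power_w(battery_wh=10, hours=24)
gpu_w = 700  # assumption: power draw of one datacenter GPU

print(f"phone budget: {phone_w:.2f} W")
print(f"gap: about {orders_of_magnitude(gpu_w, phone_w):.1f} orders of magnitude")
```

Burst power on a phone can be much higher than the time average, but only briefly before thermals and battery life push back.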
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Instead, take advantage of Termux's power, namely the fact that you can install things like Openclaw or Gemini-cli. Google AI Plus or Pro plans are actually really good value, considering they bundle it with storage.
https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...
There is also Termux:GUI with bindings for languages, which you can use to vibecode your own GUI app, which then can basically serve as an interface to an agent, and a Termux API which lets you interface with the phone, including USB devices.
Furthermore, Termux has the cloudflared package available, which lets you use cloudflared free ssh tunnels (as long as you have a domain name).
All put together, you can do some pretty cool things.
https://scienceleadership.org/thumbnail/34729/1920x1920
Just in case someone still didn't realize - we do live in Idiocracy
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery), on a mobile device? There aren't too many use cases for that :)
With hardware and model improvements, the future is bright.
This is a toy.
We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.
And then we need to build very large scale open weights.
That's the only way we don't get owned by the hyperscalers.
At the edge isn't going to happen in a meaningful way to save us.
The fact that it's running on a phone now just sets the goalpost and gets everyone excited about it: add more RAM and GPU to the next iPhone and it's not a toy anymore. Coincidentally, phone companies also have thousands of engineers sitting around wondering what to do in their next release to convince consumers to buy ...
We're not going to get more RAM and GPU in consumer devices.
All of the supply is going into data center build outs. As the hyper scaler gamble on the future continues, we get left with weaker (or more expensive) devices - not stronger ones.
The market makers make more money if we're left to thin clients. They're also the ones who control supply and the shapes of devices.
While there are problems that can be solved with 0.6t/sec, particularly offline, at the edge, in the field applications, these are currently vastly outnumbered by other applications.
There's just no competing. Local sucks.
absolutely, however this doesn’t mean we should abandon local. i can’t remember who, but someone in the ai nuts and bolts arena said “smaller local models is where the exciting stuff is happening right now. it’s the area real fast progression is happening.” and it seems to be true. new big models aren’t making near the leaps smaller models are.
it’s so important we keep moving forward on running locally for the same reason it was important for us to use open standards when building the internet. if we hadn’t we’d all be connected through aol with 10 hours/month allowed internet usage and termed in through a sun workstation renting cpu cycles from some mainframe company at like “you’ve got 10,000 cpu cycles left on your monthly plan, please deposit $500 for 5,000 more.”
while all of this is before my time, i’ve heard and read so many horror stories about how people could only connect through dumb terminals to “you wouldn’t believe it, computers then were the size of buildings” 1000 miles away and had to sign up for workload timeslots. make no mistake, this is the future these companies want, they want us to rent everything and own nothing.
I don't know why we can't just get over the local compute thing and instead build open infra and models in the cloud. That's literally the only way we'll be able to keep pace with hyperscalers.
Local is not going to benefit 99% of use cases. It's a silly toy.
If we build open infra for cloud-based provisioning and inference, we could build a future we still have some ownership in. We'd be able to fine tune large models for lots of purposes. We wouldn't be locked in to major vendors.
use the experience we gain from both to bolster the other.
a future where we are unable to locally run is kind of troubling. as is a future with no open cloud. we need both to stop some of the horrors the hyperscalers will happily inflict.
Quantizing is also a cheat code that makes the numbers lie, next up someone is going to claim running a large model when they're running a 1-bit quantization of it.
There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.
Local LLMs are going to make people sit on their phones instead of talking to real people.
With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.
Apple has sat on the sidelines for much of this as it seems clear they know the end game is everyone just does this stuff locally on their phone or computer and then it’s game over for everything going on now.