:( you paid a professional pc builder and you weren't told this?
edit: Hm, finding mixed information online on whether that's still supported or not. Apparently it was removed in workstation GPUs.
At the time he put this rig together, there weren't a lot of open-weight LLMs that could run well on 6x48=288 GB, so it probably wasn't a huge loss. There still aren't, really.
Right now I'm in the process of cramming Blackwell cards into an old DDR4-based Milan server, where the important thing is to be able to run large models at all. The GPU fans alone burn over 400 watts at full throttle.
The server is going to live in the garage, so I'm not that concerned with noise. But I had no idea what to expect when I flipped the switch for the first time. It sounds like something out of the Book of Revelation. No way, no how could something like this be used in an inhabited area.
There are no specs in this blog post regarding CPU/motherboard choice, but if you go with Threadripper Pro, those have had 128 PCIe lanes for some time now, so running all GPUs at full speed shouldn't be a problem.
They did not. That's a mining rig not a workstation. It's visible from the photo and the chart showing multiple failures over a short period of time including the risers -- which are visibly very low quality -- failing twice.
You have 50K, you call a real expert like Puget Systems or Digital Storm.
Frankly that's something a landlord should provide. And there's insurance against losses from electrical issues.
Genuine question; would anyone here recommend any specific motherboard to best utilize these cards?
I myself run with gigabyte trx40 aorus xtreme, but since it's regular threadripper (not pro) with 4 GPUs 2 of them will run at x16 and two of them at x8 speeds
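For anyone wanting to sanity-check the lane math in the comments above, here is a rough sketch. The 128-lane and x16/x16/x8/x8 figures come from the thread; the per-lane rates are the standard PCIe numbers, and the function names are just made up for illustration. It assumes PCIe 4.0 for the bandwidth example and ignores lanes that a real board reserves for NVMe, chipset links and so on, so treat the slot count as an upper bound.

```python
# Per-lane transfer rate in GT/s by PCIe generation (gens 3-5 use 128b/130b encoding).
GT_PER_LANE = {3: 8, 4: 16, 5: 32}

def link_gb_per_s(gen, lanes):
    """Approximate one-direction bandwidth of a PCIe link in GB/s."""
    return GT_PER_LANE[gen] * lanes * (128 / 130) / 8

def max_x16_slots(cpu_lanes):
    """Upper bound on full x16 links, ignoring lanes reserved for other devices."""
    return cpu_lanes // 16

def lanes_used(widths):
    """Total lanes consumed by a list of link widths."""
    return sum(widths)

# Threadripper Pro figure from the thread: 128 lanes.
print(max_x16_slots(128))            # 8
# The 4-GPU layout described above: two cards at x16, two at x8.
print(lanes_used([16, 16, 8, 8]))    # 48
# ASSUMPTION: PCIe 4.0 links; x8 gets half the bandwidth of x16.
print(round(link_gb_per_s(4, 16), 1), round(link_gb_per_s(4, 8), 1))
```

Whether x8 actually hurts depends on the workload: single-GPU inference barely touches the bus after the model is loaded, while multi-GPU training or tensor parallelism moves a lot more data between cards.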
AI is cool but it's not going to have all the good and bad experiences that humans have had with different motherboards.
"AI is cool but it's not going to have all the good and bad experiences that humans have had with different motherboards."
AI will have more access to experiences than you'll find here.
"I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis "
Maybe ai companies today have some bounty program?
(I would assume they haven't made a lot of $ off of this, if nothing else because they've only just put out that post and demo. They do seem to have produced a model that doesn't sound very LLM-y to my ear, though it also seems rather weak for its size.)
Cynical take: They made an LLM that can bypass existing AI slop detectors.
Realistic take: They found a research problem they found interesting, dumped a bunch of capital and sweat equity into it, and (claimed to have, at least) found a solution. Neat!
Risking their own money and time instead of leveraging a PowerPoint to hire other peoples' labor with other peoples' money. I can respect that.
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Compared to just paying for tokens, the tokens would have been much cheaper and much, much faster. I've been running Gemma4:31b, Qwen3.5 and 3.6 locally to solve AMC 8/10 math questions, and it's about 10-100x slower than just doing it online. When I tried it with ChatGPT late last year, it took about one night and $25 to solve about 1000 questions. Using my RTX 6000 and M3 Ultra with Gemma4:31b on both, drawing about 800 watts (600 for the RTX and 200 for the M3 Ultra), it answered about 40 questions in 7 hours, and I haven't checked how good the answers are yet.
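To put the numbers in that comment side by side, here is a quick back-of-the-envelope sketch. The wattage, hours, question counts and the $25 figure are the ones quoted above; the electricity price is an assumption (the thread doesn't give one), and hardware cost is left out entirely.

```python
# Figures quoted in the comment above.
LOCAL_POWER_W = 800        # 600 W (RTX 6000) + 200 W (M3 Ultra)
LOCAL_HOURS = 7
LOCAL_QUESTIONS = 40
CLOUD_COST_USD = 25
CLOUD_QUESTIONS = 1000

# ASSUMPTION: electricity price is not stated in the thread.
PRICE_PER_KWH_USD = 0.15

energy_kwh = LOCAL_POWER_W * LOCAL_HOURS / 1000            # total energy for the run
kwh_per_question = energy_kwh / LOCAL_QUESTIONS
minutes_per_question = LOCAL_HOURS * 60 / LOCAL_QUESTIONS
local_usd_per_question = kwh_per_question * PRICE_PER_KWH_USD   # electricity only
cloud_usd_per_question = CLOUD_COST_USD / CLOUD_QUESTIONS

print(f"local: {energy_kwh:.1f} kWh total, {kwh_per_question:.2f} kWh and "
      f"{minutes_per_question:.1f} min per question")
print(f"electricity per question: ${local_usd_per_question:.3f} "
      f"vs cloud: ${cloud_usd_per_question:.3f}")
```

Under that assumed rate, the electricity alone lands in the same ballpark as the cloud price per question, before counting the hardware or the roughly 10 minutes of wall-clock time per answer.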
At the very least I'm going to try to sell my M3 Ultra if I can find a reliable place to sell it without getting ripped off by scammers.
Yes this is exactly what I'm doing. I isolated the actual math question, and then sent it to my two servers to process and that's what's taking 10m+ to return. I'm asking them to solve the question and return the full answer along with their steps. I care about correctness so taking time is okay but I can't use 10m per solution.
"The DGX GB200 NVL72 AI server costs approximately $3 million per unit. This system includes 72 Blackwell GPUs and 36 Grace CPUs, making it one of the most powerful AI servers available."
The search assist actually credited a source used with: https://www.tweaktown.com/news/98292/nvidias-new-gb200-super...
That $25k spend by GGGP seems like nothing in comparison. That's ~1/3 of one chip in that cabinet. Good gawd I'm old and out of touch with modern AI data centers.
We've been in a centralised phase for longer than usual - first cloud everything, then AI - but at some point in the next decade prices will crash and a market will appear for personal, local intelligence.
There are bigger data centers than Colossus 1 around too.
There is a reason NVidia is the most valuable company on the planet.
https://en.wikipedia.org/wiki/Colossus_(supercomputer)#Curre...
A better way of putting it is that you can run plenty of things on a single ordinary system, but you may be disappointed at the performance. Generally, you can't expect inference to be as quick as with cloud for SOTA-like models. You have to run smaller models for quick replies, and large models with a lot of real-world knowledge for less time-critical inference, possibly batching many requests simultaneously to improve throughput.
They are selling on eBay for over $20k, used.
The 0-reputation account in Spain selling an M3U 512GB for $4200 is 100% fraud.
Computers depreciate because they are obviously being supplanted by newer better models—until they become vintage and then move into collectibles.
Doing this particular one is definitely expecting the market squeeze to continue. "Worst case" is back to more "normal" depreciation. Where I'd expect to only be able to recoup more like 18k. But... if you look at GPU prices the last 3 years... it's not a crazy assumption that it won't drop that fast.
iPhone example since those are easiest to find in quantity: new iPhone 16 Pro Max for $1200, Gazelle would want $866 for "excellent" condition. Lost ~28% for one-model-back. iPhone 15 Pro Max, though: excellent priced at $667 here, only down another 23%, and gives you basically half-priced-upgrade if you can sell it for that and roll into the newest.
So as a rough estimate at today's value-holding, to be never more than one model old you'd be out $3600 for three new phones, getting $1732 of that back, for a net of $1868 (with a $334-per-year incremental cost of upgrade).
For never-more-than-two-models-back you'd be out $2400, getting back $866, for net $1534 spend, with a $167 incremental per-year upgrade cost once you buy the first one. Pretty good if you keep the phone in excellent condition and are happy to budget a bit over $10/month to be on an every-two-year upgrade train.
Well, you'd also eat the tax...
https://buy.gazelle.com/products/iphone-16-pro-max-256gb-unl...
https://buy.gazelle.com/products/iphone-15-pro-max-256gb-unl...
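The percentage drops and the upgrade-every-model case from the iPhone example above can be reproduced with a few lines. This is only a sketch using the three prices quoted in that comment; it ignores sales tax, selling fees and price changes between generations.

```python
# Prices quoted in the comment above.
NEW_PRICE = 1200         # new iPhone 16 Pro Max
RESALE_ONE_BACK = 866    # "excellent" condition, one model back
RESALE_TWO_BACK = 667    # "excellent" condition, two models back

def pct_drop(before, after):
    """Percentage of value lost going from `before` to `after`."""
    return 100 * (before - after) / before

first_drop = pct_drop(NEW_PRICE, RESALE_ONE_BACK)          # about 27.8
second_drop = pct_drop(RESALE_ONE_BACK, RESALE_TWO_BACK)   # about 23.0

# Upgrade every model: buy three phones, sell the first two one model back.
phones_bought = 3
spent = phones_bought * NEW_PRICE
recovered = (phones_bought - 1) * RESALE_ONE_BACK
net_spend = spent - recovered
per_upgrade = NEW_PRICE - RESALE_ONE_BACK   # incremental cost of each yearly upgrade

print(f"drops: {first_drop:.1f}% then {second_drop:.1f}%")
print(f"spent ${spent}, recovered ${recovered}, net ${net_spend}, "
      f"${per_upgrade} per upgrade")
```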
I would absolutely not count on that, if and when it drops it will drop hard.
We aren't exactly in "standard" times and haven't been for quite a while. Even five year old graphics cards are worth more today than they were just a year ago. Things will obviously depreciate at some point, but you gotta throw your existing notions of how quickly and how much hardware will depreciate out the window. There's just been too much money dumped into AI for a "well I guess this won't ever pan out, let's dump all this hardware to recoup our costs" moment to happen and tank the price of everything suddenly IMO.
And that's not even getting into the other geopolitical stuff going on right now. Strange times.
If you are able to tie up $25k for a few years just for shiggles, you clearly are able to make do fine without that money and if lost it would be at worst annoying, not catastrophic.
If some have more than one layer it could be fewer, but that's the order of magnitude.
Remember: one year turned out to be a gigantic leap in terms of quality of results and innovation in the AI space. Agents weren't really a thing and vibe coding hadn't even been coined as a term, because the top-notch tools at the time were lousy, with Lovable being the frontrunner with its (in my view) sorry Tailwind recombination tool shaming AI to do the work.
Then fall 2025 hit us, New Year's Eve came, and suddenly there was such a massive surge of innovation and competition, with ChatGPT Codex suddenly showing up.
Remember: one year ago many now commonly used tools weren't yet available like Nano Banana or Codex.
"The 25k are so vast" - Yes, and no. For example, if the machine is bought for business usage I can deduct the costs from taxes. This roughly amounts to 50% of the financial burden.
So I jokingly say that I pay only half the price for my Apple business machines. And yes, I am strict in this regard. Business means business. No private emails etc., nothing on my company computers.
Maybe there are other options as well to reduce the financial expenses the dude mentions, but it doesn't seem so.
I would also go for leasing; this way the monthly payments can already be deducted and I don't need to buy and maybe resell the machine.
Apple is a luxury good. Without business usage or at least partly using it for business as well as private (mixed usage in tax reports) I wouldn't buy the devices or think twice.
Apple under Cook evolved into a Gucci-like luxury brand that is more and more a rip-off than quality delivered, especially considering the latest OS updates for Mac, iOS and iPad. Apple is a mess, happily following in Microsoft Windows' footsteps, because the CEO is, as has been correctly assessed, no product guy.
But I'll stop my rant here.
Always try to use tax deduction as leverage for your computer expenses. Every citizen should invest in basic knowledge about that.
Even a 10-20% professional usage for work (mixed usage) gives you a noticeable advantage over normal pay.
It's not financially a good idea: renting really does beat owning, and cloud beats both if you're only running inference on these machines. But I'm not just doing inference, and as a thing I can do silly stuff on to learn, it's hard to beat!
I do still use Vast and Runpod for things too, but it’s much nicer to test a fine tuning run here to make sure I’m in the ballpark
I also did literally say “It's not financially a good idea, renting is better than owning” so I’m confused why I have two people telling me that
Also it’s just far more fun to play with something tangible to me :)
It’s also annoying because then I need to make sure my little “lab” setup is well automated, and I’m lazy :)
Also, I literally said “It's not financially a good idea” so I’m confused why you think I don’t know that.
That money could have been spent on way more bang/buck performance in the form of a set of 4 graphics cards.
Also I would probably put the odds 70:30 that Apple marketing is astroturfing on HN from the amount of posts about running llms on Macbooks, because in reality, the inference speed of any decent llm is unusable on a Macbook despite the ability to fit it into RAM.
If you like having a box with 8-12 fans blasting hot air and noise into your office all day, nobody's stopping you.
Point being, once the M5 Ultra is available, I suspect a lot of people will get very serious about making Macs work with RTX GPUs because that will yield an inference platform with a good bang:buck ratio. If so, you may find that your existing hardware is more powerful than it seems today. And it may be a lot more expensive to replace later if you sell it now.
I saw your heat comments about the RTX 6000 Pro as well. I bought a few of them recently and I'm running 2 of them in a 2U case in a colo. You need a lot of active airflow to keep them cool. Mine range from 23 C to 80 C.
After my last run, I'm going to wait for the new case I ordered to come in and cannibalize my kid's PC that we built beginning of this year to form an entirely separate computer. And then figure out better ways to deal with the heat, especially with summer coming up. I'll have to play around with undervolting and running vents directly outside my house to see if that helps.
Used to overclock back in the day during winter with an intake duct rigged to suck in outside air, best thing about -30c :)
But the trend here is interesting. I think by 2030 you'll be able to buy fairly cheap hardware that is currently $10k+. I don't know what this does to the trillions invested in AI data centers because the next NVidia architecture after Blackwell will essentially halve the value of purchased cards overnight.
I'm not convinced Apple has yet pivoted the Mac Studio line towards this market and the expected M5 Ultras in Q3 2026 will likely be an incremental improvement rather than big leap forward but I'd like to be proven wrong.
I feel that the open weight models pale in comparison to the frontier models, and I believe that if the gap closes quickly, the open weight vendors will stop releasing them for free.
Higher radiation, space insulations, etc.
Underwater data centers provide a lot of the same benefits and can (much more) easily be hauled to the surface
The AI space is moving so fast that it is hard to know which conclusions are stable. After all the discussion around local models, is the practical conclusion still that API/frontier providers have a huge structural advantage because of datacenter hardware, high utilization, batching, optimized inference stacks, and perhaps strategic pricing?
In a comparison like this, a $25k local setup versus buying tokens, what multiple are we really talking about? 10x? 100x? Or is it too workload-dependent to reduce to a single number?
Has someone written a good breakdown that separates true infrastructure efficiency from temporary underpricing/subsidy? The part I'm trying to understand is less ideological (local vs. cloud) and more basic economics.
An RTX 6000 pro Blackwell is a pretty good card
This is a real problem and why I've just about given up on ebay or fb marketplace, esp for computers. If you are in Canada though sellit9.com is a great solution to having to deal with sketchy buyers.
YMMV but between your nearest PD office and Library, you should be able to use one or the other for your exchange of goods/money. The biggest thing I've sold is a mid-range video card during late covid (I managed to get a better one via newegg shuffle) so I sold the old one (RX 5700XT -> RTX 2080) to make up the difference a bit. I just did the exchange at the Starbucks near me for that.
The buyer doesn't know who the seller is, and vice-versa... the level of trust you can bear depends on how much you're willing to lose. My advice is only in that there are safe venues you can use to make such an exchange.
Police "safe trade zones" are basically a parking space outside a police station, with a sign.
If it is an electronic payment, I'm not sure how completing the transaction in front of a police station will help any. Well, it will help the buyer to see it working, but the seller gets no additional protection besides seeing "a person."
The problem is that while one of these GPUs is a huge improvement over a laptop or a single 3090, you very quickly wish you had more. I would buy a second one, but I did the math and realized that with the current crop of models, 2 Blackwells doesn't buy me any new capability that I didn't have with one. So I would need a 3rd one. And when I buy a 3rd one I will feel like I want to run a higher quant, so then I will want a 4th.
Also, the 4-bit quants of MiniMax 2.7 will run at 100 tps or so with two cards, which is pretty decent. It doesn't go any faster at all with 4 GPUs from what I've seen, so if you don't actively need 384 GB of VRAM, 2x RTX6000 is a good place to be.
It's going to be a non-trivial haircut. This stuff depreciates pretty fast.
Of course, this is an unusual state of affairs; I see my GPU purchase as consumption, not investment.
The two major drivers of inference costs are GPUs and electricity. You can't get cheaper GPUs, but you can make existing GPUs not sit idle, and you do that by utilizing them 24/7, processing user B's request when user A is thinking, and handling many requests in parallel, neither of which you can do as an individual. You can get cheaper electricity... by moving, and it's much easier to move your AI workload than to move yourself.
This is a completely different dynamic than renting houses or apartments, as you can't really rent out the same house to different people at different times of day.
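The utilization argument above can be made concrete with a toy amortization. This is only a sketch under loud assumptions: the $25k figure is the one mentioned elsewhere in the thread, while the 3-year lifetime and the two utilization levels are invented for illustration, and electricity, resale value and batching gains are all ignored.

```python
HOURS_PER_YEAR = 365 * 24

def cost_per_busy_hour(hardware_usd, lifetime_years, utilization):
    """Hardware cost spread over only the hours the GPUs actually do work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hardware_usd / (lifetime_years * HOURS_PER_YEAR * utilization)

# ASSUMPTIONS: 3-year useful life; a hobbyist keeps the box busy 5% of the time,
# a provider juggling many users keeps it busy 80% of the time.
hobbyist = cost_per_busy_hour(25_000, 3, 0.05)
provider = cost_per_busy_hour(25_000, 3, 0.80)

print(f"hobbyist: ${hobbyist:.2f} per busy hour")
print(f"provider: ${provider:.2f} per busy hour")
print(f"ratio: {hobbyist / provider:.0f}x")
```

The ratio is just the ratio of the two utilizations, which is the point: the same hardware gets cheaper per unit of work purely by not sitting idle.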
I'm very pro local models, but not to have parity with SoTA frontier models. Just contextually trained small models doing smaller specific tasks.
Trying to run bigger LLMs for an individual user to do big tasks is not going to be a good time.
This is the type of place one might be “waiting for the other shoe to drop.” Which carries a variety of potential meanings in this moment of AI.
Tangentially related: Mack and the boys lived in the “Palace Flophouse and Grill” in Cannery Row.
I suppose I must have looked up flophouse when reading all the Steinbeck I could get my hands on and it’s stuck w me.
You'll realize real quick it's not profitable. You can't just say things you don't like to hear are unsubstantiated without verifying.
Not to mention, subscriptions.. $2mm in GPUs being given out for 5 hrs a day at a cost of $200 a month.
I could easily say that everyone who says it's profitable is making unsubstantiated claims lol.
These 1T param models running at <$3.00 per 1mm are certainly not profitable.
Whether the companies as a whole are destined to be profitable, or worth their valuations, is a very different question. The only people who can truly answer that have time machines.
Yes, once you have modeled the problem correctly and you know all the input parameters. This is not that: Session# * tps * 86400 (secs in a day) * 30 days.
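For reference, this is what the back-of-the-envelope formula quoted above computes when written out. The function name and the example inputs are hypothetical; the point of the sketch is to show how much the formula leaves out (idle time, batching, prompt vs. output tokens, hardware and power cost), not to endorse it.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30

def naive_monthly_tokens(sessions, tokens_per_second):
    """Sessions * tps * 86400 * 30, assuming every session generates nonstop."""
    return sessions * tokens_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH

# Hypothetical inputs: one session streaming 100 tokens/s around the clock.
print(naive_monthly_tokens(1, 100))   # 259200000
```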
I don't think there is enough public information to check Anthropic's claims regarding inference profitability. It depends not just on unknown technical factors but also on agreements they have with other companies.
Shouldn't we compare the API pricing, where we pay per token? The whole point of local inference is that we don't have any restrictions regarding product use or time limits, so it would only be fair if we compare it to a plan that offers the same. And even that is only a first approximation, because the commercial models are usually much more capable than the open weight models.
> if I can find a reliable place to sell it without getting ripped off by scammers.
I don't follow this last part. What is the scam they try to run?
For something listed at $25k I would not list on eBay at all. eBay corporate will pocket $3400 in fees and will also dock you local taxes on the $25k.
$48000 is equal to 12000 hours of renting an h100, which is about as long as you’d spend at your job for 6 years!
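Unpacking the arithmetic in that comparison: the $48,000 and 12,000 hours are from the comment, which implies a rental rate of $4/hour. The 2,000 working hours per year (40 hours for 50 weeks) is my assumption about what "6 years at your job" means.

```python
BUDGET_USD = 48_000
RENTAL_HOURS = 12_000

implied_rate = BUDGET_USD / RENTAL_HOURS        # dollars per H100-hour

# ASSUMPTION: a working year is 40 hours/week for 50 weeks.
WORK_HOURS_PER_YEAR = 40 * 50
working_years = RENTAL_HOURS / WORK_HOURS_PER_YEAR

# The same hours if the rented GPU ran around the clock instead.
continuous_days = RENTAL_HOURS / 24

print(f"${implied_rate:.2f}/hour, {working_years:.0f} working years, "
      f"{continuous_days:.0f} days of 24/7 use")
```

So the "6 years" only holds if you use the GPU during working hours; run it nonstop and the same budget lasts about 500 days.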
The Ada has a memory bandwidth of 960GB/s. The Pro has 1.8TB/s and about 40-50% better performance so is at least equivalent in processing power, much better in memory bandwidth (important for inference) and can hold larger models on a single card.
I've considered buying a rig with 1-2 6000 Pros for similar reasons but I want to see what happens with this year's Mac Studios with a likely M5 Ultra. Macs have a shared memory architecture whereas NVidia segments the market based on max memory where the biggest consumer card (RTX 5090) has 32GB of VRAM but still excellent memory bandwidth (1.8TB/s). A RTX 5090 rig will still trounce a Mac Studio seems to be the conventional wisdom. Despite being able to hold larger models and being able to chain Mac Studios on TB5, their lower memory bandwidth (~900GB/s) and lower overall GFLOPS mean they still come out behind.
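Why memory bandwidth matters so much for inference can be sketched with a common rule of thumb: when generating one token at a time, a dense model has to stream roughly all of its weights from memory per token, so bandwidth divided by model size gives a rough ceiling on tokens per second. The bandwidth figures are the ones quoted above; the 15.5 GB model size (think ~31B parameters at 4 bits) is a hypothetical example, and real throughput will be lower and differs for MoE models, batching and long contexts.

```python
def decode_tps_ceiling(bandwidth_gb_s, model_gb):
    """Rough upper bound on single-stream tokens/s for a dense model."""
    return bandwidth_gb_s / model_gb

# Bandwidth figures quoted in the thread (GB/s).
BANDWIDTH = {
    "RTX 6000 Ada": 960,
    "RTX 6000 Pro / RTX 5090": 1800,
    "Mac Studio (approx.)": 900,
}

# ASSUMPTION: hypothetical dense model occupying 15.5 GB of weights.
MODEL_GB = 15.5

for name, bw in BANDWIDTH.items():
    print(f"{name}: at most ~{decode_tps_ceiling(bw, MODEL_GB):.0f} tokens/s")
```

The absolute numbers are not predictions, but the ratios are informative: twice the bandwidth is roughly twice the ceiling, regardless of how much memory the machine has.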
That being said, the current Mac Studios are relatively long in the tooth, being released in 2024.
I'm still not sure any of this is really worth it because things are still changing so fast. I think there's a decent chance of a number of large AI companies going bust in the next 2-3 years such that you'll be able to buy enterprise AI hardware at cents on the dollar, a bit like how Google bought data centers in the post-dot-com crash.
But anyway, nowadays I'd be looking at the RTX 6000 Pro as the sweet spot, having anywhere from 1-4 in a single server.
The electrical issues the author mentions are interesting. I hadn't really thought about the max amperage on a residential circuit. In a DC, these would typically operate on three phase power and much higher overall amperage. I wonder if there's a device you can buy that can combine multiple residential circuits into a single power source for a server this power hungry?
I don't think anything compares to the nVidia chips at all.
Is this the best general-purpose choice as of 2026 with $50k for training, fine-tuning and running large open models?
Edit: I now see the author was in an apartment and couldn't do this, so I concede this is not responsive here.
While I'm skeptical that there is much of a moat, at least for the large players, it should at least hopefully set rosmine up for the next job :)
It does seem to fix the current biggest issues with using LLMs for writing at various publishers. If you're The Economist, you have a very specific house style and you have a decent corpus of articles written in that style. At least on my reading of it, rosmine can use DFT to get a model to closely match its outputs, in terms of the language quirks that are generated, to that of the corpus it is fine tuned on. ie it will very much match the house style, particularly as it is used in writing, vs giving a system prompt to an LLM that has some Economist articles in its vast training set, and telling it to write in that style- it will do an ok job, but still exhibit LLM language quirks despite itself. Even if you feed it the specific "style guide" that they give their authors, I dare say the reality of their writing is the best place to learn, and it sounds like DFT can ground the writing of a model in a specific corpus like that.
[1]: https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
They do it well enough that it'd take really good output to beat.
If your goal is to, say, write science fiction, their reversion to classic LLM-isms is really distracting and is what makes people say at a glance that it was written by an LLM. You basically can't use them at the moment in any real "natural" long-form writing. Everyone will call "slop" pretty quickly on the current frontier models.
Rosmine's DFT paper is worth a read.
Some of it you could probably tell with statistical analysis, but actually people are far worse at judging whether content is AI generated than they think they are.
If you need to beat an AI testing tool, you need to do marginally more work than to stop people from recognising it, but not all that much.
The nature of it is that you don't "see" most of the stuff that is well done because few people want to talk about it.
Or, for a person who did have a great way to monetize the same workload they’d probably find a lot of value in reading this post.
Cloud is optimized for development velocity but its nature of high margin business eventually makes on-prem more promising
It could be too late, but it might be worth looking into tax savings if you have a business. Depreciation of an asset counts as a loss and may be deductible from your income. (I'm NOT a tax expert)
As the author notes, there are also electrical/wiring issues that cap how much compute gear you can run in a space not designed for it. I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. 15A probably can but that requires more research. Anything more than that and you're using multiple circuits, which has issues, or you need an upgraded circuit (eg 40A 240V) with all that entails (eg heavier duty cables, custom plug, etc).
During initial setup of the server I am putting together, I found that a machine with 4x Blackwell cards derated to 300W can get by on a single 120V 20A circuit. It's tight but doable. A lot depends on the power supply. I don't think it's a great idea to run 4 high-power GPUs on a single ATX-style PSU, even a beefy 1600W job.
The other questionable part is whether all four cards can temporarily spike at full power during boot, before the wattage limit is applied by the OS. Some accounts say this is possible, and if so it could shut down the party in a hurry. But I didn't see any misbehavior when I tried it.
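The circuit budgeting in the last few comments is easy to check numerically. This is a sketch, not electrical advice: it assumes the common 80% derating for continuous loads on a breaker, takes 600 W as the full-power figure for one card (the number quoted earlier in the thread) and 300 W for the derated setup, and ignores PSU efficiency losses and startup spikes. The function names are hypothetical.

```python
def continuous_watts(volts, amps, derate=0.8):
    """Usable continuous power on a circuit, assuming an 80% continuous-load limit."""
    return volts * amps * derate

def headroom_watts(volts, amps, gpu_count, gpu_watts, other_watts=0):
    """Power left over after the GPUs and the rest of the system (negative = over budget)."""
    return continuous_watts(volts, amps) - (gpu_count * gpu_watts + other_watts)

# 4 cards derated to 300 W on a 120 V / 20 A circuit: room left for CPU, fans, drives.
print(round(headroom_watts(120, 20, gpu_count=4, gpu_watts=300)))
# The same 4 cards at 600 W each: over budget from the GPUs alone.
print(round(headroom_watts(120, 20, gpu_count=4, gpu_watts=600)))
# 2 cards at 600 W on a 15 A circuit: only a small margin for everything else.
print(round(headroom_watts(120, 15, gpu_count=2, gpu_watts=600)))
```

That lines up with "tight but doable" for four derated cards, and it shows why a boot-time spike to full power on all four cards would be a problem.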
- https://www.williamangel.net/blog/2026/05/17/offline-llm-ene... - Discussion: https://news.ycombinator.com/item?id=48168198
But yes, for pure inference, the M5 Max Macbook Pros probably aren't there yet. They have other utility though of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center will currently let you buy a 64GB M5 Max MBP for under $4k, for example.
Because that wasn't what they claimed to research?
>> for inference it's definitely not worth it.
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemini, is just $400 extra.
You don't? It for sure doesn't run on my 32 GB M2 Max.
You might need that to run it with a longer context, KV cache size is a known issue with that model series.
They’ve significantly increased in price (so much for hardware depreciation…) but you can still get a modded 22GB 2080 Ti for $320, or a Mi50 32GB for ~$450 each (used to be $150 a few months ago, alas), or a Mi50 16GB for <$200 but you’d need to stack 4 of them.
There’s also some more exotic configurations but those are probably the simplest options. You won’t get the performance of an RTX Pro 6000 Blackwell of course, and the power consumption will be pretty high so it’s only worth it if you have cheap electricity. But it is possible.
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center. But then I would miss saying Hi to grumbl once in a while."
[0]: https://static.cisco-eagle.com/images/category/WireCrafters/...
[1]: https://www.edpeurope.com/wp-content/uploads/EDP-3-Compartme...
Yup, but i was assuming that he wanted to experiment building gpu rigs. For sure standard GPU servers are cheaper and easy to maintain. I have two lenovos, bought them used, already EOL.. was cheap and better than any custom gpu rig.. but i was pragmatic, because my goal was to put it in production, and not to research...
Would probably cost you $500-1000 depending on how difficult your home is.
It just scares me to own a box that is $48K in my house, especially if it breaks, or gets stolen.
No wonder gamers hate AI bros.
Nvidia’s drivers are trash for gaming on Linux and the majority of your “compatibility and framerate issues” are because you’re using a sub-par product for the job.
On top of the significantly worse software on AMD's side (literally didn't work on windows in particular - so the "performs as good on both systems" is a nonstarter, some GGUF library dependency just doesn't work/exist under AMD on windows). Had me running the AMD card on windows under WSL (not a problem with nvidia though, that ran just fine on windows-side directly).
Aaaand also the other AMD bugs, such as the pink squares display corruption that has been an active issue for my GPU in particular (7900XTX) for over a year, maybe approaching two at this point, with no fix in sight from the AMD team (barely an ack at all - not in a single set of patch notes, just a bunch of reddit discussion). Really regret spending so much on an AMD GPU.
I run hyprland, seems to be the only wayland based keyboard-forward WM that has good nvidia support (and, allegedly, supports HDR, though I haven't got this working). I heard gnome was pretty good otherwise. I was running i3 before and it also worked fine, however once I got into wanting to get streaming working, there wasn't good compatibility between i3/xorg and tools like sunshine. I believe steam streaming worked fine on it though iirc.
The only thing I miss from windows: easy streaming with sunshine/moonlight. Steam streaming works (usually heh) but it took me a couple days of fiddling to get a stream to work at all through sunshine, and it is choppy. But for local gaming, I don't miss windows at all, I'm so glad to finally have all my drives converted from NTFS to ext4.
I don't see it on the Dell site anymore, only more expensive, lesser configurations (good timing on my part?).
Yeah, I really want to put in the time to try out various games, but realistically, the whole point of getting a second computer and installing Linux was to be able to train and serve models, and switching between serving a model (that people in my house want to use at random times) and gaming didn't seem like a great choice. If I did get good results, I'd seriously consider wiping Windows 11 from my older machine (an older Alienware with a 4090), but to be honest, I'm perfectly comfortable on Windows desktop.
Personally, playing with AI models is way more fun than getting sucked into a game loop. Game loops feel like busy work hooked to an engineered dopamine drip. AI models are new frontiers and are exciting to build with, modify, lobotomize, and hack around with.
It looks like DM took a crack at it: https://deepmind.google/blog/capture-the-flag-the-emergence-...
Not everyone is hustling 24/7 like some kind of lunatic.
The high cost and power consumption are both signs of the death of Moore's law, so you are probably correct that this system will be near state of the art for some time.
I'm not saying it's worth it just that it's not such a crazy amount in comparison.
For a lot of research questions 6 GPUs is even overkill.
It’s one of the reasons I’m skeptical of the “trillion dollar supercluster” idea [0]. I think what we need is more reasonably smart people investigating medium-sized problems. A “GPU middle class” you might say.
[0] https://situational-awareness.ai/racing-to-the-trillion-doll...
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center"
I'm sure there are use cases when renting makes sense, but it can get crazy expensive really fast if you're not careful.
The main advantage, however, is that the friction of "this is going to cost me in tokens to even try" goes away. I was so much more willing to take chances and try new things on my own hardware than I would have been if I were paying API costs. I feel like this point isn't made clearly enough by those of us who run these absurd self-hosted inference systems.
Thanks for the write up, was a fun read. I spent an order of magnitude less, but I could relate to your story from beginning to end.
Epyc (Milan), 512gb ram, 4x 3090
Will you now be selling these GPUs for a profit?
Not really sure how that makes it safe but OK!
Just an assumption, though!
That issue can often be addressed fairly easily by splitting the power draw between two adjacent circuits. You can have an electrician do it permanently or temporarily DIY it with an appropriately rated extension cord. The real issue was OP was in an apartment at the time so an electrician would have been difficult. I assume they decided to just have a system integrator build it because they didn't want to figure out how to segment and route the power rails in a dual power supply system, but it's not exactly rocket science. Problems are often more due to choosing power supplies that aren't up to their claimed spec, not pre-testing them under load or using incorrect or under-spec cables.
This is actually THE standard in the US, which is fundamentally a 240V power grid but with a center tap halfway down the secondary winding of every pole transformer, which becomes your "neutral". The two ends become L1 and L2, so that L1-N is 120Vrms, L2-N is 120Vrms, and L1-L2 is 240Vrms, and this is what goes into every home.
The power outlets connected to L1 are all opposite phase to all the ones connected to L2.
Rather than bussing the two outlets together, what you can safely do is get an electrician to just wire up an outlet with L1 and L2 and voila you have a 240V outlet. This is how you get all your dryer outlets, EV charging outlets, electric stove outlets, etc.
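To make the split-phase arithmetic concrete, here is a minimal numeric sketch. It assumes ideal sine waves and a nominal 120 Vrms per leg; the sample count and variable names are just for illustration. It models L1 and L2 as two waveforms 180 degrees apart relative to neutral and takes the RMS of their difference:

```python
import math

def rms(samples):
    """Root-mean-square of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

V_RMS = 120.0                  # nominal leg-to-neutral voltage (assumed ideal)
PEAK = V_RMS * math.sqrt(2)    # peak value of a 120 Vrms sine wave
N = 1000                       # samples across one full cycle

# L1 and L2 measured against neutral: same amplitude, 180 degrees apart.
l1 = [PEAK * math.sin(2 * math.pi * k / N) for k in range(N)]
l2 = [PEAK * math.sin(2 * math.pi * k / N + math.pi) for k in range(N)]

# Voltage between the two hot legs is the difference of the two waveforms.
l1_to_l2 = [a - b for a, b in zip(l1, l2)]

print(round(rms(l1)), round(rms(l2)), round(rms(l1_to_l2)))  # 120 120 240
```

Because the two legs are exact opposites, the difference has twice the amplitude of either leg, which is where the 240 V between L1 and L2 comes from.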
1. You no longer have the nice property that unplugging it guarantees (more or less) that it isn't electrified.
2. You open up the possibility of mains voltage from one plug appearing on the unplugged prongs of the other plug.
3. It possibly messes with RCDs, depending on what you do exactly.
Although in this case it's probably fine because he's just plugging totally separate power supplies in and they're already fully enclosed.
The picture shows two power supplies. Powering what is effectively one appliance from different circuits is a definite no-no, and I can't think of any circumstance where it wouldn't be in a home.
If his mains supply was sufficient to run the server and the house in the first place then the simplest solution would be to simply upgrade one of the MCBs/RCBOs on one of the circuits to the required capacity. I am not sure a landlord would even notice something like that, and if the house is wired correctly in the first place, it's unlikely to be dangerous. So going from say, 6A to 12A, on a 20A mains supply is generally fine if the gauge of wiring is correct.
- muscle cars, with all the stuff, driven occasionally.
- boats, that don't get taken out much
- gamer x, where x=system or laptop or keyboard or mouse or desk or glasses or mousepad or speakers or ... usually with "> too much RGB"
- children
$48k for something constructive even if ai related? no problem, refreshing even.
I didn't mean food, shelter, medical, education, lego, others-where-required-by-law.
However much it has cost me monetarily, it has repaid itself ten times over in value to my very soul.
Why didn't they just put a higher amp breaker in the box?
Someone needs to solve proper distribution of packaged GPUs with some Tesla-like wall connector for a consumer grade box that is plug and play.
Maybe John Ternus ends up doing that at Apple since they sit closer to this consumer profile.
It seems that he managed to get what he wanted from the hardware and I'm happy for him.
He said something interesting at the beginning of his post: he compared the cost of the hardware to the cost of his time based on his FAANG salary. That is an interesting way to think of this, but the rest of the article didn't make it clear to me whether, in the end, he saved money or time compared to just renting in the cloud.
Also, outside of the power cost, hardware has other costs too: you need to operate it, maintain it, set it up, etc., all of which require time. I mean, even the process of figuring out whether it had a good enough ROI compared to cloud takes up your time (collecting data, analyzing data, etc.).
The real question is whether or not they could have done whatever it is they did with less hardware. Is there a business idea here that could have been proven on cheaper hardware that could be upgraded as demand increased? Is the expected ROI there based on future earnings?
Absent any indication that this was needed in the first place, I can only conclude that it wasn't worth anything.
> UPDATE: Launch was a success! 400K+ views, and multiple companies reached to use my IP. Read more here[0]
[0]https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
Was it worth it to spend that amount up front, yak shave while building the system, etc. vs. pay for cloud GPUs? Probably not in terms of dollars, when their time is also valued in dollars.
Was it worth it for this person? It seems, unequivocally, yes.
Abstract/TLDR: LLMs are notoriously formulaic at writing, overusing certain tokens or phrases. I show that models trained with SFT fail to match the distribution of the training data by using Maximum Mean Discrepancy (MMD), Judge Model Quality (JMQ), and L2 Token Distribution.
There was a time in this industry that it paid about as well as an accountant and people did it because they loved what they did. Then the money flooded in, a bunch of people switched majors from business to CS, washed out in industry, got their MBA, and became product managers and engineering managers and sucked all joy from it. God bless those that find that joy again.
So only 2 options in this profession are 1) sell your soul to 1 of 5 evil corporations that just so happen to also pay excessively well or 2) choose to be unemployed for years while spending a significant amount of money on hardware trying to turn a hobby into a business
Also by your reasoning, these GPUs are blood diamonds and the authors future product/business should warrant preemptive boycott by all the perfect people like you
The raw infra being local didn't enable any of that. Now if he was building ASICs at TSMC, that would be a different thing, because you'd then be using something different locally.
I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.
The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise), plus you don't have to think about billing/budgets/etc. anymore.
Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 that much in agentic coding/tasks to justify the price.
So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)
Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)
Are you willing to share any lessons learned, etc. that I could make use of? We are evaluating paying for a SOTA sub or trying this, and the talk about Qwen3.6-27B makes me want to try deploying this machine.
It's not even a real comparison if they are actually using them for coding.
If you are deploying always-running agents (e.g. monitoring logs and services) then sure - a Qwen local server is a good choice. But for coding, the cost in productivity of using a lower-performing model is way too high.
But in terms of actually running a dev team - you are free to use Qwen or another quantized local model that can run on an RTX 5090 for coding if it makes you feel more independent. However, you would struggle and spend many more hours achieving the same thing, with a lot more debugging time, long delays before it's done, and many more prompts.
It's just not the right approach. I use Qwen and other local models all the time, but for more clearly defined monitoring and classification tasks.
For continuous all-day work you definitely need a higher tier sub level.
I'm actually looking into deploying a GPU at my company because we cannot give out our code. Qwen 3.6 looks good.
For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.
If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.
If you truly want to scale up, you should get the 8xH200 with NVLink.
They are wise to be skeptical! It is neither a promise nor zero data retention.
Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies to the exclusively eligible enterprise partners who can even qualify for a ZDR agreement with Anthropic:
> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.
> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....
This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.
Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.
Given that all labs need to diversify to become profitable, they’ll end up competing with their customers, and there's nothing that exposes a business more than having AI offload every job function for every account, every mail, etc.
Assuming this won’t be an issue is naive at best.
Join the RTX6kPRO tribe!
A normal engineer may be running a couple of sessions with every session spawning sub agents left and right.
80 persons or even 10 having this workflow on this setup doesn't work, and this is the standard engineer workflow today.
with a single 5090?
idk i imagine you'll hit less edges with a larger model just because.. more data
if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible
i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p
I know it's not the same. But a lot of people buy expensive GPUs, just to find out they have no real use for smaller models.
I envision NixOS at the core... then everything I need virtualized on top with KVM/QEMU. Maybe a dual boot setup with Windows for gaming and Flight Simulator (but I could virtualize that too with easy GPU passthrough.)
Lingering questions I'm working to figure out:
- Will 2 RTX Pro 6000s run on a 1600 watt PSU? Not sure how much higher I can go without calling an electrician. (standard US home.)
- Assuming I plop this into my home office, should I expect the PC to run significantly hotter than my current rig? (3960x threadripper, 128GB RAM, 1600watt psu, overclocked and watercooled 4090.) My water temp, measured at radiator, is about 60c at peak load. (This is the only number I care about, as this is what I have to consider to be comfortable sitting next to it.)
- 512 GB
- Epyc 9684x
- 2x RTX 6000 Pro
- 1400 W PSU x 2 but in redundant mode
Mine is in a colo where it stays nice and cool. In my case, I went with less RAM and more GPUs (bought 4). Secondarily, the Max-Q blower version of an RTX 6000 Pro Blackwell is easier to keep cool and also only needs 300 W at the cost of very little performance. The non-Max-Q cards also only really use 300 W during inference, but the good thing about lower power use is that you can put more GPUs in very safely.
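For the PSU question above, a rough back-of-the-envelope sketch may help. Everything here except the 300 W figure quoted above is an assumption on my part: 600 W as the worst-case board power for a non-Max-Q card, 400 W for the rest of the system, 90% PSU efficiency, a 120 V US circuit, and the common rule of thumb of loading a breaker to 80% for continuous use. Check your own hardware and local code before relying on any of it:

```python
def wall_watts(gpu_watts, gpu_count, rest_watts, psu_efficiency):
    """AC power drawn at the wall for a given DC load."""
    dc_load = gpu_watts * gpu_count + rest_watts
    return dc_load / psu_efficiency

def circuit_budget_watts(volts, breaker_amps, derate=0.8):
    """Continuous power a branch circuit can supply under an 80% derating rule."""
    return volts * breaker_amps * derate

REST_WATTS = 400        # assumption: CPU, RAM, drives, fans, pump
PSU_EFFICIENCY = 0.90   # assumption: varies by PSU model and load

for gpu_watts in (300, 600):   # 300 W from the comment above, 600 W assumed worst case
    draw = wall_watts(gpu_watts, 2, REST_WATTS, PSU_EFFICIENCY)
    for amps in (15, 20):
        budget = circuit_budget_watts(120, amps)
        verdict = "fits" if draw <= budget else "exceeds"
        print(f"2 x {gpu_watts} W GPUs: {draw:.0f} W at the wall {verdict} "
              f"a {amps} A circuit ({budget:.0f} W continuous)")
```

Under these assumptions, two cards held to 300 W fit comfortably on an ordinary 15 A circuit, while two cards at 600 W each put the DC load right at a 1600 W PSU's rating and need a 20 A circuit.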
I assume you want the Threadripper Pro to maximize single-core performance? So you're spending a lot of time on CPU? Interesting stuff.
I gained a lot putting the machine somewhere else. TTFT on a thing like this is between 100-800 ms depending on batching and model size and so on, and your nearest datacenter is likely <10 ms. It sits on nice dual redundant power in a place where it's blown icy cool.
Good luck with your setup. If you get around to it, and end up writing about your setup on a blog, do share. Email in profile.
I'm just putting a 2nd hand 12gb 3060 into my lab box, but it's only for use with HA/Paperless/Plex etc. type things. I don't need multi-model agentic behavior for private use.
If I did, I reckon I'd be renting infrastructure rather than filling my home with that sort of gear.
I am not even going to pretend that this is a financially reasonable option. I simply wanted to have local models. Maybe down the line, as cloud models become less subsidized, I might benefit from having a local setup, but for now, it wasn't the most prudent financial decision.
But one big benefit is that I never have to worry about my account being randomly banned, nor do I have to worry about running out of quota. I still use codex and opus for some specific tasks, but as tools are improving, I need them less and less.
I feel like there is some very deep generalizable wisdom buried here.
Also, sorry for the noob question, but doesn't a server like this generate an enormous amount of heat? Did you not use any special cooling system?
I find the "independent researcher" business model quite interesting. In the linked post he writes """DFT is a proprietary training algorithm, however, I’m currently offering a beta for a model training service where I will train your model for you using DFT.""" I'm curious how successful this is. Essentially market some AI breakthrough as a service instead of publishing a paper like my academic brain is trained to do.
As an aside, one thing that I always loved about our field was that the startup cost for many business ideas was "a laptop, internet connection and some grit". In the age of AI it's quite a bit more and I feel one of the sad side effects of this is that it crowds out poorer and younger developers.
Provisioned capacity is a really high end thing. I feel like you'd need to be spending more than $1000/day on tokens for this model to make any sense. You lose a lot of flexibility once you start dumping capital into specific pieces of hardware. Maybe start by renting the GPU server for a few days...
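To put a rough shape on that threshold, here is a minimal break-even sketch. The $48k hardware figure is the one mentioned elsewhere in this thread; the power draw and electricity price are my own assumptions, and the function ignores your time, maintenance, resale value and any quality gap between local and hosted models:

```python
import math

def break_even_days(hardware_cost, daily_api_spend, daily_power_cost):
    """Days until avoided API spend covers the hardware, or inf if it never does."""
    daily_savings = daily_api_spend - daily_power_cost
    if daily_savings <= 0:
        return math.inf
    return hardware_cost / daily_savings

HARDWARE_COST = 48_000    # figure quoted in this thread
AVG_KW = 1.5              # assumption: average draw in kilowatts
PRICE_PER_KWH = 0.30      # assumption: dollars per kWh
daily_power = AVG_KW * 24 * PRICE_PER_KWH

for spend in (50, 200, 1000):   # hypothetical daily token spend in dollars
    days = break_even_days(HARDWARE_COST, spend, daily_power)
    print(f"${spend}/day in tokens -> break-even after about {days:.0f} days")
```

The point is only that the payback period scales inversely with what you would otherwise spend per day, so the answer is dominated by how heavily the hardware actually gets used.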
Is that California?
I wonder how much worse just a bunch of Intel Arc B70s might have been, software fuckery aside. Ofc if I needed to run local inference or simple fine tunes and learn stuff, I’d probably get one of the SFF options - Mac Minis and all of those Sparks or new AMD AI chips. Then again, I’m broke so go figure.
I just fork over some money every month to Anthropic, have been trying out more DeepSeek and also Mistral (their Vibe tool is surprisingly passable under WSL).
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
privacy has a steep cost
1) Was the energy bill factored in? 2) Have you extracted any comparable value out of this?