Indexing a year of video locally on a 2021 MacBook with Gemma4-31B (50GB swap)

404 points by asenna a day ago | 119 comments

andai 21 hours ago |
Awesome. Say, this is very comprehensive.
I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.
asenna 20 hours ago |
Thanks! I was honestly casually trying it out on the side with Claude's help. And I was actually pleasantly surprised to see how good the result was.
Still blows my mind I can do all this from my 2021 MBP.
I'll try to do a post once I have the next steps working (helping with planning and editing videos with Davinci Resolve).
ahknight 20 hours ago |
I also have a 64GB M1 Max and am similarly impressed with what that workhorse can do. The M5 tempted me -- a lot -- but then I looked at what I was already getting done on that machine and just couldn't justify it ... yet. Someday, surely, but not yet. Gemma4 gave all my local projects new life, just like what you did here.
Great job. Long live the M1 Max!
asenna 19 hours ago |
100%
Although knowing how good these local models are getting, I am now eyeing the upcoming M5 Ultra Mac Studio (256gigs perhaps). But knowing how crazy the market is, it might be a year before I get the chance to get my hands on it. If it even launches by WWDC.
throwa356262 21 hours ago |
I ran Gemma on a 2015 thinkpad to do something similar. Fortunately, I could upgrade the memory otherwise it would have been a painful exercise.
Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.
iMerNibor 18 hours ago |
> the fans spinning at max speed
This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?
I suppose sometimes it is just an analogy for "it's utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts
0xbadcafebee 18 hours ago |
Fans shouldn't be running at max speed if the model fits in RAM with room to spare for context. Usually fans max out when the model doesn't fit and the CPU is chugging to make up the difference (or the user didn't tune LLM settings)
dist-epoch 17 hours ago |
What people complain about is when they visit a blog with two images and the fans are spinning at max speed because the blog has 100 trackers.
overfeed 14 hours ago |
> I've definitely had people say it as an actual complaint in different contexts
I think fan loudness is an outgrowth of conspicuous consumption because a certain OEM decided to make it a marketing bullet-point.
I was equally disappointed by people - especially device reviewers - banging on the drum that phones made of plastic "didn't feel premium", and we got phones with glass backs that have to be shoved into plastic cases (because plastic is the near-perfect material to protect fragile phone screens and innards)
egorfine 20 hours ago |
> generative AI video has no place on a real travel brand
I am pretty sure that the vast majority of Airbnb hosts would not agree with you.
> equals TripAdvisor crucifixion
I have no idea how the Airbnb hosts with fake listings survive, really.
asenna 20 hours ago |
Haha. It's honestly something that I've been struggling with myself. I'm running this safari lodge but I don't want to go down that route of slop videos!
But on the other hand, genuine videos do take time and slow down the process.
desro 20 hours ago |
> The skill is open at ~/.claude/skills/video-index/. If you're working on something similar (indexing personal archives, getting a local model to do real archival work, building agents that drive editing tools), I'd be glad to compare notes.
When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?
asenna 20 hours ago |
Oops! My bad. Fixing it now. And yeah, I can share the Skill file. Give me 5 mins.
asenna 20 hours ago |
Ok I scrambled to finalize a name for it and create a new repo for it - https://github.com/Simbastack-hq/framedex
PS - I just put this together in the last few mins, removed my personal files and references. So it's not tested properly, please let me know if any issues.
It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.
So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.
I'm tired of having a backlog of thousands of images and videos, leaving it for later.
jaggederest 19 hours ago |
Hey friend, try something in this ballpark, your post has a bunch of painful AI tropes:
https://github.com/blader/humanizer
You get a pass here because you're doing really cool stuff but it's kinda tough to read past the AI nonsense, and it's relatively easy to screen out "it's not x it's y" kind of things and the bolded bullet points.
asenna 19 hours ago |
Thanks for this! This is exactly what I was looking for.
Tbh, I have a lot of thoughts and ideas and things to share and I do spend time and effort trying to de-AI-ing it but this should help a lot.
I'll try it out.
In fact, I was expecting getting shit on by HN readers for this but was pleasantly surprised that readers moved past it.
jaggederest 18 hours ago |
Yeah I think you'll find these days that there's a lot of respect for substance like what you're doing, even past the noise of the AI. I also use a lot of AI but you really have to demand quality from it, whether it's writing, media, or code. It's clear you've got the taste from your media work, and we're all still learning as we go, so I'm very glad that I could point you in that direction.
refulgentis 18 hours ago |
I'm curious: how, exactly, did it go from this is painful to read due to AI, to no one cares about AI use and you demanded quality when you used it and delivered?
jaggederest 18 hours ago |
It didn't, it went from "this reeks from AI after edits, here's a tool that can help" to "people can read past it but there are better ways, you must demand quality". I don't think those two things are inconsistent.
refulgentis 18 hours ago |
Ah, I see, after he uses the tool it'll be great because he has taste.
jaggederest 18 hours ago |
I don't think "if you iterate on this, try using some tools, and ultimately demand that the output meet or exceed your demonstrated taste in other domains" is a hot take, honestly.
refulgentis 18 hours ago |
It's not a hot take, you're right, I gravely misunderstood the timing in your post, i.e. you were clearly framing it as after and being polite and encouraging.
I'm more hot about it because it's frustrating having so many HN posts be a place for people to work out first drafts, especially when the first piece of feedback is "hey, uh, you clearly used AI and it's horrible to read as a result." So easy to avoid...good on you for being kinder.
(part of my frustration is I was excited because I write a local LLM client and thought I missed that Gemma 4 has streaming video input support, but after reading through the slop it turns out it's just the ol' "extract frames" workflow. tbf that would have happened AI or not, but put me in a mood)
jaggederest 18 hours ago |
No worries, text is hard whether there's AI involved or not - I, in turn, mistook your clarification as a snarky "ah well of course if they try harder it'll be fine", my apologies for that. I share your frustration, but the best way I think is to educate not remonstrate unless they're someone who should clearly know better[1]
[1] https://news.ycombinator.com/item?id=48172536
AlecSchueler 17 hours ago |
I think you missed an important distinction being made:
> I also use a lot of AI but you really have to demand quality from it, whether it's writing, media, or code. It's clear you've got the taste from your media work, and we're all still learning as we go...
Their use of AI for "media work" has shown a taste but their writing usage still needs to equal that.
refulgentis 18 hours ago |
They haven't: this is the top thread, and the entire thread is saying it's unreadable and explaining step by step how to do the basics you should have done before you posted. I'm not sure why you're pleasantly surprised, I would have expected embarrassed, and taken down the HN post to get at least the basics down before sharing it under my name (if possible, dunno how HN submissions work)
asenna 18 hours ago |
Unfortunately will have to disappoint you, can't get embarrassed easily. In fact when all of this worked well locally, felt pretty proud ngl.
constantius 2 hours ago |
It's quite sad that you're feeling pride largely for your ability to write a prompt, and it's sadder that you're being snarky with someone who expects more from HN users.
Your behaviour is not affecting the HN community in a positive way.
repparw 18 hours ago |
if you care for some feedback about the writing, dropping the link and saying "PR's are open!" would land probably equal or better, and would reduce noise on the message. as sibling said, substance and noise
jaggederest 16 hours ago |
That's actually a really good point, blog posts as open source
asenna 14 hours ago |
Agreed.
To be honest, my literal thought process initially when writing was:
- I think this is cool, I should probably open source this
- No wait, I'm again over planning, no one's gonna read this and the problem is probably too specific to me for anyone to care.
So I just mentioned "lets compare notes if anyone else trying".
Hence you can see from the comment above, I immediately realized I made a mistake when the parent asked for the Skill file. Should've had the link ready. Pleasant surprise.
bonoboTP 18 hours ago |
I don't dislike those tropes because they are frequent or because they are not pleasing to read intrinsically. I dislike them because it tells me it was made by AI and AI output varies strongly in quality and most of it is low on insight but rings the right bells to make it seem insightful. It indicates a lack of human care.
Hiding these clues by another AI pass doesn't solve the core problem. Now you just end up with content that camouflaged better but is still equally low in nutritional value.
cortesoft 15 hours ago |
I feel like human copywriters have been using those same tricks for clickbait articles for years…
nchmy 14 hours ago |
Hence I've hated them all since before Ai. But now I'm utterly repulsed by it
squeaky-clean 13 hours ago |
"My AI writing isn't slop. It's just as good as buzzfeed lists or celebrity gossip articles."
dvfjsdhgfv 3 hours ago |
Sure, but the omniprevalence of LLMs just crystallized these into clearly recognizable patterns. Just like cliches, but not being limited to simple phrases.
Forgeties79 14 hours ago |
I dislike them because I find they generally don’t give any useful information OR if the information is in fact useful, it could do it with a fraction of the words.
Vigorous writing is concise.
fenix1851 6 hours ago |
it’s a different story.
I was highly interested in reading this article from start to finish.
Ofc there were a lot of slop moments, but the author's experience itself is great!
And i genuinely don’t care if he would share it through LLM article.
Just please remove slop markers :)
yellow_postit 16 hours ago |
As someone that naturally used a rule of 3 and em dashes I hate AI for taking that away from me.
jaggederest 16 hours ago |
Agreed, I find myself avoiding constructs I would use naturally because they read as AI - "not just because other people would judge them, but because I also notice and dislike them".
Zababa 18 hours ago |
Btw I like your article, it does feel a bit AI generated but I think the problem and setting are interesting enough that it was a pleasant read.
embedding-shape 19 hours ago |
We just got a modern example of the classic message from a friend who just picked up programming, containing: "I just created my own web app, wanna check it out? It's here: http://localhost:8080"
z2 14 hours ago |
I've been getting this weekly from colleagues. It's very much an epidemic right now! And the port number is indeed almost always a random number between 8000 and 8100.
a012 11 hours ago |
Wait until they discover the port number could go over 9000
dvfjsdhgfv 4 hours ago |
> I've been getting this weekly from colleagues. It's very much an epidemic right now! And the port number is indeed almost always a random number between 8000 and 8100.
Really? A bit hard to believe, unless you have many dumb colleagues.
aurmc 34 minutes ago |
A lot of people here have colleagues who are playing with Claude Code but who have essentially no experience with development at all.
It’s not at all surprising. Not everyone is a developer.
m463 12 hours ago |
reminds me of telling a friend:
I hacked your system: file:///etc/passwd
taneq 8 hours ago |
There was a Userfriendly comic with Miranda telling some ‘hacker’ “my IP addy is 127.0.0.1, come get some”.
867-5309 3 hours ago |
https://nitter.net/pic/orig/media%2FCrxXxYlWYAAjGJN.jpg
0x38B 6 hours ago |
Different context, but I sent a message like that in Signal the other day to a family member with a link to my IP, pointing to `python -m http.server` running in a directory with a file for them to try (1). Easier than having them open my Samba share.
1: To get an Android app working that has been delisted and requires a 'key' app that you purchase. We did purchase it, but didn't think to make any backups.
egorfine 20 hours ago |
Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).
This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.
busfahrer 20 hours ago |
I have been contemplating an M5 Pro MBP, but for the life of me I wasn't able to find benchmarks for real-world models. Do you happen to know how many tokens per second roughly you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?
ahknight 20 hours ago |
I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE
egorfine 20 hours ago |
Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.
Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.
Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.
busfahrer 16 hours ago |
Great, thanks! :-) and to mirror another poster: what kind of prompt parsing (prefill) speed do you get for that model? Also how is the speed for the 27B model?
egorfine 15 hours ago |
35B: 1300-1800 t/s on both Q4 and Q6.
27B: give me 20 minutes
busfahrer 2 hours ago |
Thank you, good sir!
juancn 19 hours ago |
I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server
brcmthrowaway 19 hours ago |
Prefill speed and 27B number?
embedding-shape 19 hours ago |
You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.
egorfine 15 hours ago |
Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp.
egorfine 15 hours ago |
Native MCP:
For Qwen 35B enabling native MCP on MLX models slows it down by 10%.
For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.
(all tested on M5 pro).
satvikpendem 20 hours ago |
Unsloth Studio [0] is what I recommend these days, open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support now merged in you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS' Metal GPU acceleration but that's not integrated into Unsloth Studio just yet.
[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide
egorfine 20 hours ago |
I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.
regexorcist 19 hours ago |
Curious if you tested llama.cpp and still found oMLX faster? I haven't tried the latter myself, might give it a go.
egorfine 18 hours ago |
Oh yeah I did test various solutions and different settings and quants
Llama is about 1/3 slower on Apple Silicon.
mft_ 19 hours ago |
I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.
satvikpendem 9 hours ago |
Yeah it still has some issues on the UX side. It works fine resuming though, just select the same model again and it'll resume the download, the only issue is there isn't a dedicated download page as that would help a lot.
What's better about Unsloth Studio vs LM Studio is it tells you exactly what quantization to use especially as Unsloth ones are quite good, and that it has web search and self-healing tool calls so having a web-searching local ChatGPT alternative is very easy to spin up.
asenna 14 hours ago |
Thanks! Videos is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.
You know what I REALLY want? Just point this beast at the folders and it tell me which 150 shots are good to process from these 1,500 images. That's the dream!
Although the technology is getting there, it's still a very difficult problem to solve. Taste and art is subjective. Also me as a photographer will always be concerned - "what if my best shot was in one of these rejected shots".
But yeah, I think I'll try to do some more of these experiments soon.
endymi0n 13 hours ago |
there’s a lot of open models out there… I told Claude to do a weighted score on several models and deduplicate by CLIP similarity for an expedition, should be easy to replicate (see below). Sure doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at me not having to wade through the bottom 80% of mediocre shots and dupes!
—-
“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.
Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
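
For anyone replicating this, the combine-and-dedupe step might look roughly like the sketch below. It assumes each model's score has already been normalized to 0..1 and that a CLIP embedding per photo is available; the `photos` records and function names are made up for illustration, and the greedy "keep the best, drop near-duplicates" strategy is one plausible reading of "deduped at 0.85 CLIP similarity", not necessarily what Claude wrote.

```python
import math

# Weights quoted in the summary above.
WEIGHTS = {"nima": 0.4, "laion": 0.3, "musiq": 0.3}


def combined_score(scores):
    """Weighted sum of per-model scores (each assumed normalized to 0..1)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_and_dedupe(photos, threshold=0.85):
    """Visit photos best-first; skip any that is too similar to one already kept."""
    kept = []
    ranked = sorted(photos, key=lambda p: combined_score(p["scores"]), reverse=True)
    for photo in ranked:
        if all(cosine(photo["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(photo)
    return kept


# Toy data: b.jpg is a near-duplicate of a.jpg with a lower score.
photos = [
    {"name": "a.jpg", "scores": {"nima": 0.9, "laion": 0.5, "musiq": 0.7}, "embedding": [1.0, 0.0]},
    {"name": "b.jpg", "scores": {"nima": 0.8, "laion": 0.4, "musiq": 0.6}, "embedding": [0.99, 0.05]},
    {"name": "c.jpg", "scores": {"nima": 0.3, "laion": 0.9, "musiq": 0.5}, "embedding": [0.0, 1.0]},
]
print([p["name"] for p in rank_and_dedupe(photos)])
```

The pairwise comparison is quadratic in the number of kept photos, which is fine for a few thousand shots but would want an index for much larger sets.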
asenna 11 hours ago |
Very interesting! Wasn't aware of these. I'll be exploring them soon. Thanks
herf 20 hours ago |
Two questions:
1. What is the search index?
2. The "description.md" example has things like "faces -> cluster_id". Is this from Davinci Resolve's face index? Things like faces+names and locations are really important with photo collections, but general LLMs don't handle them so well.
asenna 20 hours ago |
1) It's just simple plain-text `.description.md` sidecar files, one per clip, sitting next to each video.
Something which I can query later - Like when brainstorming with Claude "I wanna make some videos of the Luxury rooms in the lodge" and it knows what all videos could help here (going through the files).
There's also a file at the root of each folder that aggregates the text descriptions to make things easier to find.
I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png
2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.
Faces come from insightface (the open-source buffalo_l pack - RetinaFace for detection), running locally on CPU. For each clip it detects faces in the sampled frames, embeds them, and writes rows to ~/.framedex/faces.db.
Tbh, I know this part is building up in my local DB but I haven't tested how good it is. Will check it out properly soon.
But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.
----
Faces → insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.
Locations → EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.
The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).
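
A rough sketch of how plain-text sidecars like these could be queried without any database. This is an illustration only, not framedex's actual code: the `search_sidecars` helper is hypothetical, and it assumes sidecar names end in `.description.md` and that a simple case-insensitive keyword match is good enough.

```python
from pathlib import Path


def search_sidecars(root, *keywords):
    """Return sidecar paths whose text contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    hits = []
    # Walk the whole tree; sidecars sit next to their video files.
    for sidecar in sorted(Path(root).rglob("*.description.md")):
        text = sidecar.read_text(encoding="utf-8", errors="ignore").lower()
        if all(k in text for k in wanted):
            hits.append(sidecar)
    return hits
```

Something like `search_sidecars("/path/to/footage", "luxury", "room")` would then list the candidate clips, which is roughly what an agent does when it greps through the descriptions.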
asenna 20 hours ago |
UPDATE: Quickly created a repo for this - https://github.com/Simbastack-hq/framedex (MIT License)
It's not tested properly after I genericized it. Will try to go through it properly and add more updates.
Two big things on my TODO:
1) Make use of this indexing and using Claude's help, make video editing faster with Davinci Resolve (now that I have a good index of all the content)
2) I currently did this for videos, but I want to add more things to this for my thousands of still images of my camera - need to make sense of them. So I'll be working on this as well.
brcmthrowaway 19 hours ago |
So do they run the lodge or what?
asenna 18 hours ago |
Hi. I wrote this article - yes, I do run a safari lodge in Maasai Mara, Kenya. It's amazing. Ask me anything if you're interested in knowing more.
(Also email is in my profile).
theodorewiles 19 hours ago |
My take is that B2C AI applications are kind of structurally limited by how hard it is to build personalized context.
The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.
enos_feedler 19 hours ago |
Is it really local models that unlock this? Surely stateless model APIs would yield the same benefits? I get that local can be “cheaper” depending on usage, but we’ve been renting storage and compute from clouds at a premium for ages..
asenna 18 hours ago |
A huge thing here was the massive amount of data that was just processed - I went through about 1TB of files over 24 hours.
Using API to analyze even a subset of this would've been painful imo.
enos_feedler 17 hours ago |
I thought about that in this video case and it's true. I thought the parent comment was making a broader statement about local models in general. But even with video, if it was stored in private cloud storage near the LLM could this still have worked efficiently? What are the most painful elements of this whole setup / work environment if everything was cloud?
asenna 15 hours ago |
Oh yes, if everything is cloud, then this is a non-issue.
The few other points of consideration would be:
1) Cost - I was considering using Sonnet for this but there's always the concern of reaching limits OR the API cost if you're using the API.
The feeling of knowing you have a capable model in your hands without any limits is actually pretty awesome. Your mind starts running at what else can I throw at it to do grunt work.
2) Privacy issues - same as with moving to cloud.
3) Reliability issues - I know from experience Claude uptime has been pretty bad the past few months
4) Restrictions - Claude has been pretty heavy handed with their restrictions lately, anything which remotely triggers their flags gets an instant denial (or worse, an account ban). Often these are false-positives.
I love the value I get from Claude but there's a different kind of freedom you get with local, capable models.
asenna 18 hours ago |
Definitely agree with this. Here, me and Claude brainstorming together did that Research, and some trial-and-error to get to this.
But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends be able to just say "Make sense of all these videos in my folder" and it just does it.
michaelbuckbee 17 hours ago |
I made a B2C AI app that's fully local (and free) to do AI based contextual file renaming.
So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.
But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.
https://finalfinalreallyfinaluntitleddocumentv3.com/
ntcho 7 hours ago |
absolutely love the domain here. great taste
asenna 2 hours ago |
This is cool. And yeah love the name!
Are you planning to open source it? Or maintain it in the future?
michaelbuckbee 29 minutes ago |
My plan was to just see if anyone wanted to actually use it first. That if I couldn't give it away I'd not invest the time in selling or open sourcing it.
I'd sort of designed it for my own needs first and hadn't thought too far beyond that.
gitowiec 19 hours ago |
Reading this text feels strange, sentences seem to be detached
cataphract 16 hours ago |
I had exactly the same impression, and I recall seeing this style other times recently. First time I thought it was just bad writing skills, now I'm thinking it's AI generated.
asenna 5 hours ago |
I'm the author, yes it is AI-assisted.
You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.
I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.
I would've put off documenting what I did over the weekend but instead, I did document everything, spent quite some time (several iterations) and effort to make sure it does not hallucinate and writes in my own tone and voice. I'm sure it could be better but the content is not made-up.
At a time where most of us software engineers have changed our workflows to let AI write 80+% of our code using agents, I feel writing is heading the same way. It then becomes a matter of taste, whether it's done well or not.
If you're looking for clues and signs of whether content has used AI, you're going to be disappointed over the next 12 months.
If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).
But this clearly allows me to make content which I wouldn't have done earlier.
zazibar 18 hours ago |
The subject matter is interesting but the amount of slop makes it difficult to read through. Yeah, it's great that you can throw your technical problems at Claude without caring much about the generated output but treating your own writing that you actually want to share with the world the same way is a terrible idea.
asenna 18 hours ago |
Tbh, I did spend a lot of time trying to ground it and de-slopify it - verified nothing was hallucinated and went through 10 iterations to get to this. It's almost like wrestling with Claude and I knew it would be tough on HN.
But because of the fear of non-perfection, I used to put away things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).
I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer
Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out. https://blog.slopit.io/this-blog-post-is-slop/
I know, things are getting difficult with all the slop around, but my personal opinion is, as the agents get better at writing, the "annoying-ness" factor reduces and pieces of substance will still be appreciated, even if it was written by agents. This and the fact that agents aren't going away.
If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.
PS - this comment was 100% hand-typed.
teach 17 hours ago |
For what it's worth, I really enjoyed this read and almost came here to comment "this is the most enjoyable llm-assisted article I've read in a while"
The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.
asenna 14 hours ago |
I'm definitely learning and hope to do better next time but your comment truly means a lot.
I kid you not, I've taken a screenshot of this to motivate me next time I'm doubting publishing :)
cold_harbor 18 hours ago |
the reason 50GB swap is even viable here is Apple Silicon's memory bandwidth. on x86 that much swap would make inference unusably slow
throwawaytea 17 hours ago |
Memory bandwidth or storage bandwidth?
bahmboo 7 hours ago |
potato potatoh
yardie 18 hours ago |
Now I have another project for this weekend! I also have tons of video and not a lot of time to index them.
ngai_aku 17 hours ago |
I’d like to do something like this for the collection of home videos I have piling up, but I’m still on 16GB M1. Any hope of getting decent results with smaller models? If not, does anyone have tips on GPU rental?
I have a Claude max sub and plenty of OpenRouter credit, but I don’t feel good about uploading my family’s private videos
oceanus 16 hours ago |
[flagged]
dang 16 hours ago |
Could you please not post generated comments to HN? It's not allowed here. See https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079.
We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.
Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.
genxy 15 hours ago |
The article itself has many AI tells. Can we update the guidelines on AI generated content ?
dang 14 hours ago |
That's a separate issue and more of a grey area still. We're thinking about it.
tefkah 40 minutes ago |
i’m glad to hear this is being considered, thank you for your efforts
clueless 16 hours ago |
This sounds like a great capability to be added to immich
asixicle 15 hours ago |
Or Stash lol
Confiks 15 hours ago |
I'm not quite sure why all that swapping is necessary. It really does age your SSD quite fast considering the enormous memory bandwidth required. Gemma 4 31B at 4-bit quantization should only be around 19 GiB [1], not 28.4 GiB. I'm not feeding it images regularly, so I'm not sure how much memory it needs to get those into context, but I can't imagine it is more than 10 GiB.
The activity monitor does show all kinds of Electron apps active, on top of a presumably model-loaded Handy and a virtual machine for Claude Code, so I guess that's the real root cause for all the swapping. If your laptop starts thrashing I can't imagine you have any use for those apps, which will grind to a halt.
[1] https://huggingface.co/mlx-community/gemma-4-31b-it-4bit
asenna 5 hours ago |
Yeah to be fair, I could've cleaned everything up, but the screenshot was taken while I was doing other work on my laptop.
Although slightly laggy, I was impressed by the fact that I was still able to work on other things and have a bunch of tabs open on my Brave browser.
genxy 15 hours ago |
Why did you destroy your own voice to have it replaced by AI ?
mainaisakyuhoon 13 hours ago |
I really struggled to read the AI slop in this.
carpo 9 hours ago |
This is great. I wish I had enough RAM for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper and ffmpeg, and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part. Using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 for an hour of footage. I can imagine it would be slow locally.
nl 8 hours ago |
Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.
There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).
https://openrouter.ai/google/gemma-4-31b-it
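To put the per-token prices above side by side, here is a small cost sketch. The prices are the ones quoted in this comment; the token counts are a made-up workload for illustration only, so plug in your own.

```python
def cost_usd(input_tokens: float, output_tokens: float,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of a workload given USD prices per million input/output tokens."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m


# Prices per million tokens as quoted above: (input, output).
PRICES = {
    "Gemma 4 31B": (0.12, 0.37),
    "Haiku": (1.00, 5.00),
}

# Hypothetical workload, not a measurement.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 200_000

for model, (in_price, out_price) in PRICES.items():
    total = cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price)
    print(f"{model}: ${total:.2f}")
```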
carpo 7 hours ago |
Cheers, I'll give it a try. How are those models at returning structured results? When I was writing the prompts for the analysis step and testing with older Claude models, it would have trouble structuring the XML consistently. Sonnet 4.6 handles it really well.
nl 7 hours ago |
Use function calling/tool use, not XML output. The models are all trained for that now.
I.e., instead of telling it to generate
<name>Name</name> <age>19</age> <address>whatever</address>
give it a function
details(name: string, age: int, address: string)
Under the hood that function is described with a JSON schema, and the models do great at it. Here are the Claude docs, but they are all similar: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
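A minimal sketch of what that function looks like as a tool definition. The `name` / `description` / `input_schema` layout follows the shape used in the Claude tool-use docs; other providers wrap the same JSON schema in slightly different keys. The description text and the `matches_schema` helper are hypothetical, added only to show the kind of check you can run on the arguments the model returns.

```python
# Tool definition for details(name: string, age: int, address: string).
details_tool = {
    "name": "details",
    "description": "Record a person's details extracted from the text.",  # illustrative wording
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "address": {"type": "string"},
        },
        "required": ["name", "age", "address"],
    },
}

# Maps the JSON schema primitive types used above to Python types.
PY_TYPES = {"string": str, "integer": int}


def matches_schema(args: dict, schema: dict) -> bool:
    """Hypothetical helper: required keys present and primitive types match."""
    props = schema["properties"]
    if any(key not in args for key in schema["required"]):
        return False
    return all(
        key in props and isinstance(value, PY_TYPES[props[key]["type"]])
        for key, value in args.items()
    )
```

The model then returns arguments as a JSON object, which you parse into a dict and check, instead of scraping tags out of free text.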
carpo 6 hours ago |
Very interesting. Thank you!
asenna 3 hours ago |
Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.
But yeah, how it picks the frames is the weak point here. Scene detection would definitely help - this is #1 on the Roadmap.
Could you share how your scene-detection picks the frames?
---
As for vector search, I made the trade-off of skipping it and keeping things simple with plain Markdown files, for portability. The knowledge travels with the files when an SSD moves, there's no index to keep in sync, and plain text outlives the tool. But the other path you mentioned is interesting to explore as well.
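To illustrate the "no index" idea, here is a minimal sketch of searching Markdown notes directly on disk. The directory layout and the `search_notes` name are assumptions for the example, not the actual tool's format.

```python
from pathlib import Path


def search_notes(root, query):
    """Return Markdown files under root whose text contains every query word."""
    words = query.lower().split()
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if all(word in text for word in words):
            hits.append(path)
    return hits
```

Because the notes sit next to the footage, calling `search_notes` on the mount point of whichever SSD is plugged in works without rebuilding anything.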
carpo an hour ago |
I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames for the entire video, then calculates an image histogram and compares similarity to the previous one. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.
You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.
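A rough sketch of the histogram comparison described above, in plain Python. This is not ClipScape's code: the bin count, the histogram-intersection similarity, the threshold, and the way the cap thins out frames are all assumptions. Frames are taken to be already decoded into rows of 0-255 grayscale values.

```python
def histogram(frame, bins=16):
    """Normalized grayscale histogram; frame is rows of 0-255 pixel values."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for pixel in row:
            counts[pixel * bins // 256] += 1
            total += 1
    return [count / total for count in counts]


def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pick_frames(frames, threshold=0.7, max_frames=10):
    """Keep the first frame, then each frame that differs enough from the previous one."""
    if not frames:
        return []
    picked = [0]
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if similarity(prev, cur) < threshold:
            picked.append(i)
        prev = cur
    # Cap the number of frames sent to the LLM by thinning evenly.
    if len(picked) > max_frames:
        step = len(picked) / max_frames
        picked = [picked[int(k * step)] for k in range(max_frames)]
    return picked
```

Lowering `threshold` keeps fewer, more distinct frames; `max_frames` bounds the cost of the request regardless of video length.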
pavlov 5 hours ago |
The content is good, but this LLM writing style gets tiresome. Everything is a revelation:
>“I bought it for Chrome. It's running a model that didn't exist when I bought it.”
Well duh, personal computers run new software. That’s literally the whole point. The Apple II didn’t sell on the strength of the preinstalled apps.
asenna 4 hours ago |
Author here. I totally hear you. I wasn't expecting this to do well on HN for exactly this reason.
But as I've mentioned elsewhere, if it weren't for all the AI assistance, I would've put off documenting everything I did and never even gotten to the writing part.
But yeah, I'll be working on the workflow to make the next write-up better, more humanized.
moinism 5 hours ago |
> Every AI video editor on the market assumes your footage is already labeled
Shameless plug: I'm the founder of Chat Octopus, an AI media assistant, and it actually 'looks' at the videos to understand them before creating a cut.
coldtea 3 hours ago |
The post is a mix of human and AI writing and the AI-mannerisms get on the nerves. At least it has a clear topic and some actionable insights and code examples.