I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.
Still blows my mind I can do all this from my 2021 MBP.
I'll try to do a post once I have the next steps working (helping with planning and editing videos with DaVinci Resolve).
Great job. Long live the M1 Max!
Although knowing how good these local models are getting, I am now eyeing the upcoming M5 Ultra Mac Studio (256gigs perhaps). But knowing how crazy the market is, it might be a year before I get the chance to get my hands on it. If it even launches by WWDC.
Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.
This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?
I suppose sometimes it is just an analogy for "it's utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts.
I think fan loudness is an outgrowth of conspicuous consumption because a certain OEM decided to make it a marketing bullet-point.
I was equally disappointed by people - especially device reviewers - banging on the drum that phones made of plastic "didn't feel premium", and we got phones with glass backs that have to be shoved into plastic cases (because plastic is the near-perfect material to protect fragile phone screens and innards).
I am pretty sure that the vast majority of Airbnb hosts would not agree with you.
> equals TripAdvisor crucifixion
I have no idea how the Airbnb hosts with fake listings survive, really.
But on the other hand, genuine videos do take time and slow down the process.
When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?
PS - I just put this together in the last few mins, removed my personal files and references. So it's not tested properly, please let me know if any issues.
It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.
So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.
I'm tired of having a backlog of thousands of images and videos, leaving it for later.
https://github.com/blader/humanizer
You get a pass here because you're doing really cool stuff but it's kinda tough to read past the AI nonsense, and it's relatively easy to screen out "it's not x it's y" kind of things and the bolded bullet points.
Tbh, I have a lot of thoughts and ideas and things to share and I do spend time and effort trying to de-AI it, but this should help a lot.
I'll try it out.
In fact, I was expecting to get shit on by HN readers for this but was pleasantly surprised that readers moved past it.
I'm more hot about it because it's frustrating having so many HN posts be a place for people to work out first drafts, especially when the first piece of feedback is "hey, uh, you clearly used AI and it's horrible to read as a result." So easy to avoid...good on you for being kinder.
(part of my frustration is I was excited because I write a local LLM client and thought I'd missed that Gemma 4 has streaming video input support, but after reading through the slop it turns out it's just the ol' "extract frames" workflow. tbf that would have happened AI or not, but it put me in a mood)
> I also use a lot of AI but you really have to demand quality from it, whether it's writing, media, or code. It's clear you've got the taste from your media work, and we're all still learning as we go...
Their use of AI for "media work" shows taste, but their use of it for writing still needs to match that.
Your behaviour is not affecting the HN community in a positive way.
To be honest, my literal thought process initially when writing was:
- I think this is cool, I should probably open source this
- No wait, I'm again over planning, no one's gonna read this and the problem is probably too specific to me for anyone to care.
So I just mentioned "lets compare notes if anyone else trying".
Hence you can see from the comment above, I immediately realized I made a mistake when the parent asked for the Skill file. Should've had the link ready. Pleasant surprise.
Hiding these clues with another AI pass doesn't solve the core problem. Now you just end up with content that is camouflaged better but is still equally low in nutritional value.
Vigorous writing is concise.
I was highly interested in reading this article from start to finish.
Ofc there were a lot of slop moments, but the author's experience itself is great!
And I genuinely don’t care if he shares it through an LLM-written article.
Just please remove slop markers :)
Really? A bit hard to believe, unless you have many dumb colleagues.
It’s not at all surprising. Not everyone is a developer.
I hacked your system: file:///etc/passwd
1: To get an Android app working that has been delisted and requires a 'key' app that you purchase. We did purchase it, but didn't think to make any backups.
This is an excellent thing to do. Especially since LLMs excel at batching, so you can index multiple photos and videos in parallel for no performance penalty.
Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.
Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.
27B: give me 20 minutes
For Qwen 35B enabling native MCP on MLX models slows it down by 10%.
For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.
(all tested on M5 Pro).
Llama is about 1/3 slower on Apple Silicon.
What's better about Unsloth Studio vs LM Studio is that it tells you exactly what quantization to use (the Unsloth ones are quite good), and it has web search and self-healing tool calls, so a web-searching local ChatGPT alternative is very easy to spin up.
You know what I REALLY want? Just point this beast at the folders and have it tell me which 150 shots are good to process from these 1,500 images. That's the dream!
Although the technology is getting there, it's still a very difficult problem to solve. Taste and art are subjective. Also, as a photographer, I will always be concerned - "what if my best shot was in one of these rejected shots".
But yeah, I think I'll try to do some more of these experiments soon.
---
“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.
Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
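For anyone wanting to try the same kind of ranking, here is a minimal sketch of the weighted combination and the similarity-based dedupe quoted above. Only the weights (0.4 / 0.3 / 0.3) and the 0.85 threshold come from the quote; the function names, the assumption that each model's score is already normalized to [0, 1], and the toy 2-dimensional "embeddings" are made up for illustration.

```python
import math

# Weights and threshold as quoted above; everything else is illustrative.
WEIGHTS = {"nima": 0.4, "laion": 0.3, "musiq": 0.3}
DEDUPE_THRESHOLD = 0.85


def combined_score(scores):
    """Weighted sum of per-model scores (assumed normalized to [0, 1])."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_and_dedupe(photos, threshold=DEDUPE_THRESHOLD):
    """Sort by combined score, then greedily drop near-duplicates of kept photos."""
    ranked = sorted(photos, key=lambda p: combined_score(p["scores"]), reverse=True)
    kept = []
    for photo in ranked:
        if all(cosine(photo["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(photo)
    return kept


# Toy data: b.jpg is a near-duplicate of a.jpg and scores slightly lower.
photos = [
    {"name": "a.jpg", "scores": {"nima": 0.9, "laion": 0.2, "musiq": 0.6}, "embedding": [1.0, 0.0]},
    {"name": "b.jpg", "scores": {"nima": 0.8, "laion": 0.3, "musiq": 0.5}, "embedding": [0.99, 0.05]},
    {"name": "c.jpg", "scores": {"nima": 0.4, "laion": 0.9, "musiq": 0.7}, "embedding": [0.0, 1.0]},
]
kept_names = [p["name"] for p in rank_and_dedupe(photos)]
```

Because the dedupe is greedy over the ranked list, the higher-scoring photo of each near-duplicate pair is the one that survives.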
1. What is the search index?
2. The "description.md" example has things like "faces -> cluster_id". Is this from DaVinci Resolve's face index? Things like faces+names and locations are really important with photo collections, but general LLMs don't handle them so well.
Something which I can query later - Like when brainstorming with Claude "I wanna make some videos of the Luxury rooms in the lodge" and it knows what all videos could help here (going through the files).
There are also folder-root-level files that aggregate the text descriptions to make things easier to find.
I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png
2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.
Faces come from insightface (the open-source buffalo_l pack - RetinaFace for detection), running locally on CPU. For each clip it detects faces in the sampled frames, embeds them, and writes rows to ~/.framedex/faces.db.
Tbh, I know this part is building up in my local DB, but I haven't tested how good it is. Will check it out properly soon.
But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.
----
Faces → insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.
Locations → EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.
The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).
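To illustrate why embeddings are "comparable across clips", here is a rough sketch of grouping face embeddings into cluster IDs by cosine similarity. This is not framedex's actual clustering code: the greedy strategy, the `assign_clusters` name, and the 0.5 threshold are all assumptions, and the 3-dimensional vectors stand in for real face embeddings, which are much larger.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(embeddings, threshold=0.5):
    """Greedy clustering: each face joins the first cluster whose
    representative is similar enough, otherwise it starts a new cluster.
    The threshold is a hypothetical value, not a tuned one."""
    representatives = []
    cluster_ids = []
    for emb in embeddings:
        for cid, rep in enumerate(representatives):
            if cosine(emb, rep) >= threshold:
                cluster_ids.append(cid)
                break
        else:
            representatives.append(emb)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids


# Toy embeddings: faces 0, 1 and 3 point the same way, face 2 does not.
faces = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.95, 0.0, 0.05]]
cluster_ids = assign_clusters(faces)
```

The same input always yields the same cluster IDs, which is the "deterministic" property the LLM can't offer for identity.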
It's not tested properly after I genericized it. Will try to go through it properly and add more updates.
Two big things on my TODO: 1) Use this index, with Claude's help, to make video editing faster with DaVinci Resolve (now that I have a good index of all the content)
2) So far I've only done this for videos, but I want to extend it to the thousands of still images from my camera - I need to make sense of them. So I'll be working on this as well.
(Also email is in my profile).
The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.
Using an API to analyze even a subset of this would've been painful imo.
The few other points of consideration would be:
1) Cost - I was considering using Sonnet for this but there's always the concern of reaching limits OR the API cost if you're using the API.
The feeling of knowing you have a capable model in your hands without any limits is actually pretty awesome. Your mind starts racing about what else you can throw at it to do grunt work.
2) Privacy issues - same as with moving to cloud.
3) Reliability issues - I know from experience Claude uptime has been pretty bad the past few months
4) Restrictions - Claude has been pretty heavy handed with their restrictions lately, anything which remotely triggers their flags gets an instant denial (or worse, an account ban). Often these are false-positives.
I love the value I get from Claude but there's a different kind of freedom you get with local, capable models.
But I can tell it's only a matter of time before agents become smart enough to let my non-tech friends just say "Make sense of all these videos in my folder" and it just does it.
So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.
But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.
Are you planning to open source it? Or maintain it in the future?
I'd sort of designed it for my own needs first and hadn't thought too far beyond that.
You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.
I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.
I would've put off documenting what I did over the weekend but instead, I did document everything, spent quite some time (several iterations) and effort to make sure it does not hallucinate and writes in my own tone and voice. I'm sure it could be better but the content is not made-up.
At a time when most of us software engineers have changed our workflows to let AI write 80+% of our code using agents, I feel writing is heading the same way. It then becomes a matter of taste, whether it's done well or not.
If you're looking for clues and signs of whether a piece of content has used AI, you're going to be disappointed over the next 12 months.
If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).
But this clearly allows me to make content which I wouldn't have done earlier.
But because of the fear of non-perfection, I used to put away things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).
I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer
Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out. https://blog.slopit.io/this-blog-post-is-slop/
I know, things are getting difficult with all the slop around, but my personal opinion is, as the agents get better at writing, the "annoying-ness" factor reduces and pieces of substance will still be appreciated, even if it was written by agents. This and the fact that agents aren't going away.
If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.
PS - this comment was 100% hand-typed.
The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.
I kid you not, I've taken a screenshot of this to motivate me next time I'm doubting publishing :)
I have a Claude max sub and plenty of OpenRouter credit, but I don’t feel good about uploading my family’s private videos
We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.
Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.
The activity monitor does show all kinds of Electron apps active, on top of a presumably model-loaded Handy and a virtual machine for Claude Code, so I guess that's the real root cause for all the swapping. If your laptop starts thrashing I can't imagine you have any use for those apps, which will grind to a halt.
[1] https://huggingface.co/mlx-community/gemma-4-31b-it-4bit
Although slightly laggy, I was impressed by the fact that I was still able to work on other things and have a bunch of tabs open on my Brave browser.
There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).
Ie, instead of telling it to generate
<name>Name</name>
<age>19</age>
<address>whatever</address>
give it a function details(name: string, age: int, address: string)
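A minimal sketch of what that `details` function could look like as a tool definition. The `name` / `description` / `input_schema` layout follows the shape used in Claude's tool-use docs (other providers use slightly different field names); the description text and the `validate_arguments` helper are made up for illustration and are not a full JSON Schema validator.

```python
import json

# Tool definition: the function signature expressed as a JSON schema.
details_tool = {
    "name": "details",
    "description": "Record a person's details.",  # illustrative wording
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "address": {"type": "string"},
        },
        "required": ["name", "age", "address"],
    },
}

# Only the two primitive types used in this example are mapped.
PYTHON_TYPES = {"string": str, "integer": int}


def validate_arguments(arguments, schema):
    """Hypothetical helper: checks required keys and primitive types only."""
    for key in schema["required"]:
        if key not in arguments:
            return False
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None or not isinstance(value, PYTHON_TYPES[spec["type"]]):
            return False
    return True


# What a model's tool call arguments might look like once parsed.
model_output = json.loads('{"name": "Name", "age": 19, "address": "whatever"}')
is_valid = validate_arguments(model_output, details_tool["input_schema"])
```

The upside over free-form tags is that the arguments arrive as typed JSON, so there are no closing tags to get wrong.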
That is actually a JSON schema, and the models do great at it. Here are the Claude docs, but they are all similar: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
But yeah, how it picks the frame is the weak point here. Scene detection would definitely help - this is #1 on the Roadmap.
Could you share how your scene-detection picks the frames?
---
For the vector search, I went for the trade-off of not having it but keeping it simple with plain Markdown files for more portability. The knowledge travels with the files when an SSD moves, no index to keep in sync, and plain text that outlives the tool. But the other path you mentioned is interesting as well to explore.
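As a rough idea of how far plain Markdown gets you without an index, here is a sketch of a keyword search over description files. The `search_descriptions` name, the one-folder-per-clip layout, and the sample text are assumptions for the demo, not how framedex actually lays things out.

```python
from pathlib import Path
import tempfile


def search_descriptions(root, terms):
    """Return Markdown files under root whose text contains every term
    (case-insensitive substring match, no index needed)."""
    wanted = [t.lower() for t in terms]
    matches = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if all(term in text for term in wanted):
            matches.append(path)
    return matches


# Demo on a throwaway folder with a hypothetical per-clip layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clip_001").mkdir()
    (root / "clip_002").mkdir()
    (root / "clip_001" / "description.md").write_text(
        "Wide shot of a luxury room at the lodge.", encoding="utf-8"
    )
    (root / "clip_002" / "description.md").write_text(
        "Elephants at the waterhole at dusk.", encoding="utf-8"
    )
    hits = [p.parent.name for p in search_descriptions(root, ["luxury", "room"])]
```

This is exact-match only, so it misses synonyms that a vector search would catch; that is the trade-off being described.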
You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
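For readers unfamiliar with the histogram approach, here is a toy sketch of the general idea: compare the brightness histograms of consecutive frames and mark a cut when they differ a lot. This is not ClipScape's analyzer; the bin count, the 0.5 threshold, and the frames-as-lists-of-grayscale-values representation are all assumptions for illustration.

```python
def histogram(frame, bins=8):
    """Normalized brightness histogram of a frame given as 0-255 pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5, bins=8):
    """Return indices of frames that start a new scene (hypothetical threshold)."""
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0], bins)
    for index in range(1, len(frames)):
        current = histogram(frames[index], bins)
        if histogram_distance(previous, current) > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Two dark frames followed by two bright ones: one cut expected at index 2.
dark_a = [10, 20, 30, 25]
dark_b = [12, 22, 28, 31]
bright = [220, 230, 240, 250]
cuts = detect_cuts([dark_a, dark_b, bright, bright])
```

On the FFmpeg side, the `select` filter exposes a per-frame `scene` score that can be thresholded in a similar way.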
Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.
>“I bought it for Chrome. It's running a model that didn't exist when I bought it.”
Well duh, personal computers run new software. That’s literally the whole point. The Apple II didn’t sell on the strength of the preinstalled apps.
But as I've mentioned elsewhere - if it wasn't for all the AI assistance, I would've put off documenting everything that I did and never even gotten to the writing part.
But yeah, I'll be working on the workflow to make the next write-up better, more humanized.
Shameless plug: I'm the founder of Chat Octopus, an AI media assistant, and it actually 'looks' at the videos to understand them before creating a cut.