I used this to build a CLI that indexes hours of footage into ChromaDB, then searches it with natural language and auto-trims the matching clip. Demo video on the GitHub README. Indexing costs ~$2.50/hr of footage. Still-frame detection skips idle chunks, so security camera / sentry mode footage is much cheaper.
For example, right now if I search "cybertruck" in my indexed dashcam footage, which doesn't contain any Cybertrucks, it returns a clip of the next best match: a big truck, but not a Cybertruck.
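That "next best match" behavior is just how nearest-neighbor search over embeddings works: the query gets embedded, and the closest stored vector wins even when nothing is a true match. A minimal sketch with made-up 3-d vectors (the real system uses Gemini video embeddings stored in ChromaDB; these values and filenames are purely illustrative):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy clip embeddings (hypothetical values, not real Gemini output).
clips = {
    "big_truck.mp4": [0.9, 0.1, 0.0],
    "sedan.mp4":     [0.2, 0.9, 0.1],
    "bicycle.mp4":   [0.0, 0.2, 0.9],
}

query = [0.95, 0.05, 0.05]  # pretend this is the embedding of "cybertruck"

# Nearest neighbor always returns *something*: here the big truck,
# even though no clip actually contains a Cybertruck.
best = max(clips, key=lambda name: cosine_sim(query, clips[name]))
print(best)  # big_truck.mp4
```

There is no built-in "no match" outcome; if you wanted one, you'd have to threshold the similarity score yourself.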
It's a bit expensive right now, so it's not as practical at scale. But once the embedding model comes out of public preview, and we hopefully get a local equivalent, this will be a lot more practical.
Cool project, thanks for sharing!
Would love to see open-weight models with this capability since it would eliminate the API cost and the privacy concern of uploading footage.
If there is text on the video (like a caption or whatever), will the embedding capture that? Never thought about this before.
If the video has audio, does the embedding capture that too?
This very well might be a reality in a couple years though!
The presence of cameras everywhere is considerably more concerning than the status quo, to me at least, when there is an AI watching and indexing every second of every feed, and camera owners or manufacturers or governments can set simple natural-language parameters for highly specific people or activities to be notified about. There are obviously compelling and easy-to-sell cases here that will surely drive adoption as it becomes cost effective: get an alert to a crime in progress, get an alert when a neighbor doesn't clean up after his dog, get an alert when someone has fallen... but the potential implications of living in a panopticon like this, if not well regulated, are pretty ugly.
https://ai.google.dev/gemini-api/docs/pricing#gemini-embeddi...
(The code also tries to skip "still" frames, but if your video is dynamic you're looking at the cost above.)
Regardless of the file's frame rate, the Gemini API natively extracts and tokenizes exactly 1 fps. The 5 fps downscaling just keeps the payload sizes small so the API requests are fast and don't time out.
I'll update the README to make this more clear. Thanks for bringing this up.
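For anyone curious what that pre-upload downscaling step could look like: here's a hedged sketch that builds an ffmpeg command to resample a chunk to 5 fps and shrink the frame size (the filenames, target width, and exact flags are my assumptions, not necessarily what the project uses):

```python
import shlex

def downscale_cmd(src, dst, fps=5, width=640):
    # Build an ffmpeg command that resamples to `fps` frames/sec and
    # scales to `width` px wide (-2 keeps aspect ratio with an even height).
    # The Gemini API samples at 1 fps regardless; this only shrinks the upload.
    vf = f"fps={fps},scale={width}:-2"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-an", dst]

cmd = downscale_cmd("chunk.mp4", "chunk_small.mp4")
print(shlex.join(cmd))
```

Returning an argument list (rather than a shell string) lets you pass it straight to `subprocess.run` without quoting issues.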
The problems start cropping up when you get things like Flock where governments start deploying cameras on a massive scale, or Ring where a single company has unrestricted access to everyone's private cameras.
I don't think it's a good thing but it seems the limiting factor has been technological feasibility instead of any kind of principle against it.
I've been hearing warnings that AI would be used for this since well before it seemed feasible.
Imagine a Premiere plugin where you could say "remove all scenes containing cats" and it'll spit out an EDL (Edit Decision List) that you can still manually adjust.
SentrySearch already returns precise in/out timestamps for any natural-language query and uses ffmpeg to auto-trim clips. Turning that into an EDL (or even a direct Premiere plugin that exports an editable cut list) feels natural.
I’m not a Premiere expert myself, but I’d love to see this happen. If you (or anyone) wants to sketch out a quick EDL exporter or plugin, I’ll happily review + merge a PR and help wherever I can. Just drop a GitHub issue if you start something!
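Since the search already returns in/out timestamps, the exporter mostly comes down to timecode formatting. Here's a rough sketch of a CMX3600-style EDL writer, assuming matches arrive as (in, out) second offsets; the reel name `AX`, the frame rate, and the exact column layout are my assumptions and would need checking against what Premiere actually accepts:

```python
def tc(seconds, fps=30):
    # Convert float seconds to an HH:MM:SS:FF timecode string.
    total = round(seconds * fps)
    ff = total % fps
    ss = (total // fps) % 60
    mm = (total // (fps * 60)) % 60
    hh = total // (fps * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

def to_edl(matches, fps=30, title="SENTRYSEARCH CUT"):
    # `matches` is a list of (source_in_s, source_out_s) tuples, e.g. the
    # in/out timestamps a natural-language search returned.
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record = 0.0  # running position on the record (output) timeline
    for i, (src_in, src_out) in enumerate(matches, 1):
        dur = src_out - src_in
        lines.append(
            f"{i:03d}  AX       V     C        "
            f"{tc(src_in, fps)} {tc(src_out, fps)} "
            f"{tc(record, fps)} {tc(record + dur, fps)}"
        )
        record += dur
    return "\n".join(lines)

print(to_edl([(12.0, 18.0), (42.5, 47.0)]))
```

Each event places the matched source range back-to-back on the record timeline, which is the "editable cut list" shape an NLE can then rearrange.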
Thanks for sharing!
collections.lwarfield.dev
I believe you could use a combination of the select and scene parameters in ffmpeg to do this automatically each time a chunk of video is created.
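A sketch of what that could look like: ffmpeg's `select` filter exposes a per-frame `scene` change score, so you can run each new chunk through it and treat a chunk where no frames survive the threshold as still. The threshold value and filenames below are made up; you'd tune them for your footage:

```python
def motion_check_cmd(src, threshold=0.02):
    # Frames whose scene-change score exceeds `threshold` pass the select
    # filter; metadata=print writes one record per surviving frame to stdout.
    # A chunk that produces no records is effectively a still chunk.
    vf = f"select='gt(scene,{threshold})',metadata=print:file=-"
    return ["ffmpeg", "-i", src, "-vf", vf, "-an", "-f", "null", "-"]

print(" ".join(motion_check_cmd("chunk.mp4")))
```

You'd run this with `subprocess.run(..., capture_output=True)` and skip indexing the chunk when the output is empty.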