What did you expect?
>Classical music?
Nah I like hype, helps when things are slow.
* (Mute it if you don’t like the music, just like the rest of us will if you complain about the music)
Yamborghini High: https://www.youtube.com/watch?v=tt7gP_IW-1w
L$D: https://www.youtube.com/watch?v=Gx4JEBwVlXo
I wonder what the state of cinema/series/video clips will be in 30 years. Will singers/rappers give up sentences completely and just mention names of emojis? Will we need 576 Hz screens to be able to watch accelerated videos without seeing a constant blur?
I guess most kids today would fall asleep before the end of the Twin Peaks opening credits or the opening scene of Fargo.
fascinating
I wouldn't have normally read this and watched the video, but my Claude sessions were already executing a plan
the tl;dr is that all the actors were scanned into a 3D point cloud system and then "NeRF"'d, meaning any missing data about their transposed 3D models was extrapolated
this was then more easily placed into the video than trying to compose and place 2D actors layer by layer
Not sure if it's you or the original article but that's a slightly misleading summary of NeRFs.
The way TV/movie production is going (record 100s of hours of footage from multiple angles and edit it all in post) I wonder if this is the end state. Gaussian splatting for the humans and green screens for the rest?
I would say Superman's quality didn't suffer for it.
I would say cost is probably the biggest factor, but it's also just a case of "why bother": it's not CG, it's not "2D filming", so it's niche; the scenarios where you would actually need this are very few.
That said, the technology is rapidly advancing and this type of volumetric capture is definitely sticking around.
The quality can also be really good, especially for static environments: https://www.linkedin.com/posts/christoph-schindelar-79515351....
"That data was then brought into Houdini, where the post production team used CG Nomads GSOPs for manipulation and sequencing, and OTOY’s OctaneRender for final rendering. Thanks to this combination, the production team was also able to relight the splats."The gist is that Gaussian splats can replicate reality quite effectively with many 3D ellipsoids (stored as a type of point cloud). Houdini is software that excels at manipulating vast numbers of points, and renderers (such as Octane) can now leverage this type of data to integrate with traditional computer graphics primitives, lights, and techniques.
I am vaguely aware of stuff like Gaussian blur on Photoshop. But I never really knew what it does.
Gaussian splatting is a bit like photogrammetry. That is, you can record video or take photos of an object or environment from many angles and reproduce it in 3D. Gaussians have the capability to "fade" their opacity based on a Gaussian distribution. This allows them to blend together in a seamless fashion.
The splatting process is achieved by using gradient descent from each camera/image pair to optimize these ellipsoids (Gaussians) such that they reproduce the original inputs as closely as possible. Given enough imagery and sufficient camera alignment, performed using Structure from Motion, you can faithfully reproduce the entire space.
Read more here: https://towardsdatascience.com/a-comprehensive-overview-of-g....
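If it helps to see what that camera alignment buys you: once SfM has estimated each photo's pose and intrinsics, projecting any 3D point into any of the photos is a few lines, and the optimization described above compares exactly such projections/renderings against the original images. A minimal sketch with made-up toy numbers, not anything specific to a real pipeline:

    import numpy as np

    def project(point_world, R, t, K):
        """Pinhole projection: world-space point -> pixel in one aligned camera."""
        p_cam = R @ point_world + t       # world -> camera coordinates (from SfM)
        u, v, w = K @ p_cam               # camera -> homogeneous pixel coordinates
        return np.array([u / w, v / w])   # perspective divide

    # Toy camera: identity rotation, pulled back 5 units, ~1000 px focal length.
    R = np.eye(3)
    t = np.array([0.0, 0.0, 5.0])
    K = np.array([[1000.0,    0.0, 960.0],
                  [   0.0, 1000.0, 540.0],
                  [   0.0,    0.0,   1.0]])
    print(project(np.array([0.1, -0.2, 0.0]), R, t, K))   # -> [980. 500.]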
Given where this technology is today, you could imagine 5-10 years from now people will watch live sports on TV, but with their own individual virtual drone that lets them view the field from almost any point.
If you’re curious start with the Wikipedia article and use an LLM to help you understand the parts that don’t make sense. Or just ask the LLM to provide a summary at the desired level of detail.
The other two replies did a pretty good job!
Blurring is a convolution or filter operation. You take a small patch of image (5x5 pixels) and you convolve it with another fixed matrix, called a kernel. Convolution says multiply element-wise and sum. You replace the center pixel with the result.
https://en.wikipedia.org/wiki/Box_blur is the simplest kernel - all ones, and divide by the kernel size. Every pixel becomes the average of itself and its neighbors, which looks blurry. Gaussian blur is calculated in an identical way, but the matrix elements follow the "height" of a 2D Gaussian with some amplitude. It results in a bit more smoothing, as farther pixels have less influence. The bigger the kernel, the blurrier the result. There are a lot of these basic operations:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
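To make the Gaussian case concrete, here's a minimal NumPy/SciPy sketch of exactly that operation: build the kernel from the 2D Gaussian "heights", normalize it, convolve. A box blur would just be np.ones((5, 5)) / 25 instead.

    import numpy as np
    from scipy.signal import convolve2d

    def gaussian_kernel(size=5, sigma=1.0):
        """size x size grid of 2D Gaussian heights, normalized to sum to 1."""
        ax = np.arange(size) - size // 2
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
        return k / k.sum()

    image = np.random.rand(64, 64)          # stand-in grayscale image
    blurred = convolve2d(image, gaussian_kernel(5, 1.0),
                         mode="same", boundary="symm")
    # Each output pixel is a weighted average of its 5x5 neighborhood,
    # with weights falling off with distance from the center.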
If you see "Gaussian", it implies the distribution is used somewhere in the process, but splatting and image kernels are very different operations.
For what it's worth I don't think the Wikipedia article on Gaussian Blur is particularly accessible.
Happily. Gaussian splats are a technique for 3D images, related to point clouds. They do the same job (take a 3D capture of reality and generate pictures later from any point of view "close enough" to the original).
The key idea is that instead of a bunch of points, it stores a bunch of semi-transparent blobs - or "splats". The transparency increases quickly with distance, following a normal distribution - also known as the "Gaussian distribution."
Hence, "Gaussian splats".
tl;dr eli5: Instead of capturing spots of color as they would appear to a camera, they capture spots of color and where they exist in the world. By combining multiple cameras doing this, you can make a 3D work from footage, and then fly a virtual camera around it.
I'm not up on how things have changed recently
For example, the camera orbits around the performers in this music video are difficult to imagine in real space. Even if you could pull it off using robotic motion control arms, it would require that the entire choreography is fixed in place before filming. This video clearly takes advantage of being able to direct whatever camera motion the artist wanted in the 3d virtual space of the final composed scene.
To do this, the representation needs to estimate the radiance field, i.e. the amount and color of light visible at every point in your 3d volume, viewed from every angle. It's not possible to do this at high resolution by breaking that space up into voxels; those scale badly, O(n^3) - a 1024^3 grid is already over a billion cells. You could attempt to guess at some mesh geometry and paint textures onto it compatible with the camera views, but that's difficult to automate.
Gaussian splatting estimates these radiance fields by assuming that the radiance is built from millions of fuzzy, colored balls positioned, stretched, and rotated in space. These are the Gaussian splats.
Once you have that representation, constructing a novel camera angle is as simple as positioning and angling your virtual camera and then recording the colors and positions of all the splats that are visible.
It turns out that this approach is pretty amenable to techniques similar to modern deep learning. You basically train the positions/shapes/rotations of the splats via gradient descent. It's mostly been explored in research labs but lately production-oriented tools have been built for popular 3d motion graphics tools like Houdini, making it more available.
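Here's a heavily simplified sketch of that "record the visible splats" step: transform splats into camera space, sort by depth, and alpha-composite front to back. It treats each splat as a projected dot and ignores the anisotropic footprints and tiling real rasterizers use; all the data below is a random stand-in.

    import numpy as np

    def composite_splats(centers, colors, alphas, cam_R, cam_t):
        """Front-to-back alpha compositing of splats (all assumed to hit one pixel)."""
        cam_space = centers @ cam_R.T + cam_t      # world -> camera coordinates
        order = np.argsort(cam_space[:, 2])        # nearest splats first
        out, transmittance = np.zeros(3), 1.0
        for i in order:
            out += transmittance * alphas[i] * colors[i]
            transmittance *= 1.0 - alphas[i]
        return out

    centers = np.random.randn(100, 3)              # random stand-in scene
    colors = np.random.rand(100, 3)
    alphas = np.random.rand(100) * 0.5
    print(composite_splats(centers, colors, alphas, np.eye(3), np.array([0.0, 0.0, 5.0])))

Real implementations rasterize the full elliptical footprint of each splat per screen tile on the GPU, but the sort-and-blend idea is the same.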
I would say it's a 3D photo, not a 3D video. But there are already extensions to dynamic scenes with movement.
https://www.realsenseai.com/products/real-sense-depth-camera...
That said, I don't think splats are to voxels as pixels are to vector graphics. Maybe a closer analogy is that pixels are to vectors as voxels are to 3D mesh modeling. You might imagine a sophisticated animated character being created and then animated using motion capture techniques.
But notice where these things fall apart, too. SVG shines when it's not just estimating the true form, but literally is it (fonts, simplified graphics made from simple strokes). If you try to estimate a photo using SVG it tends to get messy. Similar problems arise when reconstructing a 3d mesh from real-world data.
I agree that splats are a bit like pixels, though. They're samples of color and light in 3d (2d) space. They represent the source more faithfully when they're more densely sampled.
The difference is that a splat is sampled irregularly, just where it's needed within the scene. That makes it more efficient at representing most useful 3d scenes (i.e., ones where there are a few subjects and objects in mostly empty space). It just uses data where that data has an impact.
This includes sparse areas like fences, vegetation and the likes, but more importantly any material properties like reflections, specularity, opacity, etc.
Here's a few great examples: https://superspl.at/view?id=cf6ac78e
BTW I believe there is software that can turn point clouds into textured meshes reliably; multiple techniques even, depending on what your goals are.
It works well for what it does. But, it's mostly only effective for opaque, diffuse, solid surfaces. It can't handle transparency, reflection or "fuzz". Capturing material response is possible, but requires expensive setups.
A scene like this poodle https://superspl.at/view?id=6d4b84d3 or this bee https://superspl.at/view?id=cf6ac78e would be pretty much impossible with photogrammetry and very difficult with manual, traditional, polygon workflows. Those are not videos. Spin them around.
1. You generate the point clouds from multiple images of a scene or an object and some machine learning magic.
2. Replace each point of the point cloud with a fuzzy ellipsoid that has a bunch of parameters for its position + size + orientation + view-dependent color (via spherical harmonics up to some low order).
3. If you render these ellipsoids using a differentiable renderer, then you can subtract the resulting image from the ground truth (i.e. your original photos), and calculate the partial derivatives of the error with respect to each of the millions of ellipsoid parameters that you fed into the renderer.
4. Now you can run gradient descent using the differentiable renderer, which makes your fuzzy ellipsoids converge to something closely reproducing the ground truth images (from multiple angles).
5. Since the ellipsoids started at the 3D point cloud's positions, the 3D structure of the scene will likely be preserved during gradient descent, thus the resulting scene will support novel camera angles with plausible-looking results.
Now, perhaps referring to differentiability isn't layperson-accessible, but this is HN after all. I found it to be the perfect degree of simplification personally.
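If anyone wants to poke at steps 3-5 concretely, here's a toy 1D analogue: a couple of Gaussians get nudged by gradient descent until their "render" matches a target signal. Finite differences stand in for the differentiable renderer, and everything here is purely illustrative rather than how real 3DGS trainers work.

    import numpy as np

    xs = np.linspace(0.0, 1.0, 200)
    target = np.exp(-((xs - 0.3) / 0.05) ** 2) + 0.5 * np.exp(-((xs - 0.7) / 0.1) ** 2)

    def render(params):                      # "renderer": sum of 1D Gaussian blobs
        out = np.zeros_like(xs)
        for mu, sigma, amp in params.reshape(-1, 3):
            out += amp * np.exp(-((xs - mu) / sigma) ** 2)
        return out

    def loss(params):                        # compare render to the "ground truth"
        return np.mean((render(params) - target) ** 2)

    params = np.array([0.2, 0.1, 0.5,        # blob 1: mu, sigma, amp (rough init)
                       0.8, 0.1, 0.5])       # blob 2
    lr, eps = 0.01, 1e-5
    print("before:", loss(params))
    for _ in range(5000):                    # gradient descent via finite differences
        grad = np.array([(loss(params + eps * np.eye(6)[i]) - loss(params)) / eps
                         for i in range(6)])
        params -= lr * grad
    print("after: ", loss(params))           # should be noticeably smaller than before

In real 3DGS the renderer provides analytic gradients on the GPU and the Gaussians also get split, cloned, and pruned during training, but the optimization loop is conceptually the same.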
How about this:
Take a lot of pictures of a scene from different angles, do some crazy math, and then you can later pretend to zoom and pan the camera around however you want
How hard is it to handle cases where the starting positions of the ellipsoids in 3D are not correct (too far off)? How common is such a scenario with the state of the art? E.g., with only a stereoscopic image pair, the correspondences are often inaccurate.
Thanks.
https://x.com/RadianceFields alt: https://xcancel.com/RadianceFields
Is it a fully connected NN?
I think this tech has become "production-ready" recently due to a combination of research progress (the seminal paper was published in 2023 https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/) and improvements to differentiable programming libraries (e.g. PyTorch) and GPU hardware.
Did the Gaussian splatting actually make it any cheaper? Especially considering that it needed 50+ fixed camera angles to splat properly, and extensive post-processing work both computationally and human labour, a camera drone just seems easier.
This tech is moving along at breakneck pace and now we're all talking about it. A drone video wouldn't have done that.
There’s no proof of your claim and this video is proof of the opposite.
Volumetric capture like this allows you to decide on the camera angles in post-production
This approach is 100% flexible, and I'm sure at least part of the magic came from the process of play and experimentation in post.
This is a “Dropbox is just ftp and rsync” level comment. There’s a shot in there where Rocky is sitting on top of the spinning blades of a helicopter and the camera smoothly transitions from flying around the room to solidly rotating along with the blades, so it’s fixed relative to Rocky. Not only would programming a camera drone to follow this path be extremely difficult (and it wouldn’t look as good), but just setting up the stunt would be cost prohibitive.
This is just one example of the hundreds you could come up with.
They would look much better in a very "familiar" way. They would have much less of the glitch and dynamic aesthetic that makes this so novel.
And it's not always giving in to those voices, sometimes it's going in the opposite direction specifically to subvert those voices and expectations even if that ends up going against your initial instincts as an artist.
With someone like A$AP Rocky, there is a lot of money on the line wrt the record execs but even small indie artists playing to only a hundred people a night have to contend with audience expectation and how that can exert an influence on their creativity.
I don’t disagree with you—I felt “Tailor Swif,” “DMB,” and “Both Eyes Closed” were all stronger than the tracks that made it onto this album.
But sometimes you’ve gotta ship the project in the state it’s in and move on with your life.
Maybe now he can move forward and start working on something new. And perhaps that project will be stronger.
If I was in his position I’d probably be doing the same. Why bother with another top hit that pleases the masses.
No, it’s simply the framerate.
I’m curious what other artists end up making with it.
This is clearly an artistic statement, whether you like the art or not. A ton of thought and time was put into it. And people will likely be thinking and discussing this video for some time to come.
A good example are those Subway Surfers split screen videos, where someone is babbling about nothing in one frame, but the visuals in the other keep people watching.
Another example is AI-narrated “news” on YouTube. Nobody would normally listen to an AI voice read AI slop, but if there are some extreme video clips quickly switching every few seconds, people don’t immediately click away.
Brain rot shreds the attention span and uses all kinds of psychological tricks to keep people engaged. In the Helicopter video, every second is packed with visual information, not to contribute to the narrative but to capture attention. The backgrounds are full of details. The camera never stops moving. The subjects depicted are even attention grabbing: police lights, dancing people, guns, car crashes, flamethrowers! Hey, does that guy have pink curlers in his hair?
It’s not that I don’t like it (I kinda do), but a media diet of that kind of content is bad for the brain.
I'm David Rhodes, Co-founder of CG Nomads, developer of GSOPs (Gaussian Splatting Operators) for SideFX Houdini. GSOPs was used in combination with OTOY OctaneRender to produce this music video.
If you're interested in the technology and its capabilities, learn more at https://www.cgnomads.com/ or AMA.
Try GSOPs yourself: https://github.com/cgnomads/GSOPs (example content included).
>Evercoast deployed a 56 camera RGB-D array
Do you know which depth cameras they used?
I recommend asking https://www.linkedin.com/in/benschwartzxr/ for accuracy.
So likely RealSense D455.
EDIT: I realize a phone is not on the same level as a RED camera, but I just saw iPhones as a massively cheaper option to the alternatives in the field I worked in.
And when I think back to another iconic hip hop video (iconic for that genre) where they used practical effects and military helicopters chasing speedboats in the waters off of Santa Monica... I bet they had change to spare.
But yes, you can easily use iPhones for this now.
Check this project, for example: https://zju3dv.github.io/freetimegs/
Unfortunately, these formats are currently closed behind cloud processing, so adoption is rather low.
Before Gaussian splatting, textured mesh caches would be used for volumetric video (e.g. Alembic geometry).
https://developer.apple.com/documentation/spatial/
Edit: As I'm digging, this seems to be focused on stereoscopic video as opposed to actual point clouds. It appears applications like cinematic mode use a monocular depth map, and their lidar outputs raw point cloud data.
You just can't see the back of a thing by knowing the shape of the front side with current technologies.
The depth map stored for image processing is image metadata, meaning it calculates one depth per pixel from a single position in space. Note that it doesn't have the ability to measure that many depth values, so it measures what it can using LIDAR and focus information and estimates the rest.
On the other hand, a point cloud is not image data. It isn't necessarily taken from a single position; in theory the device could be moved around to capture additional angles, and the result is a sparse point cloud of depth measurements. Also, raw point cloud data doesn't necessarily come tagged with point metadata such as color.
I also note that these distinctions start to vanish when dealing with video or using more than one capture device.
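For the single-device case, the relationship between the two is just a pinhole back-projection: one depth per pixel plus the camera intrinsics gives you a camera-space point cloud. A rough sketch (the intrinsics below are invented, not any particular phone's):

    import numpy as np

    def depth_to_points(depth, fx, fy, cx, cy):
        """Back-project a per-pixel depth map into camera-space 3D points."""
        h, w = depth.shape
        us, vs = np.meshgrid(np.arange(w), np.arange(h))
        x = (us - cx) * depth / fx
        y = (vs - cy) * depth / fy
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    depth = np.full((480, 640), 2.0)        # fake depth map: a flat wall 2 m away
    points = depth_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print(points.shape)                     # (307200, 3) - one point per pixel

Merging several devices (or a moving device) additionally needs each capture's pose, which is exactly where those distinctions start to blur.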
So you have to have a minimum of two for the front and back of a dancer. Actually, the seams are kind of dubious, so let's say three, 120 degrees apart. Well, we need ones looking down as well as up for baggy clothing, so more like nine, 30 degrees apart vertically and 120 degrees horizontally, ...
and this spirals far enough that installing a few dozen identical non-Apple cameras in a monstrous sci-fi cage starts making a lot more sense than an iPhone, for a video.
There are usually six sides on a cube, which means you need a minimum of six iPhones around an object to capture all sides of it and be able to then freely move around it. You might as well seek open-source alternatives rather than relying on Apple surprise boxes for that.
In cases where your subject is static, such as a building, you can wave a single iPhone around and get a result comparable to more expensive rigs, of course.
Can you add any interesting details on the benchmarking done against the RED camera rig?
I assume they stuck with RealSense for proper depth maps. However, those are limited to about a 6 meter range, and their depth imaging can't resolve features smaller than their native resolution allows (it gets worse beyond 3 m too, as there is less and less parallax, among other issues). I wonder how they approached that as well.
How did you find out this was posted here?
Also, great work!
And thank you!
(I'm not the author.)
You can train your own splats using Brush or OpenSplat
You're right that you can intentionally under-construct your scenes. This can create a dream-like effect.
It's also possible to stylize your Gaussian splats to produce NPR effects. Check out David Lisser's amazing work: https://davidlisser.co.uk/Surface-Tension.
Additionally, you can intentionally introduce view-dependent ghosting artifacts. In other words, if you take images from a certain angle that contain an object, and remove that object for other views, it can produce a lenticular/holographic effect.
If you don't know already, you need to leverage this. HN is one of the biggest channels of engineers and venture capitalists on the internet. It's almost pure signal (minus some grumpy engineer grumblings - we're a grouchy lot sometimes).
Post your contact info here. You might get business inquiries. If you've got any special software or process in what you do, there might be "venture scale" business opportunities that come your way. Certainly clients, but potentially much more.
(I'd certainly like to get in touch!)
--
edit: Since I'm commenting here, I'll expand on my thoughts. I've been rate limited all day long, and I don't know if I can post another response.
I believe volumetric is going to be huge for creative work in the coming years.
Gaussian splats are a huge improvement over point clouds and NeRFs in terms of accessibility and rendering, but the field has so many potential ways to evolve.
I was always in love with Intel's "volume", but it was impractical [1, 2] and got shut down. Their demos are still impressive, especially from an equipment POV, but A$AP Rocky's music video is technically superior.
During the pandemic, to get over my lack of in-person filmmaking, I wrote Unreal Engine shaders to combine the output of several Kinect point clouds [3] to build my own lightweight version inspired by what Intel was doing. The VGA resolution of consumer volumetric hardware was a pain, and I was faced with FPGA solutions for higher real-time resolution, or going 100% offline.
World Labs and Apple are doing exciting work with image-to-Gaussian models [4, 5], and World Labs created the fantastic Spark library [6] for viewing them.
I've been leveraging splats to do controllable image gen and video generation [7], where they're extremely useful for consistent sets and props between shots.
I think the next steps for Gaussian splats are good editing tools, segmenting, physics, etc. The generative models are showing a lot of promise too. The Hunyuan team is supposedly working on a generative Gaussian model.
[1] https://www.youtube.com/watch?v=24Y4zby6tmo (film)
[2] https://www.youtube.com/watch?v=4NJUiBZVx5c (hardware)
[3] https://www.twitch.tv/videos/969978954?collection=02RSMb5adR...
[4] https://www.worldlabs.ai/blog/marble-world-model
[5] https://machinelearning.apple.com/research/sharp-monocular-v...
[7] https://github.com/storytold/artcraft (in action: https://www.youtube.com/watch?v=iD999naQq9A or https://www.youtube.com/watch?v=f8L4_ot1bQA )
The most expensive part of Gaussian splatting is depth sorting.
Second, it's very motivating to read this! My background is in video game development (only recently transitioning to VFX). My dream is to make a Gaussian splatting content creation and game development platform with social elements. One of the most exciting aspects of Gaussian splatting is that it democratizes high quality content acquisition. Let's make casual and micro games based on the world around us and share those with our friends and communities.
Superman is what comes to mind for this
However, surface-based constraints can prevent thin surfaces (hair/fur) from reconstructing as well as vanilla 3DGS. It might also inhibit certain reflections and transparency from being reconstructed as accurately.
It's also possible to splat from textured meshes directly, see: https://github.com/electronicarts/mesh2splat. This approach yields high quality, PBR compatible splats, but is not quite as efficient as a traditional training workflow. This approach will likely become mainstream in third party render engines, moving forward.
Why do this?
1. Consistent, streamlined visuals across a massive ecosystem, including content creation tools, the web, and XR headsets.
2. High fidelity, compressed visuals. With SOGs compression, splats are going to become the dominant 3D representation on the web (see https://superspl.at).
3. E-commerce (product visualizations, tours, real estate, etc.)
4. Virtual production (replace green screens with giant LED walls).
5. View-dependent effects without (traditional) shaders or lighting.
It's not just about the aesthetic, it's also about interoperability, ease of use, and the entire ecosystem.
Do you have any benchmarks on the geometric precision of these reproductions?
Geometric analysis for Gaussian splatting is a bit like comparing apples and oranges. Gaussian splats are not really discrete geometry, and their power lies in overlapping semi-transparent blobs. In other words, their benefit is as a radiance field and not as a surface representation.
However, assuming good camera alignment and real world scale enforced at the capture and alignment steps, the splats should match real world units quite closely (mm to cm accuracy). See: https://www.xgrids.com/intl?page=geomatics.
I can see that relighting is still a work in progress, as the virtual spotlights tend to look flat and fake. I understand that you are just making the splats that fall inside the spotlight cone brighter and darkening the ones occluded behind lots of splats.
Do you know if there are plans for Gaussian splats to capture unlit albedo, roughness and metalness, so we can relight in a more realistic manner?
Also, environment radiosity doesn't seem to translate to the splats, am I right?
Thanks
There are many ways to relight Gaussian splats. However, the highest quality results are currently coming from raytracing/path tracing render engines (such as Octane and VRay), with 2D diffusion models in second place. Relighting with GSOPs nodes does not yield as high quality, but can be baked into the model and exported elsewhere. This is the only approach that stores the relit information in the original splat scene.
That said, you are correct that in order to relight more accurately, we need material properties encoded in the splats as well. I believe this will come sooner than later with inverse rendering and material decomposition, or technology like Beeble Switchlight (https://beeble.ai). This data can ultimately be predicted from multiple views and trained into the splats.
"Also, environment radiosity doesnt seem to translate to the splats, am I right?"
Splats do not have their own radiosity in that sense, but if you have a virtual environment, its radiosity can be translated to the splats.
The majority of wait time was the cinematographer lighting each scene. I imagined a workflow where secondary digital cameras captured 3D information, and all lighting took place in post production. Film productions hemorrhage money by the second; this would be a massive cost saving.
I described this idea to a venture capitalist friend, who concluded one already needed to be a player to pull this off. I mentioned this to an acquaintance at Pixar (a logical player) and they went silent.
Still, we don't shoot movies this way. Not there yet...
Seems like a really cool technology, though.
I wonder if anyone else got the same response, or it's just me.
A shame that kid was slept on. Allegedly (according to discord) he abandoned this because so many artists reached out to have him do this style of mv, instead of wanting to collaborate on music.
Put another way, is this a scientific comparison or an artistic comparison?
Well yes, the visuals are awesome, while the music… isn’t.
Llainwire was my top artist listens throughout 2023, so it’s always funny to bump into reactions that feel totally different from my world/my peers.
Great job, Chris and crew!
I would have refused to work on this.
It's only a matter of time until the #1 hit figures out how to make this work
If you're looking for bars sans pussy, read on: https://www.youtube.com/watch?v=yKifJ4Q5ph0
It certainly moves around a lot!
It certainly looks like the tech and art style here are inseparable. Not only did the use of Gaussian splats make such extreme camera movement possible, it can be argued that it made it necessary.
Pause the video and notice the blurriness and general lack of detail. But the frantic motion doesn't let the viewer focus on these details, and most of them are hidden by a copious amount of motion blur anyway.
To me it is typical of demos, both as in the "demoscene" and "tech demo" sense, where the art style is driven by the technology, insisting on what it enables, while at the same time working around its shortcomings. I don't consider it a bad thing of course, it leads to a lot of creativity and interesting art styles.
Sadly, much of the demoscene is in a bit of a navel-gazing retro-computing phase; many of those active today are "returners" from the C64 and Amiga eras, whilst PC sceners of the 90s dropped off for money, games or kids.
It's also the sheer work effort: the demoscene in the 90s and early 00s could focus on rendering while visual-art pipelines didn't matter as much. As graphics cards got better, it was obvious that the scene was falling behind cutting-edge games (both in assets, due to the workload, and in the hacks required to make graphics cards render realistically).
The introduction and early popularization of SDF rendering turned the scene a bit more relevant again, but it's also been masking a certain lack of good artists since programmers could create nice renderings without needing assets.
However, matching something like this video in creativity would require a lot of asset workload (and non-trivial rendering), and that combo is not really that common today, sadly.
Funnily enough, I was actually discussing Gaussian splatting as a solution for more "asset heavy" demos about a year ago with another scener friend, but sadly there's a tad of a cultural stigma, as NN/"AI" methods have been fairly controversial within the scene; aside from programmers there are both visual and music artists, and among those camps it's not really a popular thing.
It's still mostly a method though, and SDF rendering + GS could in the end be a saviour in disguise, letting the scene go beyond just rendering and bring back a bit more storytelling.
The scene has always been a ridiculously conservative bunch. Back when 3dfx was new, using 3D acceleration was similarly controversial. The Pouet comments were scarily similar to those today. All we need is a few demos that actually use these technologies with great results (instead of for laziness/slop), and the majority opinion will shift, as it always has.
Problem was that fixed pipelines were seriously limiting. Most of the cool effects of software rendering couldn't be done, and the lack of direct access to the hardware meant that you couldn't do many of the hardware tricks the demoscene is known for. It doesn't mean people couldn't be creative, but it was mostly limited to doing interesting 3D geometry and textures. Things started changing with the advent of shaders.
About AI, I think the demoscene is rather welcoming of "AI" as long as you use it creatively instead of doing what everyone else does (slop). On the topic of Gaussian splatting, look at the Revision 2025 invitation [1]; there is a pretty cool scene featuring it, and people loved that part.
In the near term, it could be very useful for sports replays. The UFC has this thing where they stitch together sequences of images from cameras all around the ring, to capture a few seconds of '360 degree' video of important moments. It looks horrible, this would be a huge improvement.
So you decide to take lots and lots of photos from every single angle possible, but you need a way to link these all together, so you decide that each centre point of an image is a "Gaussian". These splat everywhere.
Now you have taken all of these photos and you can explore the image in Fortnite, because you took thousands of images of every possible view!
But what if you didn't want to just look at a frozen landscape in Fortnite, and instead wanted to use a man dancing in your new upcoming YouTube video called Helicopter?
If you isolate this person (say by taking all these photos on a green screen), you now have a 3D-like recorded model you can reshoot, and place the "scene" on top of something else (like a 3D diorama, as in your video!)
https://www.youtube.com/watch?v=Tnij_xHEnXc
Whenever I see Gaussian, I think of the Gauss gun from Half Life 2
Watching this Helicopter music video made me recall a scene in Money for Nothing by Dire Straits, which was famously the first music video to air on MTV Europe (when the MTV phenomenon launched there in the 80s). It used 3D animation for the human characters and was considered groundbreaking animation at the time. The irony is that we knew it was computer generated, but now human-generated is indistinguishable from AI.
The music video is a mix of creative and technical genius from several different teams, and it's ultimately a byproduct of building tooling to capture reality once and reuse it downstream.
There’s a lot more to explore here. Once you have grounded 4D human motion and appearance tied to a stable world coordinate system, it becomes a missing primitive for things like world models, simulation, and embodied AI, where synthetic or purely parametric humans tend to break down.
I've been developing a solution to make the cost of 4D capture an order of magnitude cheaper by using a small number of off-the-shelf cameras. Here's the proof-of-concept demo using 4x GoPros: https://youtube.com/shorts/Y56l0FlLlAg (yes, lots of room to improve quality). You can also see the interactive version (with XR support) at https://gaussplay.lovable.app