Also, can it run on Apple silicon?
> Furthermore, by leveraging tools like MapAnything to generate metric points, ShapeR can even produce metric 3D shapes from monocular images without retraining.
It's designed to run on top of a SLAM system that outputs a sparse point cloud.
On page 4, top right, you can see how the point cloud is used to feed the object generator: https://cdn.jsdelivr.net/gh/facebookresearch/ShapeR@main/res...
That means it doesn’t need depth. Depth is helpful for getting good point locations, but SLAM on multiple frames should also work.
I'm guessing they are researching this for AR or robot navigation; otherwise the focus on accurately dividing the scene into objects wouldn't make sense to me.
Segmentation in 2D is mostly a solved problem (Segment Anything is pretty fucking great), and segmentation in 3D is also fairly well along: you can use DINOv2 to do 3D object detection and segmentation.
The difficult part _after_ that is interacting with the object. Sparse and semi-dense point clouds can be generated and refined in real time, but they are point clouds, not meshes. That means accurately interacting with the object is super hard, because it's not a simple mesh that can be tested against or interacted with; it's just a bunch of points scattered around the surface.
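To make that concrete, here's a toy sketch of my own (not from the paper, just numpy): a solid known as a mesh-like volume supports an exact "is this point inside the object" query, while a surface-sampled point cloud of the same object only supports "am I near some sample", which fails for interior points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A unit cube represented two ways.
# 1) As a solid volume (what a closed mesh gives you): exact inside test.
def inside_box(p):
    return bool(np.all((p >= 0.0) & (p <= 1.0)))

# 2) As a sparse point cloud sampled only on its surface (what SLAM gives you).
n = 500
faces = rng.integers(0, 6, n)          # pick a random face per sample
pts = rng.random((n, 3))
for axis in range(3):
    pts[faces == 2 * axis, axis] = 0.0      # snap to the "low" face
    pts[faces == 2 * axis + 1, axis] = 1.0  # snap to the "high" face

def near_cloud(p, cloud, tol=0.1):
    # The only cheap query a raw cloud supports: nearest-sample distance.
    return bool(np.min(np.linalg.norm(cloud - p, axis=1)) < tol)

center = np.array([0.5, 0.5, 0.5])
print(inside_box(center))       # True: the solid knows its interior
print(near_cloud(center, pts))  # False: the cloud is hollow inside the object
```

Every surface sample is at least 0.5 away from the cube's center, so a distance threshold can't tell "inside the bottle" from "half a meter away from it"; that's the gap a reconstructed mesh closes.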
Where this is useful is that it lets you generate a mostly plausible, simple 3D model that can act as a stand-in for any further interactions. In VR you can use it as a collision object for physics; in robotics you can use it to plan interactions (e.g., placing objects on the table).
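As a minimal sketch of the collision stand-in idea (my own toy code, not ShapeR's): take the vertices of a hypothetical reconstructed mesh, wrap them in an axis-aligned bounding box, and run a slab-method ray test against it, which is the kind of query a physics or planning loop actually needs.

```python
import numpy as np

def aabb(vertices):
    """Axis-aligned bounding box of a vertex array."""
    v = np.asarray(vertices, float)
    return v.min(axis=0), v.max(axis=0)

def ray_hits_aabb(origin, direction, lo, hi):
    """Slab method: clip the ray against each pair of axis-aligned planes."""
    d = np.where(direction == 0.0, 1e-12, direction)  # avoid division by zero
    t1 = (lo - origin) / d
    t2 = (hi - origin) / d
    tmin = np.max(np.minimum(t1, t2))   # latest entry across all slabs
    tmax = np.min(np.maximum(t1, t2))   # earliest exit across all slabs
    return bool(tmax >= max(tmin, 0.0))

# Hypothetical vertices of a reconstructed object, sitting on a table.
verts = np.array([[0.4, 0.0, 0.4], [0.6, 0.0, 0.4],
                  [0.4, 0.3, 0.6], [0.6, 0.3, 0.6]])
lo, hi = aabb(verts)

down = np.array([0.0, -1.0, 0.0])
print(ray_hits_aabb(np.array([0.5, 1.0, 0.5]), down, lo, hi))  # True: over the object
print(ray_hits_aabb(np.array([2.0, 1.0, 2.0]), down, lo, hi))  # False: misses it
```

A real engine would test the mesh triangles (or a convex hull) instead of a box, but even this crude proxy is something you simply can't get from raw surface points.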
It's also a step in the direction of answering "whose" object it is, rather than "what" the object is. "Whose water bottle is this" is much, much harder for machines to answer (without markers) than "is this a water bottle" or "where is the water bottle in this scene".