Carnegie Mellon University
July 10, 2020

New System Combines Smartphone Videos To Create 4D Visualizations

Carnegie Mellon University approach requires neither studio nor specialized cameras

By Byron Spice

Media contacts:
Byron Spice, School of Computer Science, 412-268-9068
Virginia Alvino Young, School of Computer Science, 412-268-8356

Researchers at Carnegie Mellon University have demonstrated that they can combine iPhone videos shot "in the wild" by separate cameras to create 4D visualizations that allow viewers to watch action from various angles, or even erase people or objects that temporarily block sight lines.

Imagine a visualization of a wedding reception, where dancers can be seen from as many angles as there were cameras, and the tipsy guest who walked in front of the bridal party is nowhere to be seen.

The videos can be shot independently from a variety of vantage points, as might occur at a wedding or birthday celebration, said Aayush Bansal, a Ph.D. student in CMU's Robotics Institute. It also is possible to record actors in one setting and then insert them into another, he added.

"We are only limited by the number of cameras," Bansal said, with no upper limit on how many video feeds can be used.

Bansal and his colleagues presented their 4D visualization method at the Conference on Computer Vision and Pattern Recognition (CVPR), which was held virtually this year.


Video: The combined videos create a "virtual camera" that enables the viewer to look at the same scene from different angles, remove people from the scene or add people to a new scene.

"Virtualized reality" is nothing new, but in the past it has been restricted to studio setups, such as CMU's Panoptic Studio, which boasts more than 500 video cameras embedded in its geodesic walls. Fusing visual information of real-world scenes shot from multiple, independent, handheld cameras into a single comprehensive model that can reconstruct a dynamic 3D scene simply hasn't been possible.

Bansal and his colleagues worked around that limitation by using convolutional neural networks (CNNs), a type of deep learning program that has proven adept at analyzing visual data. They found that scene-specific CNNs, fit to an individual scene rather than trained once for all scenes, could be used to compose different parts of the scene.
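As a rough illustration of what "scene-specific" means here, the following is a minimal sketch, assuming a PyTorch-style setup; the class name, channel counts and input format are hypothetical and are not taken from the CMU paper. The idea is simply that a small convolutional network is fit to one scene's multi-camera footage and used to render that scene, rather than being trained once for all scenes.

```python
# Minimal, hypothetical sketch of a scene-specific convolutional network.
# Nothing here reproduces the CMU system; it only illustrates the idea of
# fitting one small CNN per scene.
import torch
import torch.nn as nn

class SceneSpecificCNN(nn.Module):
    """Maps a per-view feature map (e.g., reprojected colors plus depth)
    to an RGB image for a novel viewpoint of one particular scene."""
    def __init__(self, in_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, kernel_size=3, padding=1),  # RGB output
        )

    def forward(self, view_features: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps rendered pixel values in [0, 1].
        return torch.sigmoid(self.net(view_features))

# One network per scene: this instance would be fit only to the footage
# of a single event (a wedding, a dance) and then queried for new views.
model = SceneSpecificCNN()
view_features = torch.randn(1, 4, 128, 128)  # batch of per-view features
novel_view = model(view_features)            # (1, 3, 128, 128) image
```

Fitting a separate network to each scene trades generality for fidelity: each network only has to model a single scene, which plausibly makes composition from a modest number of handheld cameras tractable.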

The CMU researchers demonstrated their method using up to 15 iPhones to capture a variety of scenes — dances, martial arts demonstrations and even flamingos at the National Aviary in Pittsburgh.

"The point of using iPhones was to show that anyone can use this system," Bansal said. "The world is our studio."

The method also unlocks a host of potential applications in the movie industry and consumer devices, particularly as the popularity of virtual reality headsets continues to grow.

Though the method doesn't necessarily capture scenes in full 3D detail, the system can limit playback angles so that incompletely reconstructed areas are not visible, preserving the illusion of 3D imagery.

In addition to Bansal, the research team included Robotics Institute faculty members Yaser Sheikh, Deva Ramanan and Srinivasa Narasimhan, as well as Minh Vo, a former Ph.D. student who now works at Facebook Reality Labs. The National Science Foundation, the Office of Naval Research and Qualcomm supported this research.