03.11.15

System automatically converts 2D content to 3D

AUTHOR: Inavate

By harnessing software that powers sports video games, researchers at MIT and the Qatar Computing Research Institute (QCRI) have developed a system that automatically converts 2D video of football games into 3D. The converted video can be played back over any 3D device — a commercial 3D TV, or VR headsets such as Oculus Rift or Google Cardboard.

While their focus was on developing the technology to serve a specific sport, tackling this challenge could lead to a wider rollout of the technology for other purposes. “Any TV these days is capable of 3D, there’s just no content,” said Wojciech Matusik, an associate professor of electrical engineering and computer science at MIT.
“We see that the production of high-quality content is the main thing that should happen. With movies, you have artists who paint the depth map. Here, there is no luxury of hiring 100 artists to do the conversion. This has to happen in real-time. What we have noticed is that we can leverage video games.”

Modern video games generally store a detailed 3D map of the game’s virtual environment and update it during gameplay. On the fly, the game generates a 2D projection of the 3D scene that corresponds to a particular viewing angle. The researchers ran this process in reverse. They set the Electronic Arts game “FIFA13” to play over and over again, and used Microsoft’s game analysis tool PIX to continuously store screen shots of the action. For each screen shot, they also extracted the corresponding 3D map. Using a standard algorithm for gauging the difference between two images, they winnowed out most of the screen shots, keeping just those that best captured the range of possible viewing angles and player configurations that the game presented; the total number of screen shots still ran to the tens of thousands.
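The winnowing step can be pictured with a short sketch. The article does not name the difference algorithm or the selection rule the researchers used, so the mean absolute pixel difference, the greedy selection and the threshold below are all assumptions chosen for illustration, as are the function names.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equal-length grayscale images.

    Images are flat lists of pixel values. This metric is an assumption;
    the article only mentions "a standard algorithm".
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def winnow(screenshots, threshold):
    """Greedy filter (hypothetical): keep a screenshot only if it differs
    from every screenshot already kept by more than `threshold`.

    Returns the indices of the screenshots that survive.
    """
    kept = []
    for idx, shot in enumerate(screenshots):
        if all(mean_abs_diff(shot, screenshots[k]) > threshold for k in kept):
            kept.append(idx)
    return kept


# Toy 4-pixel "screenshots": two near-duplicate pairs.
shots = [
    [0, 0, 0, 0],
    [1, 0, 0, 1],      # almost identical to the first
    [10, 10, 10, 10],
    [10, 11, 10, 10],  # almost identical to the third
]
kept = winnow(shots, threshold=2.0)
print(kept)
```

On this toy input, one representative of each near-duplicate pair survives, which is the effect the article describes at a much larger scale.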

Then they stored each screen shot and the associated 3D map in a database. For every frame of 2D video of an actual football game, the system looks for the 10 or so screen shots in the database that best correspond to it. Then it decomposes all those images, looking for the best matches between smaller regions of the video feed and smaller regions of the screen shots. Once found, it superimposes the depth information from the screen shots on the corresponding sections of the video feed and stitches them back together. The result is a convincing 3D effect, with no visual artefacts.
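The matching pipeline described above can be sketched in a few lines. This is a simplified illustration, not the researchers' implementation: it assumes grayscale images stored as 2D lists, uses the sum of squared differences as the similarity measure, compares regions only at the same position in each image, and copies depth blocks directly without any smoothing at the seams. The function names and the database layout are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-sized 2D grids."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def block(grid, top, left, size):
    """Extract a size x size region whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def estimate_depth(frame, database, k=10, size=2):
    """Build a depth map for `frame` from stored screenshot/depth pairs.

    `database` is a list of dicts with "image" and "depth" grids of the
    same dimensions as `frame` (an assumed layout).
    """
    # Step 1: retrieve the k screenshots most similar to the whole frame.
    candidates = sorted(database, key=lambda e: ssd(frame, e["image"]))[:k]

    height, width = len(frame), len(frame[0])
    depth = [[0.0] * width for _ in range(height)]

    # Step 2: for each region, pick the candidate whose region matches best.
    for top in range(0, height, size):
        for left in range(0, width, size):
            target = block(frame, top, left, size)
            best = min(
                candidates,
                key=lambda e: ssd(target, block(e["image"], top, left, size)),
            )
            # Step 3: copy that candidate's depth values into the output.
            for dy, row in enumerate(block(best["depth"], top, left, size)):
                for dx, value in enumerate(row):
                    depth[top + dy][left + dx] = value
    return depth


# Toy 2x4 example with three stored screenshots.
database = [
    {"image": [[0, 0, 9, 9], [0, 0, 9, 9]], "depth": [[1, 1, 1, 1], [1, 1, 1, 1]]},
    {"image": [[5, 5, 0, 0], [5, 5, 0, 0]], "depth": [[2, 2, 2, 2], [2, 2, 2, 2]]},
    {"image": [[9, 9, 9, 9], [9, 9, 9, 9]], "depth": [[3, 3, 3, 3], [3, 3, 3, 3]]},
]
frame = [[0, 0, 0, 0], [0, 0, 0, 0]]
depth_map = estimate_depth(frame, database, k=2, size=2)
print(depth_map)
```

In the toy example the left half of the frame takes its depth from one screenshot and the right half from another, which mirrors the idea of assembling a depth map from the best-matching pieces of several candidates.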

The system takes about a third of a second to process a frame of video, but successive frames could all be processed in parallel, so that the third-of-a-second delay needs to be incurred only once. A broadcast delay of a second or two would probably provide an adequate buffer to permit conversion on the fly, but the researchers are working to bring the conversion time down still further.
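The arithmetic behind that claim can be checked with a small sketch. The frame rate of 30 frames per second is an assumption, since the article does not state one, and the function names are hypothetical. The point is that if each frame starts processing as soon as it arrives, every frame leaves the system the same third of a second late, provided enough frames can be processed at once.

```python
from fractions import Fraction
import math


def workers_needed(processing_time, fps):
    """Number of frames that must be in flight at once to keep up with the feed."""
    return math.ceil(processing_time * fps)


def output_times(n_frames, processing_time, fps):
    """Time at which each converted frame is ready, if frame i arrives at
    i / fps seconds and starts processing immediately on a free worker."""
    return [Fraction(i, fps) + processing_time for i in range(n_frames)]


PROCESSING = Fraction(1, 3)  # seconds per frame, from the article
FPS = 30                     # assumed frame rate, not stated in the article

workers = workers_needed(PROCESSING, FPS)
times = output_times(3, PROCESSING, FPS)
delays = [t - Fraction(i, FPS) for i, t in enumerate(times)]

print(workers)  # parallel frames required at the assumed frame rate
print(delays)   # per-frame delay relative to arrival
```

Under these assumptions, the delay does not accumulate from frame to frame, so a broadcast buffer only has to cover a single processing interval plus some margin.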