Groundbreaking AI Extracts Audio from Muted Videos and Images


Extracting audio from a still image sounds like science fiction, but researchers have now shown that it can be done with the help of artificial intelligence (AI).

A team led by Kevin Fu, a professor of electrical and computer engineering and computer science at Northeastern University, has developed a machine learning tool called Side Eye that recovers information about nearby sound from images and silent video.

What it Does

When applied to a still image, Side Eye can discern the gender of a speaker in the room where the photo was taken, transcribe spoken words, and even pinpoint the location. The tool also works on muted videos.

Kevin Fu explains its potential:

“Imagine someone is doing a TikTok video and they mute it and dub music. Have you ever been curious about what they’re really saying? Was it ‘Watermelon watermelon’ or ‘Here’s my password?’ Was somebody speaking behind them? You can actually pick up what is being spoken off-camera.”

How it Works

Side Eye leverages image stabilization technology commonly found in smartphone cameras. These cameras use a lens suspension system with springs submerged in liquid to keep photos clear and focused, even when the photographer’s hand is unsteady. Sensors and an electromagnet counteract movement by adjusting the lens in the opposite direction, stabilizing the image.
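The counteracting motion described above can be sketched in a few lines of Python. This is only an illustration of the idea of moving the lens opposite to the measured shake, not how any particular phone implements stabilization. The function name `stabilize`, the `gain` parameter, and the pixel units are all assumptions made for the example.

```python
def stabilize(shake_px, gain=1.0):
    """Return (lens_commands, residual_shift) for measured shake offsets.

    shake_px: image shift (in pixels) that hand motion would cause, per sample.
    gain: hypothetical fraction of the shake the lens actuator cancels;
          1.0 models perfect cancellation, which real hardware does not reach.
    """
    # Move the lens in the opposite direction to each measured offset.
    commands = [-gain * s for s in shake_px]
    # Whatever is not cancelled still shows up as image shift.
    residual = [s + c for s, c in zip(shake_px, commands)]
    return commands, residual


commands, residual = stabilize([2.0, -1.0, 0.5], gain=0.5)
print(commands)
print(residual)
```

With a gain below 1.0, part of each offset remains in the image, which is the kind of leftover motion a stabilization system tries to minimize.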

When someone speaks near the camera while a photo is being taken, the sound creates minute vibrations in the springs, which subtly alter the path of the light reaching the sensor. Recovering audio frequencies from these vibrations is challenging, but it is made possible by the rolling shutter used in most cameras.

Fu explains further:

“The way cameras work today to reduce cost basically is they don’t scan all pixels of an image simultaneously – they do it one row at a time. [That happens] hundreds of thousands of times in a single photo. What this basically means is you’re able to amplify by over a thousand times how much frequency information you can get, basically the granularity of the audio.”
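The arithmetic behind that amplification can be illustrated with a small simulation. The sketch below is not Side Eye's method: it assumes a hypothetical camera that records 30 frames per second with 1,080 rows per frame, reads rows back to back with no gaps, and sees each row shifted sideways by a vibrating lens. All function names and numbers are chosen for illustration.

```python
import math


def row_sample_rate(frames_per_second, rows_per_frame):
    """Rows read per second; each row captures a slightly later moment."""
    return frames_per_second * rows_per_frame


def simulate_row_shifts(tone_hz, fps, rows, amplitude_px=0.5):
    """Per-row sideways shift for one frame while the lens vibrates at tone_hz.

    Assumes rows are read at a constant rate with no pause between them.
    """
    rate = row_sample_rate(fps, rows)
    return [amplitude_px * math.sin(2 * math.pi * tone_hz * r / rate)
            for r in range(rows)]


def estimate_tone_hz(shifts, rate):
    """Estimate the vibration frequency by counting sign changes."""
    crossings = sum(1 for a, b in zip(shifts, shifts[1:]) if (a < 0) != (b < 0))
    duration = (len(shifts) - 1) / rate
    # Two sign changes make up one full cycle.
    return crossings / (2 * duration)


fps, rows = 30, 1080
rate = row_sample_rate(fps, rows)
print(rate)  # 32400 row samples per second, versus 30 whole frames per second

shifts = simulate_row_shifts(440, fps, rows)
print(estimate_tone_hz(shifts, rate))
```

Under these assumptions, treating each row as its own sample raises the sampling rate by a factor equal to the row count, here 1,080. That is enough for a single simulated frame to yield an estimate within a few hertz of the 440 Hz tone, whereas sampling once per frame at 30 Hz could not represent a tone that high at all.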

Implications

While Side Eye is in a preliminary stage and requires substantial training data for improvement, there are concerns about its potential misuse. In the wrong hands, an advanced version of this system could pose a significant cybersecurity threat.

The technology also has promising applications. An advanced version of Side Eye could, for example, help law enforcement agencies recover evidence during criminal investigations.
