Academic researchers have developed a technique to extract sounds from images captured by smartphone cameras with rolling shutters and movable lens structures.
The technique exploits inherent movement in camera hardware, such as CMOS rolling shutters and the movable lenses used for Optical Image Stabilization (OIS) and Auto Focus (AF): nearby sounds vibrate these components, and the vibrations are converted into imperceptible distortions in the captured images. This creates a novel “optical-acoustic side channel” for acoustic eavesdropping that requires neither line of sight nor any object within the camera’s field of view. Applying machine learning to the extracted acoustic information, the researchers were able to identify different speakers, genders, and spoken digits.
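The core idea is that a rolling shutter exposes each image row at a slightly different time, so a sound-induced lens vibration appears as a row-dependent shift; reading those shifts out row by row recovers a time-domain signal sampled at the row-readout rate. The following is a minimal sketch of that recovery step on synthetic data; the row rate, tone frequency, and cross-correlation approach are illustrative assumptions, not the researchers' actual pipeline.

```python
import numpy as np

def estimate_row_shifts(frame, reference_row):
    """Estimate per-row horizontal shift via circular cross-correlation.

    Each row of a rolling-shutter frame is exposed at a different instant,
    so lens vibration shows up as a per-row shift relative to a reference.
    """
    n_cols = frame.shape[1]
    ref_fft = np.fft.rfft(reference_row)
    shifts = []
    for row in frame:
        # Peak index of the circular cross-correlation gives the shift.
        corr = np.fft.irfft(np.fft.rfft(row) * np.conj(ref_fft), n=n_cols)
        shift = int(np.argmax(corr))
        if shift > n_cols // 2:  # wrap large indices to negative shifts
            shift -= n_cols
        shifts.append(shift)
    return np.array(shifts, dtype=float)

# --- Demo on a synthetic frame (all parameters are hypothetical) ---
rng = np.random.default_rng(0)
n_rows, n_cols = 480, 640
row_rate_hz = 60_000   # assumed row-readout rate
tone_hz = 500          # assumed acoustic tone vibrating the lens
texture = rng.standard_normal(n_cols)

t = np.arange(n_rows) / row_rate_hz
true_shifts = np.round(3 * np.sin(2 * np.pi * tone_hz * t)).astype(int)
frame = np.stack([np.roll(texture, s) for s in true_shifts])

recovered = estimate_row_shifts(frame, texture)
# The dominant frequency of the recovered shift signal matches the tone.
spectrum = np.abs(np.fft.rfft(recovered - recovered.mean()))
freqs = np.fft.rfftfreq(n_rows, d=1 / row_rate_hz)
peak_hz = freqs[np.argmax(spectrum)]
```

In a real attack the reference row is not known, and the shifts are fractions of a pixel buried in sensor noise, which is why the researchers needed far more sophisticated signal processing; the sketch only illustrates why a single frame can encode an audio snippet.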
To execute this technique, the researchers assumed the attacker has a malicious app running on the victim’s smartphone but no access to the device’s microphone. The threat model also requires the attacker to capture video with the victim’s camera and to obtain speech samples of the target individuals in advance for machine learning training. Using a dataset of 10,000 single-digit utterances, they trained models for three classification tasks, gender, identity, and digit recognition, achieving high accuracy on each.
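Once audio-like signals are recovered, the classification step is conceptually ordinary supervised learning: reduce each utterance to a feature vector and predict its label. The sketch below uses a nearest-centroid classifier on synthetic feature vectors purely to illustrate that step; the class count of ten matches the digit task, but the feature dimensions, noise model, and classifier are stand-in assumptions rather than the researchers' actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each recovered utterance becomes a fixed-length
# spectral feature vector; digits 0-9 are the class labels.
n_classes, n_features, n_per_class = 10, 64, 50
class_means = rng.standard_normal((n_classes, n_features))

# Balanced synthetic training set: class mean plus Gaussian noise.
train_y = np.repeat(np.arange(n_classes), n_per_class)
train_x = class_means[train_y] + 0.3 * rng.standard_normal(
    (n_classes * n_per_class, n_features))

# Random test set drawn from the same distribution.
test_y = rng.integers(0, n_classes, size=200)
test_x = class_means[test_y] + 0.3 * rng.standard_normal((200, n_features))

# "Training": average feature vector per class.
centroids = np.stack(
    [train_x[train_y == c].mean(axis=0) for c in range(n_classes)])

# Prediction: nearest centroid in Euclidean distance.
dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == test_y).mean()
```

Real extracted audio is far noisier and the classes overlap heavily, which is why the researchers needed a trained model and 10,000 samples; the point here is only the shape of the task, not its difficulty.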
While the potential for this security risk exists, smartphone makers can take measures to mitigate it, such as higher rolling shutter frequencies, random-code rolling shutters, more robust lens suspension springs, and lens locking mechanisms. The researchers caution that this optical-acoustic side channel could support a variety of malicious applications, underscoring the need for ongoing vigilance and countermeasures in smartphone security.