Researchers use AI to turn sounds into images

Researchers at the University of Texas at Austin have developed a soundscape-to-image artificial intelligence model that converts audio recordings into images.

The model was trained on audio and visual data gathered from a variety of urban and rural streetscapes. The researchers drew on YouTube footage of cities in Europe, Asia, and North America, extracting pairs of 10-second audio clips and image stills; the trained model can now generate high-resolution images from audio input alone.
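The article does not detail the preprocessing pipeline, but a minimal sketch of how such audio-image pairs might be cut from downloaded street-scene footage could look like the following, assuming ffmpeg is installed and using the 10-second clip length reported above; all file names and directory layout are hypothetical:

```python
"""Sketch of building audio-image training pairs from street-scene video.

The study's preprocessing pipeline is not described in the article; this
is a minimal illustration assuming ffmpeg is installed and downloaded
footage sits in a local videos/ directory. All paths are hypothetical.
"""
import subprocess
from pathlib import Path

CLIP_SECONDS = 10  # pair length reported in the article


def make_pair(video: Path, start: float, out_dir: Path, idx: int) -> None:
    """Cut a 10-second audio clip and grab the frame at its midpoint."""
    out_dir.mkdir(parents=True, exist_ok=True)
    wav = out_dir / f"{video.stem}_{idx:04d}.wav"
    jpg = out_dir / f"{video.stem}_{idx:04d}.jpg"
    # Audio: mono 16 kHz WAV, a common input format for audio encoders
    # (an assumption, not the study's stated choice).
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(CLIP_SECONDS),
         "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
    # Image: a single still taken from the middle of the audio window.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start + CLIP_SECONDS / 2),
         "-i", str(video), "-frames:v", "1", str(jpg)],
        check=True,
    )


if __name__ == "__main__":
    for i, clip in enumerate(sorted(Path("videos").glob("*.mp4"))):
        make_pair(clip, start=30.0, out_dir=Path("pairs"), idx=i)
```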

To test accuracy, the researchers compared images generated from 100 audio clips with real-world photographs of the corresponding locations, using both human and computer evaluations.

Results showed a strong correlation between the AI-generated and real-world images in the proportions of sky and greenery, with a lower correlation in the proportion of buildings. Human participants, asked to pick which generated image matched a given audio sample, averaged 80% accuracy.
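As a rough illustration of the computer evaluation, the per-class proportions could come from a semantic segmentation of each image, with a Pearson correlation computed across the 100 test pairs. The segmentation step and the choice of correlation measure here are assumptions, not the study's stated method:

```python
"""Sketch of the scene-composition comparison for generated vs. real images.

The article does not specify how sky, greenery, and building proportions
were measured; this sketch assumes a semantic-segmentation step has
already labelled each pixel, then computes a Pearson correlation of the
per-image class fractions across the 100 test pairs. The stand-in data
below are fabricated for demonstration only.
"""
import numpy as np


def class_fraction(seg_mask: np.ndarray, class_id: int) -> float:
    """Fraction of pixels carrying `class_id` in a segmentation mask."""
    return float(np.mean(seg_mask == class_id))


def proportion_correlation(real_fracs: np.ndarray, gen_fracs: np.ndarray) -> float:
    """Pearson correlation between real and generated class fractions."""
    return float(np.corrcoef(real_fracs, gen_fracs)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in sky fractions for 100 evaluation pairs (fabricated demo data).
    real_sky = rng.uniform(0.1, 0.6, size=100)
    gen_sky = np.clip(real_sky + rng.normal(0.0, 0.05, size=100), 0.0, 1.0)
    print(f"sky-proportion correlation: {proportion_correlation(real_sky, gen_sky):.2f}")
```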

Yuhao Kang, assistant professor of geography and the environment at the University of Texas at Austin, commented: “Traditionally, the ability to envision a scene from sounds is a uniquely human capability, reflecting our deep sensory connection with the environment. Our use of advanced AI techniques supported by large language models (LLMs) demonstrates that machines have the potential to approximate this human sensory experience.

“This suggests that AI can extend beyond mere recognition of physical surroundings to potentially enrich our understanding of human subjective experiences at different places.”