A team from the University of Washington (UW) has developed a prototype translation system that tackles a challenge facing many public space translation tools: understanding and distinguishing between multiple speakers in real-world, often noisy environments.
According to a report from the University of Washington, the new system, dubbed Spatial Speech Translation, uses off-the-shelf noise-cancelling headphones equipped with microphones to isolate, translate and relay speech from multiple people in a space. Unlike conventional translation technologies that focus on a single speaker, the UW system preserves the direction and characteristics of each speaker's voice and keeps tracking each speaker as they move.
The project was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, on April 30 and the research team’s proof-of-concept code is now publicly available for others to build upon.
The system emerged from the frustrations of lead author Tuochao Chen, a UW doctoral student, who struggled to understand a museum tour in Mexico due to noisy conditions. “Other translation tech is built on the assumption that only one person is speaking,” said senior author Shyam Gollakota, a professor in the Paul G. Allen School of Computer Science & Engineering at UW, in an article on the university’s website. “But in the real world, you can’t have just one robotic voice talking for multiple people in a room. For the first time, we’ve preserved the sound of each person’s voice and the direction it’s coming from.”
The technology introduces three main innovations. First, it uses a radar-like algorithm to scan a space in 360 degrees, instantly detecting how many people are speaking. Then, it processes translations while maintaining each speaker’s voice quality and volume. Lastly, it dynamically adapts as users move, ensuring spatial audio cues remain accurate.
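To give a rough sense of the direction-scanning idea behind that first step, the sketch below runs a minimal delay-and-sum scan over two microphone channels, picking the arrival angle where the summed speech energy peaks. It is a hypothetical Python illustration, not the UW team's published code, and the microphone spacing, sample rate and angle sweep are assumed values.

```python
# Hypothetical sketch (not the UW team's code): sweep candidate arrival angles
# with a two-microphone delay-and-sum beamformer and report where the steered
# speech energy peaks, loosely analogous to the "radar-like" scan described above.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.15       # metres between the two headphone microphones (assumed)
SAMPLE_RATE = 16_000     # Hz (assumed)

def steer_energy(left: np.ndarray, right: np.ndarray, angle_deg: float) -> float:
    """Delay-and-sum energy for one candidate arrival angle."""
    # Time difference of arrival implied by the angle, converted to whole samples.
    tdoa = MIC_SPACING * np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND
    shift = int(round(tdoa * SAMPLE_RATE))
    aligned = left + np.roll(right, shift)   # align the two channels and sum them
    return float(np.mean(aligned ** 2))      # average power of the steered sum

def scan_directions(left: np.ndarray, right: np.ndarray, step: int = 5):
    """Return per-angle energy over a sweep; peaks suggest active speakers."""
    angles = np.arange(-90, 91, step)        # front-hemisphere sweep for a two-mic pair
    return angles, np.array([steer_energy(left, right, a) for a in angles])

if __name__ == "__main__":
    # Synthetic test: noise-like "speech" arriving slightly earlier at the left mic.
    rng = np.random.default_rng(0)
    src = rng.standard_normal(SAMPLE_RATE)
    left, right = src, np.roll(src, 3)       # right channel lags by 3 samples
    angles, energy = scan_directions(left, right)
    print("strongest direction:", int(angles[int(np.argmax(energy))]), "degrees")
```

In this toy example the steered energy peaks near the synthetic source's true arrival angle; the prototype's actual speaker detection, separation and spatial rendering are, per the report, considerably more sophisticated and run in real time.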
Rather than relying on cloud processing, which can raise privacy concerns around voice data, the system runs locally on devices powered by Apple’s M2 chip, such as laptops and the Apple Vision Pro.
The team tested the system in ten real-world settings, indoors and outdoors. In user trials with 29 participants, the prototype was preferred over baseline models that lacked spatial tracking. A further study found that participants preferred a translation delay of 3 to 4 seconds over shorter latencies, which tended to introduce more errors.
While the system currently handles everyday speech in Spanish, German and French, the researchers note that existing translation models could be expanded to cover as many as 100 languages. Specialised terminology and jargon, however, remain beyond the tool's current capabilities.
“This is a step toward breaking down the language barriers between cultures,” said Chen in the original University of Washington article. “So if I’m walking down the street in Mexico, even though I don’t speak Spanish, I can translate all the people’s voices and know who said what.”
The research was supported by a Moore Inventor Fellow award and the UW CoMotion Innovation Gap Fund. Co-authors include Qirui Wang, a research intern at HydroX AI and former UW undergraduate, and Runlin He, a UW doctoral student.
Top image credit: eamesBot/Shutterstock.com