Smart speakers and other speech-recognition systems are a booming market. Tim Kridel explores the promise and the peril.
My wife asked me why I spoke so softly in the house. I said I was afraid Mark Zuckerberg was listening. She laughed. I laughed. Alexa laughed. Siri, too.
So goes a joke making the rounds on the internet. Whether it makes you laugh or cry depends on your view of smart speakers: A new set of tools for interactive digital signage and controlling conference room AV systems? Or an attack vector for industrial espionage and other security breaches? Maybe a little of both?
By some estimates, at least 21% of UK homes now have at least one smart speaker. For Germans and the French, it’s about 12% and 7.5%, respectively.
Those numbers are noteworthy because like smartphone virtual assistants such as Siri, smart speakers are steadily conditioning more and more people to use their voice to interact with things. Those experiences at home influence their expectations about what’s possible and preferable in the workplace, stores and elsewhere.
One example in public spaces is wayfinding digital signage, where people simply ask for the location of something rather than trying to figure out which icons to poke. This use case also highlights one benefit of speech user interfaces in the Covid-19 era.
“The requirement for touchless technology experiences has certainly increased during the pandemic, but the real driver is the increasing sophistication of the applications living behind the voice technology,” says Joel Chimoindes, Maverick AV Solutions vice president for Europe. “Artificial intelligence (AI) is driving the collection and processing of the data behind the request and helping turn that into meaningful commands. Its accuracy and the speed at which it is obtained is the real gamechanger.”
Another public-place use case is retail.
“Imagine ordering a different size to be brought to you in your changing room via voice control,” Chimoindes says.
By your command
In the workplace, AI-powered speech recognition enables new user interfaces (UIs) for conference room AV systems. For example, meeting participants can say “Lower the shades and dim the lights” instead of using a touchpanel.
“Speech recognition is part of a larger category of techniques people are using to make technology a little more natural to interact with,” says Joe Andrulis, Biamp EVP of corporate development. “You don’t have to learn it as much as it learns you.”
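At its simplest, the command layer behind such a UI is a mapping from recognised phrases to device actions. The sketch below is illustrative only; the phrases, device names and actions are hypothetical, and a production system would sit behind a proper NLP service and a control processor.

```python
# A minimal sketch of a meeting room voice command layer: map recognised
# phrases to AV device actions. Phrases, devices and actions are
# hypothetical, chosen only to show the shape of the mapping.

INTENTS = {
    "lower the shades": ("shades", "lower"),
    "dim the lights": ("lights", "dim"),
    "start the meeting": ("codec", "join_scheduled_call"),
}

def route_command(transcript: str) -> list[tuple[str, str]]:
    """Return (device, action) pairs for every known phrase in the transcript."""
    text = transcript.lower()
    return [action for phrase, action in INTENTS.items() if phrase in text]

# "Lower the shades and dim the lights" yields two device actions.
for device, action in route_command("Lower the shades and dim the lights"):
    print(f"send {action!r} to {device!r}")
```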
AI also is automating applications that used to require humans.
“The events space has been driving innovation in the area, with live subtitling and translation already emerging at international conferences,” Chimoindes says. “At Microsoft Inspire in 2019, they showcased a language-translating HoloLens hologram. This was an inflexion point, showcasing the possibilities of the technology.”
The AI component often is referred to as natural language processing or understanding (NLP/NLU), depending on the vendor. NLP keeps getting more sophisticated, but there’s still a lot of room for improvement.
“For example, there are sometimes a huge number of errors in your average meeting transcript,” Chimoindes says. “In this application, this isn’t necessarily a problem as it’s been proven that we naturally fill in the gaps while reading.”
But it can be a problem with other applications, which is something AV pros need to consider when designing systems.
“While accuracy is still improving, integrators need to provide non-voice-activated options in every case to ensure accessibility,” Chimoindes says. “The application should be carefully considered depending on how the misinterpretation of a command may affect the user experience of your service or brand.”

AI-powered speech recognition is already used in a wide variety of non-AV applications, such as call centre virtual assistants developed to lighten the load for human agents. In some cases, those applications could help refine AV systems by providing a database of words and phrases commonly used by employees, visitors and others.
Getting emotional
One lesson from the contact centre world is that when people get irritated or frustrated, some begin speaking low and slow, while others get faster and higher pitched. If the AI identifies these changes, it may hand the call off to a human agent. An AV application could respond in similar ways, especially if it already has experience with that employee or customer.
“The more the system knows about the speaker and the situation, the better it can decipher the emotions and behaviours exhibited,” says Rana Gujral, CEO of Behavioral Signals, which specialises in AI that can analyse emotion. “So it is important to identify a speaker’s neutral (or emotion-balanced) state based on all these things and then process emotions as cases of divergence from that state.
“There is also a global neutral state, of course, [that] is determined based on data from multiple speakers, which allows the system, for example, to identify a speaker from a reference standpoint, [such as] relatively more angry than others.”
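The divergence idea can be sketched in a few lines of Python: estimate a speaker’s neutral pitch and speaking rate from earlier utterances, then flag anything that drifts far from that baseline. The features and the threshold below are illustrative, not Behavioral Signals’ actual method.

```python
from statistics import mean, stdev

def neutral_state(samples: list[dict]) -> dict:
    """Estimate a speaker's baseline (mean, spread) for each vocal feature."""
    return {
        key: (mean(s[key] for s in samples), stdev(s[key] for s in samples))
        for key in ("pitch_hz", "words_per_min")
    }

def divergence(utterance: dict, baseline: dict) -> float:
    """Largest z-score across features: distance from the neutral state."""
    return max(abs(utterance[k] - mu) / sigma for k, (mu, sigma) in baseline.items())

# Baseline built from a speaker's earlier, calm utterances.
history = [{"pitch_hz": 120, "words_per_min": 150},
           {"pitch_hz": 125, "words_per_min": 145},
           {"pitch_hz": 118, "words_per_min": 155}]
base = neutral_state(history)

# A faster, higher-pitched utterance diverges sharply from that baseline,
# the kind of cue a contact centre might use to escalate to a human agent.
if divergence({"pitch_hz": 170, "words_per_min": 210}, base) > 3.0:
    print("escalate: speaker is diverging from their neutral state")
```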
Accents, slang and other attributes affect NLP’s ability to analyse a person’s state of mind.
“It’s a factor of having properly modelled the context of the interaction,” Gujral says. “It would be very difficult for a generically trained emotion-recognition system to properly account for the idiosyncratic properties of speakers in a specific region, or of a specific culture, for example. Adaptation of the models by employing some sort of transfer learning is typically a requirement in such cases.”
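A minimal sketch of that adaptation, assuming a frozen, generically trained feature extractor (the pretrained_embed stand-in below) and a small set of locally collected, labelled clips, might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pretrained_embed(clips: list) -> np.ndarray:
    """Stand-in for a frozen, generically trained acoustic feature extractor."""
    return np.stack([[clip.mean(), clip.std()] for clip in clips])

# A small, locally collected set of clips with emotion labels
# (0 = neutral, 1 = angry). Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
local_clips = [rng.normal(loc=i % 2, scale=1.0, size=16000) for i in range(20)]
local_labels = [i % 2 for i in range(20)]

# Transfer learning in miniature: keep the feature extractor fixed and
# train only a lightweight classifier head on the regional data.
head = LogisticRegression().fit(pretrained_embed(local_clips), local_labels)
print(head.predict(pretrained_embed(local_clips[:4])))  # e.g. [0 1 0 1]
```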
Behavioral Signals has spent the past few years developing an application programming interface (API) that conferencing vendors could use to refine their technology, such as to provide better user experiences.
“There are many compelling applications, [such as] adjusting the video layout in teleconferences that involves a group of people based on the way they speak and their overall behaviour to ensure better engagement of all participants,” Gujral says. “[Another is] providing real-time feedback to the participants in business-related conversations to help make a participant aware of their emotions and potentially help them avoid situations in which, for example, they may overreact.”
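An integration along those lines might look something like the sketch below. The endpoint, payload and response fields are hypothetical, not Behavioral Signals’ published API.

```python
import requests

# Hypothetical endpoint and fields, used here only to show the shape of
# such an integration; this is not Behavioral Signals' actual API.
API_URL = "https://api.example.com/v1/engagement"

def engagement_scores(audio_path: str, token: str) -> dict:
    """Send a meeting recording for analysis, return per-participant scores."""
    with open(audio_path, "rb") as f:
        resp = requests.post(API_URL,
                             headers={"Authorization": f"Bearer {token}"},
                             files={"audio": f})
    resp.raise_for_status()
    return resp.json()  # e.g. {"participant_3": 0.31, "participant_7": 0.82}

def layout_order(scores: dict) -> list:
    """Promote the least-engaged participants so the host notices them."""
    return sorted(scores, key=scores.get)
```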
The walls have ears
Gartner predicts that by 2025, 75% of workplace conversations will be recorded and analysed, “enabling the discovery of added organizational value or risk.” One example is applying speech recognition to meetings for more than just transcription.
“Analytics from conversations in the enterprise will greatly increase the ability to monitor compliance and risk, identify areas of improvements and streamline automation of processes,” Gartner Research principal Emily Potosky wrote in a recent report.
But there’s also risk in uncovering risk. For starters, speech recognition systems often aren’t local, meaning they send conversations up to the cloud, where far more processing power resides. This architecture also helps make NLP and other AI systems steadily smarter by providing them with a constant stream of conversations to learn from.
Allowing confidential conversations to go off premises is an obvious risk — one that’s grown along with telecommuting.
“When Covid first hit, security professionals said to take a survey of your house and turn off the smart speakers because we don’t know if it’s a threat vector or not,” says Frank Dickson, vice president of IDC’s Security & Trust program.
Potosky’s report predicts a few more scenarios: “In 2021, lawyers will subpoena recordings and analytics from meetings of a major corporation in a public sexual harassment case. In 2022, a major corporate acquisition will use analysis of recorded conversations as the primary data source for deciding which employees to retain. In 2024, there will be a major general strike in the EU over handling of employee information.”
So it’s no surprise that Gartner also is publishing research notes with titles such as “How CIOs Must Lead the Ethical Debate on Remote Employee Monitoring.” One concern is that employees’ perception that the walls have ears will stifle workstyles and creativity.
“Transparency with the people participating in a voice recorded conversation will be one area of consideration, data security another,” Chimoindes says. “A robust security policy and infrastructure will be vital to AV integration projects as voice becomes integral to our devices. Trust in the system by the workforce will be critical in ensuring it doesn’t impact on creativity, collaboration and productivity.
“However, the opportunities are really exciting. Organisations will be able to more accurately predict and evidence trends, personalise experiences and streamline administrative tasks. In the short term, it will reduce friction between people and technology, starting meetings more quickly, creating a more human interaction for digital signage than has ever been achieved.”
There are also opportunities for AV firms to differentiate themselves and their products. For example, integrators could offer cybersecurity and privacy services as part of their speech portfolio. Meanwhile, vendors could use cybersecurity and privacy features to compete with the consumer-grade smart speakers that sometimes are considered for workplace applications.
Some of these risks might seem obvious to AV pros, but that’s not necessarily the case with clients.
“I think there is some education that needs to go on,” Dickson says.
Another aspect is that AV/IT managers, chief information security officers (CISOs) and others already have a lot on their plate.
“The CISO is going to say: ‘That’s nice. I’m still trying to clean up SolarWinds, and I have four active cases of ransomware,’” Dickson says. “To get the CISO’s attention, you’re going to have to make the definition of the problem and the solution in one conversation. If you just dump something on the CISO’s desk, it might not elevate to the addressable [level] because there’s a lot of things going on.”
Despite all of the security, ethical and other challenges, Gartner believes recording analysis will become common, which is good news for AV firms providing the hardware and software.
"Trust in a digital world can only be accomplished through stricter guidelines. As a consumer, when we feel more secure, we are more likely to share information." - Rana Gujral, Behavioral Signals
“While initially regarded as a risk by legal departments and compliance officers, the ability to derive measurements of intangibles from recorded conversations will be seen as too beneficial,” Potosky writes. “Models that predict innovation ability, employee loyalty, competence gaps, undocumented knowledge and a multitude of other intangible factors will become possible. Ultimately, ethical transgressions and unintended consequences will turn employees against their employers. Major disruptions will take place, primarily in the EU, leading to a bill of rights for employees that will become a template for the rest of the world.”
In the meantime, there’s the GDPR bill of rights, which doesn’t necessarily undermine the speech recognition market opportunity.
“GDPR definitely has its detractors,” Gujral says. “A common belief is that stricter regulations place these companies at a disadvantage in comparison to companies which operate in countries with lesser restrictions.
“However, these concerns are misplaced. Trust in a digital world can only be accomplished through stricter guidelines. As a consumer, when we feel more secure, we are more likely to share information. This in turn allows companies access to more data to work with. Hence GDPR helps build trust and is really good for everyone. Applying GDPR delivers a higher standard which in turn leads to a competitive advantage.”
For AV firms, other aspects of speech recognition are familiar ground, starting with the right mics in the right places to ensure a good user experience.
“The dominant characteristic of speech recognition is acquiring a clear signal,” Andrulis says. “That’s something we worry about already.”
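One concrete way to act on that worry is a pre-flight check: estimate the signal-to-noise ratio of a room-mic capture before handing audio to the recogniser. The 6dB threshold in this sketch is an illustrative value, not an industry standard.

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB from a speech segment and a silence segment."""
    return 10 * np.log10(np.mean(speech**2) / np.mean(noise**2))

rng = np.random.default_rng(0)
silence = rng.normal(0, 0.01, 16000)                          # ambient room noise
speech = silence + 0.1 * np.sin(np.linspace(0, 2000, 16000))  # voiced capture

if snr_db(speech, silence) < 6.0:
    print("warn: poor capture, check mic placement before enabling the voice UI")
else:
    print("signal quality OK for speech recognition")
```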