
EETE OCT 2013

Digital Signal Processing

Communicating with robots

By Rob Hatfield

Talking to machines is still an awkward experience. Until recently, advancements in machine interpretation of human speech had not gone far enough to bring meaningful benefits to mainstream users. Current developments in low-power audio technology have the potential to improve this man-machine relationship permanently, as bottlenecks that previously obstructed real progress in speech recognition are removed. A path to an era of rapid innovation in human-machine interaction is now opening up. This will lead to interesting developments in the way we interact with machines that can listen to us and, increasingly, understand us.

Speech is perhaps the most natural way for humans to communicate, but the introduction of a machine into the process creates a need for new behavioural protocols, particularly where there are no ongoing visual cues from the other party during speech. The first telephone calls were a little awkward for early users, and the broken conversational style of two-way radio still requires a little adjustment for new users today. In both cases, common practices quickly developed to achieve a fairly natural communication style, largely because the other party was also human.

As mobile users are confronted with new speech recognition interfaces, they will face similar challenges to those of much older communication media. In a more recent example, the touchscreen revolution demonstrated how new, unfamiliar and awkward interfaces can be thrust into mainstream use and popularity if they perform with high quality and include features that add value to the user experience. It is therefore worth defining "performance" of voice control in a much broader sense than has traditionally been the case. More future-proof solutions can then be designed which take account of next-generation bottlenecks.
Building a high-performance speech recognition solution

Very simple performance metrics have traditionally been used for speech recognition solutions. These metrics are usually quoted as single "accuracy" or "hit rate" figures, essentially indicating the probability of correctly identifying words or phrases. A much broader, more considered approach to defining "performance" is needed, one which reflects the longer-term potential of speech interfaces to provide the same level of comfort and engagement for users as touchscreen interfaces.

Fig. 1: Always-on voice trigger using an audio hub.

Quality of interpretation, essentially a form of artificial intelligence that goes much further than basic word recognition, plays a key role. Access to all device functionality also makes speech recognition a viable alternative to touchscreens, and interestingly this makes the technology applicable to a much greater range of device types, including smaller devices such as wearable technology. Low response latency and a natural, "protocol-free" interaction that performs well amid ambient noise also improve the experience. Careful system design is required to enable device-level signal processing to combine well with cloud-based intelligence and bring these performance enhancements to users.

Removing the button

The most significant ergonomic limitation of speech recognition today is the need for a button press or other mechanical activation, which restricts usability in many environments. This mechanical trigger is ultimately the result of a power consumption limitation. In order to maintain competitive battery-life figures, standby power budgets are extremely low in mobile devices, typically single-digit milliamps of battery current. It is not feasible to run speech recognition (or at least arbitrary speech recognition) continuously when power budgets are this low.
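To make the budget problem concrete, the back-of-envelope sketch below compares running recognition continuously on the host processor against letting a low-power listening block wake the host only occasionally. All current figures and duty cycles here are illustrative assumptions, not measurements from the article or from any particular device:

```python
# Back-of-envelope average-current estimate for always-on listening.
# All figures below are illustrative assumptions only.

STANDBY_BUDGET_MA = 5.0  # a typical single-digit-mA standby budget


def average_current_ma(active_ma, sleep_ma, duty_cycle):
    """Average battery current for a block that is active a fraction
    `duty_cycle` of the time and otherwise sleeping."""
    return active_ma * duty_cycle + sleep_ma * (1.0 - duty_cycle)


# Full speech recognition running on the host processor continuously:
host_always_on = average_current_ma(active_ma=200.0, sleep_ma=2.0,
                                    duty_cycle=1.0)

# A low-power trigger block listening constantly, with the host asleep
# and woken only on a key-phrase hit (assumed 0.5% of the time):
trigger_block = average_current_ma(active_ma=1.5, sleep_ma=0.3,
                                   duty_cycle=1.0)
host_mostly_asleep = average_current_ma(active_ma=200.0, sleep_ma=2.0,
                                        duty_cycle=0.005)

print(f"host always on:      {host_always_on:.1f} mA")
print(f"trigger + idle host: {trigger_block + host_mostly_asleep:.2f} mA "
      f"(budget {STANDBY_BUDGET_MA} mA)")
```

With these assumed numbers, continuous host-based recognition exceeds the standby budget by well over an order of magnitude, while the duty-cycled arrangement fits within it, which is why the trigger function must live somewhere cheaper than the host.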
A button-press trigger has until now provided a crude solution to this problem, minimising average power consumption by disabling speech recognition until the button is pressed. However, voice triggering functions are now being featured on the latest high-end audio hubs as OEMs look to make voice recognition features slicker and easier to use. Reducing the average power footprint of speech recognition dramatically, to levels that fall within standby-mode budgets, allows the host processor to 'sleep'. This power reduction (typically an order of magnitude) is so significant that the button-press requirement can be removed altogether.

Voice trigger architecture choices

A voice trigger is a short key word or phrase (such as "Hello phone") which causes the device to wake up and respond to subsequent speech input. The semi-autonomous, low-power "always-on" processing domain illustrated in figure 1 provides a platform for this voice trigger. Audio hubs provide a natural home for the voice trigger function: they feature interfaces to all internal and headset microphones, and they typically run during standby mode anyway for other reasons, such as accessory interface monitoring. This reduces duplication of utility functions in the system, such as clock generators and voltage references, reducing quiescent power. Hardware optimisations in audio hubs targeted at voice wake-up allow signal processing cycles to be kept to a minimum.

Rob Hatfield is Principal Solutions Architect at Wolfson Microelectronics – www.wolfsonmicro.com. He can be reached at Robert.Hatfield@wolfsonmicro.com

Electronic Engineering Times Europe, October 2013 – www.electronics-eetimes.com
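The always-on trigger described above can be sketched as a cheap front-end stage gating a more expensive key-phrase check, so that silent frames cost almost nothing. This staged structure, the energy threshold and the dummy phrase detector below are illustrative assumptions for the sketch, not the algorithm used in any particular audio hub:

```python
# Minimal sketch of an always-on voice trigger: an inexpensive
# energy-based voice activity check gates a costlier key-phrase
# detector, keeping signal processing cycles low during silence.
# Illustrative simplification only.

def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)


def voice_trigger(frames, vad_threshold=0.01, phrase_detector=None):
    """Yield True for each frame on which the wake phrase is detected.

    `phrase_detector` stands in for a real key-phrase recogniser
    (e.g. a small neural network); here it is any callable taking
    a frame and returning True or False.
    """
    for frame in frames:
        if frame_energy(frame) < vad_threshold:
            yield False  # silence: skip the expensive stage entirely
        else:
            yield phrase_detector(frame)


# Example with a dummy detector that accepts any frame it is given:
frames = [[0.0] * 160,    # silence
          [0.005] * 160,  # quiet noise, below the energy threshold
          [0.5] * 160]    # speech-like energy
hits = list(voice_trigger(frames, phrase_detector=lambda f: True))
print(hits)  # → [False, False, True]
```

In a real system the loop would run on the audio hub's DSP, and a True result would assert a wake interrupt to the sleeping host processor, which then takes over full recognition.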

