
EETE FEBRUARY 2013

AUDIO & VIDEO ELECTRONICS

Fig. 4: Word error rate test set-up.

Fig. 6: GMOS as a function of noise type for 3D voice processing and standard 2D voice processing.

noise. The ambient noise cancellation algorithm uses feature extraction information and the output of the 3D-Vocal block. Equalisation then shapes the spectral distribution of the received signal to match the requirements of either the ASR process or the voice call.

Minimising word error rate for speech recognition

To evaluate how the improved voice quality affects the performance of a virtual assistant, a test was conducted to measure the Word Error Rate (WER), which counts the word errors between the spoken word sequence and the recognised one, using the formula:

WER = (S + D + I) / N

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference (N = S + D + C, with C being the number of correct words).

A voice script was dictated to a commercial virtual assistant system on a mobile phone, once with 3D voice processing and once with 2D voice processing. The script was dictated against different types of background noise, such as a cafe, a pub, a car and a train, and the WER was calculated for each processing mode.

The benchmark was performed using the set-up shown in Figure 4, with all tests carried out in an acoustic chamber that included a Head and Torso Simulator (HATS). The mobile handset under test was attached to the head of the mannequin. Background noise per ETSI EG 202 396-1 was fed to four loudspeakers and a subwoofer by a PC that was triggered by the Master PC. The “dictated clean speech” was played through the mouth of the HATS, fed from the Master PC via the analogue front end (Head Acoustics’ Measurement Front End, MFE VI.1).
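As an illustration of the WER definition given earlier, the following is a minimal sketch of how S, D, I and C might be counted from a reference script and a recognised transcript. It is not the tool used in the test described here: it assumes a standard word-level minimum-edit-distance alignment, lowercases the text and splits on whitespace, and the function names and example sentences are hypothetical.

```python
def wer_counts(reference, hypothesis):
    """Return (S, D, I, C) from a minimum-edit-distance word alignment.

    Assumption: words are compared after lowercasing and whitespace
    splitting; a real evaluation may normalise text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # table[i][j] holds (errors, S, D, I, C) for ref[:i] against hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        table[i][0] = (i, 0, i, 0, 0)   # only deletions
    for j in range(len(hyp) + 1):
        table[0][j] = (j, 0, 0, j, 0)   # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins, c = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins, c + 1)        # correct word
            else:
                diagonal = (e + 1, s + 1, d, ins, c)    # substitution
            e, s, d, ins, c = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            # Keep the alignment with the fewest total errors.
            table[i][j] = min(diagonal, deletion, insertion,
                              key=lambda t: t[0])

    _, s, d, ins, c = table[-1][-1]
    return s, d, ins, c


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N = S + D + C (reference length)."""
    s, d, ins, c = wer_counts(reference, hypothesis)
    n = s + d + c
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + ins) / n


# Hypothetical example: one deletion ("on") and one substitution
# ("light" recognised as "lights") against a five-word reference.
print(word_error_rate("turn on the kitchen light",
                      "turn the kitchen lights"))
```

Because insertions are counted in the numerator but not in N, a WER computed this way can exceed 100% when the recogniser outputs many spurious words.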
The speech captured by the mobile phone was converted to a text email by the virtual assistant, and the WER was then calculated by analysing the received email.

The results are presented in Figure 5. With 3D voice processing, the WER stays in the range of 10 to 15% for all noise types, whereas with 2D voice processing it ranges from 18 to 60% depending on the noise type. ASR with 2D voice processing is therefore inconsistent in noisy environments: it works well in some noise types and very poorly in others. With 3D voice processing, by contrast, ASR degradation remains minimal and consistent across all noise types, making the virtual assistant significantly more reliable.

Improving the quality of voice communication

By incorporating advanced noise cancellation capabilities into smart phones for voice communication, voice quality can be significantly improved, from “poor” to “very good”. The audio quality of 3D voice processing was compared with standard 2D noise cancellation techniques using the ETSI EG 202 396-1 standard, which defines a method for objectively testing the quality of noise reduction algorithms. The scale for general quality (GMOS) is shown in Table 1. Voice quality was compared according to the MOS scores, using a standard phone with built-in 2D voice processing, in different types of noisy environments. Figure 6 shows that 3D voice processing scores significantly higher than standard 2D voice processing.

Value added 3D voice processing

For safety and convenience reasons, hands-free operation will often be the first choice for consumers, yet voice control is only beginning to realise its true potential. Test results indicate that 3D voice processing can significantly improve the reliability and usability of voice control, enabling it to become a valuable differentiator.
In addition to better voice control for smart phones, tablets and a wide range of consumer electronic devices, 3D voice processing supports the injection of background music or sounds during ongoing conversations. This could enable telecom service providers to offer new premium services and generate additional revenue.

Fig. 5: Increased virtual assistant reliability with 3D voice processing.

Table 1: General MOS scores (GMOS).

18 Electronic Engineering Times Europe February 2013 www.electronics-eetimes.com

