TIPS
Here are some helpful tips for developing with AmiVoice API.
Client Program Related
Preventing Unintended Speech Recognition
Speech recognition can run unintentionally, for example when the client forgets to disconnect after recognition ends or when a user makes an operational error. Such accidents can transmit voice data containing important information or incur unnecessary costs, so we recommend building safeguards against them into the client program. For example, the following mechanisms could be considered:
- Clearly notify users that voice recording or speech recognition is in progress through screen display or other means.
- For real-time speech recognition, display a confirmation dialog to the user when the session exceeds a certain duration.
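The session-duration check in the second item could be sketched as follows. This is a minimal illustration, not part of the AmiVoice API: the class name `SessionWatchdog`, its methods, and the `max_seconds` threshold are all hypothetical, and the client program is assumed to call `needs_confirmation()` periodically from its own loop.

```python
import time


class SessionWatchdog:
    """Tracks how long a recognition session has been open (hypothetical helper)."""

    def __init__(self, max_seconds, clock=time.monotonic):
        self.max_seconds = max_seconds  # assumed threshold chosen by the client
        self._clock = clock             # injectable so the timer can be tested
        self._started_at = None

    def start(self):
        """Call when the recognition session is opened."""
        self._started_at = self._clock()

    def stop(self):
        """Call when the recognition session is closed."""
        self._started_at = None

    def elapsed(self):
        """Seconds since start(), or 0.0 if no session is open."""
        if self._started_at is None:
            return 0.0
        return self._clock() - self._started_at

    def needs_confirmation(self):
        """True once an open session has reached the configured duration."""
        return self._started_at is not None and self.elapsed() >= self.max_seconds

    def extend(self):
        """Call after the user confirms; restarts the timer."""
        self._started_at = self._clock()
```

When `needs_confirmation()` returns true, the client would show its confirmation dialog, then call `extend()` if the user wants to continue, or close the session and call `stop()` otherwise.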
Audio Data Related
Recommendation for Checking Recording Quality
If the speech recognition results are so poor that they do not even form coherent sentences, the recording quality may be at fault. To avoid running speech recognition only to get unusable results, we recommend encouraging end users to check the recording quality. For example, check the following points:
- Is the speech to be recognized loud enough? As a rough guideline, for 16-bit audio, an amplitude of around 3000 is good. Conversely, take care that the volume is not so high that the sound distorts.
- Is the speech audio muffled and difficult to hear?
- Is the speech to be recognized drowned out by environmental sounds or the voices of other speakers?
When the speech recognition request parameters are appropriate and there are no issues with the recording quality, the recognition results should not be significantly poor. Even if you cannot check the recording quality in advance, monitor the recognition results in real time and review the recording if abnormal results appear. Note that even in quiet indoor settings, such as a face-to-face meeting in a large conference room, recording quality can deteriorate in the ways described above depending on the placement and performance of the recorder. For example, the recorder may be far from the speaker whose speech is to be recognized, or noise such as typing or paper rustling may occur very close to the recorder.
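As a rough way to check the volume guideline above, the following sketch reads a 16-bit PCM WAV file and reports its peak sample value. The guideline does not say how amplitude should be measured, so using the peak is an assumption here. The function names and the `low` threshold of 1000 are also hypothetical; only the figure of around 3000 comes from the guideline.

```python
import array
import sys
import wave


def peak_amplitude(path):
    """Return the largest absolute sample value in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16-bit samples
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return max((abs(s) for s in samples), default=0)


def classify_level(peak, low=1000, clip=32767):
    """Label a peak value; `low` is an assumed cutoff, not an official one."""
    if peak >= clip:
        return "clipping"   # samples hit full scale, so distortion is likely
    if peak < low:
        return "too quiet"  # well below the ~3000 guideline
    return "ok"
```

A client could run this on a short test recording before a long session and warn the user when the result is not "ok". It checks only volume, not muffled speech or background noise.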
Cautions Regarding Audio Data Processing
Audio data for speech recognition tends to yield better recognition accuracy when it is easy for humans to hear (in terms of volume, sound quality, speaking style, and so on). Processed audio is an exception: even when processing makes the audio easier for humans to hear, speech recognition accuracy may decrease. Below are some cautions regarding audio data processing.
Noise Cancellation and Echo Cancellation
Depending on the method, noise cancellation and echo cancellation can distort the audio signal so that its characteristics differ from the audio the speech recognition engine was trained on, which lowers recognition accuracy. They can be effective when noise is severe, but in general we recommend not using them.
Automatic Gain Control (AGC)
Automatic Gain Control (AGC), which keeps the audio signal at a constant level, can hurt speech recognition itself but can help the detection of speech segments. If accuracy is poor because speech is not detected at all, using AGC may improve it. Note that how readily speech segments are detected can also be adjusted using Request Parameters.
Compression
Audio data for speech recognition does not need to use lossless compression such as FLAC. Compression often has little effect on accuracy, but strong compression that makes the audio hard even for humans to hear can reduce recognition accuracy.
The AmiVoice Tech Blog has published a test of how sampling rate and compression rate affect speech recognition accuracy. Please refer to it as well.
【We Tested It!】How Does Speech Recognition Accuracy Change with Sampling Rate and Compression Rate!? (Japanese blog)