Speech Detection
Speech segments are the portions of audio data in which a person is speaking. Audio data contains human voices as well as other sounds, such as silence and background noise. Before speech recognition is performed, speech segments are detected so that only those portions are processed. This reduces computational load and prevents non-speech audio from being erroneously recognized as speech. AmiVoice API uses deep learning models to distinguish human voices from other sounds, providing more accurate speech detection than simple volume-based methods.
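To see why volume alone is not enough, the following is a purely illustrative sketch of a simple volume-threshold detector, not how AmiVoice API detects speech internally. It assumes a 16-bit mono PCM WAV file, and the frame size and threshold are arbitrary example values: any loud noise above the threshold is treated as speech, and quiet speech is missed.

```python
import wave

import numpy as np


def naive_volume_segments(path: str, frame_ms: int = 20, threshold: float = 500.0):
    """Mark each frame as 'speech' if its RMS volume exceeds a fixed threshold."""
    with wave.open(path, "rb") as wf:  # assumes 16-bit mono PCM WAV
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    frame_len = int(rate * frame_ms / 1000)
    flags = []
    for start in range(0, len(samples) - frame_len, frame_len):
        frame = samples[start:start + frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2))
        # Loud background noise is misclassified as speech; quiet speech is missed.
        flags.append(rms > threshold)
    return flags
```

A model-based detector, by contrast, learns the acoustic characteristics of speech itself rather than relying on loudness alone.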
The following figure illustrates the flow when audio data is sent from the client to the AmiVoice API. First, speech detection is performed, followed by speech recognition processing. In the figure, the purple bands represent speech segments. Three speech segments are detected, and speech recognition is performed for each of them.
The asynchronous HTTP interface and WebSocket interface provide time information, speech recognition results, and confidence scores for each speech segment. For more details, please see Speech Segment Results. Additionally, the WebSocket interface allows real-time reception of speech start and end timings. For more information, please see Obtaining Status Events.
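As a rough illustration of working with per-segment results, the sketch below reads the time information, confidence score, and text for each detected speech segment from a job result returned by the asynchronous HTTP interface. The field names ("segments", "results", "starttime", "endtime", "confidence", "text") follow the Speech Segment Results page; treat this as a sketch and adjust it to the actual response you receive.

```python
import json


def print_segments(response_body: str) -> None:
    """Print the time range, confidence, and text of each detected speech segment."""
    data = json.loads(response_body)
    for i, segment in enumerate(data.get("segments", [])):
        for result in segment.get("results", []):
            start = result.get("starttime")   # segment start, in milliseconds
            end = result.get("endtime")       # segment end, in milliseconds
            conf = result.get("confidence")   # confidence score for this segment
            text = result.get("text", "")
            print(f"segment {i}: {start}-{end} ms (confidence {conf}): {text}")
```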
Speech segment results cannot be obtained with the synchronous HTTP interface.
Adjusting Speech Detection Parameters
Speech detection parameters can be adjusted to suit the usage scenario. The default values work well for many AmiVoice API use cases, so it is recommended to first try speech recognition with the default settings, observe the results, and adjust parameters only if necessary. For applications such as dictation or meeting transcription, changes are often unnecessary. For applications such as call center IVRs or robot interactions, specific parameters such as sensitivity or the speech end detection time may need to be changed. For details on the adjustable parameters and how to set them, please see segmenterProperties.
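As one possible way to attach these settings to a request, the sketch below builds a space-separated key=value string of the kind used in the d parameter, with the segmenterProperties value percent-encoded so it remains a single key=value pair. This is a sketch under those assumptions, and "example_property" is a placeholder, not a real parameter name: see the segmenterProperties page for the parameters that actually exist and the exact way your interface expects them to be encoded.

```python
from urllib.parse import quote


def build_d_parameter(engine: str, segmenter_properties: dict[str, str]) -> str:
    """Assemble a space-separated key=value string for the d parameter."""
    props = " ".join(f"{k}={v}" for k, v in segmenter_properties.items())
    # Percent-encode the value so its '=' and spaces stay inside one key=value pair.
    return f"grammarFileNames={engine} segmenterProperties={quote(props)}"


# "example_property" is a hypothetical placeholder for illustration only.
print(build_d_parameter("-a-general", {"example_property": "1"}))
# -> grammarFileNames=-a-general segmenterProperties=example_property%3D1
```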