Speech Detection
Speech segments are the portions of audio data in which a person is speaking. Audio data contains human voices as well as other sounds, such as silence and background noise. Before speech recognition is performed, speech segments are detected so that only those portions are processed. This reduces computational load and prevents non-speech audio from being erroneously recognized as speech. AmiVoice API uses deep learning models to distinguish human voices from other sounds, providing more accurate speech detection than simple volume-based methods.
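To see why volume alone is not enough, the following is a purely illustrative sketch of a simple volume-threshold detector, not how AmiVoice API detects speech internally. It assumes a 16-bit mono PCM WAV file, and the frame size and threshold are arbitrary example values: any loud noise above the threshold is treated as speech, and quiet speech is missed.

```python
import wave

import numpy as np


def naive_volume_segments(path: str, frame_ms: int = 20, threshold: float = 500.0):
    """Mark each frame as 'speech' if its RMS volume exceeds a fixed threshold."""
    with wave.open(path, "rb") as wf:  # assumes 16-bit mono PCM WAV
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    frame_len = int(rate * frame_ms / 1000)
    flags = []
    for start in range(0, len(samples) - frame_len, frame_len):
        frame = samples[start:start + frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2))
        # Loud background noise is misclassified as speech; quiet speech is missed.
        flags.append(rms > threshold)
    return flags
```

A model-based detector, by contrast, learns the acoustic characteristics of speech itself rather than relying on loudness alone.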
The following figure illustrates the flow when audio data is sent from the client to the AmiVoice API. First, speech detection is performed, followed by speech recognition processing. In the figure, the purple bands represent speech segments. Three speech segments are detected, and speech recognition is performed for each of them.
The asynchronous HTTP interface and WebSocket interface provide time information, speech recognition results, and confidence scores for each speech segment. For more details, please see Speech Segment Results. Additionally, the WebSocket interface allows real-time reception of speech start and end timings. For more information, please see Obtaining Status Events.
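As a rough illustration of working with per-segment results, the sketch below reads the time information, confidence score, and text for each detected speech segment from a job result returned by the asynchronous HTTP interface. The field names ("segments", "results", "starttime", "endtime", "confidence", "text") follow the Speech Segment Results page; treat this as a sketch and adjust it to the actual response you receive.

```python
import json


def print_segments(response_body: str) -> None:
    """Print the time range, confidence, and text of each detected speech segment."""
    data = json.loads(response_body)
    for i, segment in enumerate(data.get("segments", [])):
        for result in segment.get("results", []):
            start = result.get("starttime")   # segment start, in milliseconds
            end = result.get("endtime")       # segment end, in milliseconds
            conf = result.get("confidence")   # confidence score for this segment
            text = result.get("text", "")
            print(f"segment {i}: {start}-{end} ms (confidence {conf}): {text}")
```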
Speech segment results cannot be obtained with the synchronous HTTP interface.
Adjusting Speech Detection Parameters
Speech detection parameters can be adjusted to suit the usage scenario. The default values work well for many AmiVoice API use cases, so it is recommended to first try speech recognition with the default settings, observe the results, and adjust parameters only if necessary. For applications such as dictation or meeting transcription, changes are often unnecessary. For applications such as call center IVRs or robot interactions, specific parameters such as sensitivity or the speech end detection time may need to be changed. For details on the adjustable parameters and how to set them, please see segmenterProperties.
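As one possible way to attach these settings to a request, the sketch below builds a space-separated key=value string of the kind used in the d parameter, with the segmenterProperties value percent-encoded so it remains a single key=value pair. This is a sketch under those assumptions, and "example_property" is a placeholder, not a real parameter name: see the segmenterProperties page for the parameters that actually exist and the exact way your interface expects them to be encoded.

```python
from urllib.parse import quote


def build_d_parameter(engine: str, segmenter_properties: dict[str, str]) -> str:
    """Assemble a space-separated key=value string for the d parameter."""
    props = " ".join(f"{k}={v}" for k, v in segmenter_properties.items())
    # Percent-encode the value so its '=' and spaces stay inside one key=value pair.
    return f"grammarFileNames={engine} segmenterProperties={quote(props)}"


# "example_property" is a hypothetical placeholder for illustration only.
print(build_d_parameter("-a-general", {"example_property": "1"}))
# -> grammarFileNames=-a-general segmenterProperties=example_property%3D1
```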