
Appropriate Use of Interface

AmiVoice API provides three interfaces depending on the use case.

This section explains the differences between each interface and the criteria for choosing the appropriate one.

Synchronous HTTP Interface

This is the simplest method to implement. You send an audio file via HTTP POST with request parameters, and the response contains the transcribed text of the speech. This can be used only when the audio file size is 16MB or less.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to verify recognition accuracy
  • When you want to easily integrate into applications
  • When you want to recognize short utterances

Although recognition results cannot be obtained sequentially, this interface can be used in production if the utterances you want to recognize are sufficiently short (a few seconds) and real-time results are not strongly required.

Advantages

It can be easily implemented by simply sending audio and a few parameters via HTTP POST. You can also try it immediately using tools that can send HTTP requests, such as curl, Postman, or REST Client in Visual Studio Code.
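
As a concrete illustration, here is a minimal Python sketch of a synchronous request using the requests library. The endpoint URL, the field names (u for the app key, d for the engine parameters, a for the audio file), and the engine name -a-general are assumptions for illustration only; check the API reference for the actual values.

    # Minimal sketch of a synchronous recognition request (Python + requests).
    # The endpoint URL, the field names "u", "d", "a", and the engine name
    # "-a-general" are assumptions for illustration; see the API reference.
    import requests

    APP_KEY = "YOUR_APP_KEY"                                  # placeholder
    ENDPOINT = "https://acp-api.amivoice.com/v1/recognize"    # assumed endpoint

    with open("short_utterance.wav", "rb") as audio:          # must be 16MB or less
        response = requests.post(
            ENDPOINT,
            data={
                "u": APP_KEY,                                  # authentication app key
                "d": "grammarFileNames=-a-general",            # engine / connection parameters
            },
            files={"a": audio},                                # audio file to recognize
        )

    response.raise_for_status()
    print(response.json().get("text", ""))                    # transcribed text

The whole transcript is returned in a single JSON response once recognition finishes.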

Points to Note

  • An error will occur if the audio file exceeds 16MB.
  • It is possible to send the audio in chunks using chunked transfer encoding (see the sketch after this list). However, the results are still returned all at once.
  • Even if the sent audio data contains multiple utterance segments, it does not return time information for each utterance segment. If you need results for utterance segments, please use the Asynchronous HTTP Interface or WebSocket Interface.
  • It does not return intermediate recognition results (hypotheses before the speech recognition result is finalized). If you need intermediate recognition results, please use the WebSocket Interface.
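
For reference, the chunked transfer mentioned above can be sketched as follows with the requests library, which switches to Transfer-Encoding: chunked when the request body is a generator. Whether parameters may be passed in the query string when the audio is sent as the raw request body is an assumption here; see the API reference for the supported request formats.

    # Sketch of uploading audio with chunked transfer encoding (Python + requests).
    # Passing parameters in the query string alongside a raw audio body is an
    # assumption for illustration; the results are still returned all at once.
    import requests

    APP_KEY = "YOUR_APP_KEY"
    ENDPOINT = "https://acp-api.amivoice.com/v1/recognize"     # assumed endpoint

    def audio_chunks(path, chunk_size=32 * 1024):
        # A generator body makes requests use Transfer-Encoding: chunked.
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    response = requests.post(
        ENDPOINT,
        params={"u": APP_KEY, "d": "grammarFileNames=-a-general"},
        data=audio_chunks("short_utterance.wav"),
        headers={"Content-Type": "application/octet-stream"},
    )
    print(response.json())                                     # single, final result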

Resources

Asynchronous HTTP Interface

Speech recognition processing is executed asynchronously on the server side as a speech recognition job. You poll the job status until it completes, and then retrieve the recognition results.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to convert large audio files (larger than 16MB) to text
  • When you don't need to receive responses sequentially and want to improve accuracy even slightly

For example, it is intended for batch processing such as transcribing call center recording files or meeting recording files.

note

The time required for speech recognition is approximately 0.5 to 1.5 times the length of the submitted audio. For example, if you submit a one-hour audio file, it will take about 30 to 90 minutes to get the results. In such cases, if you perform the request and result retrieval in a single HTTP session, you need to maintain the session for a long time, and there is a possibility that you may not get the results if the session is disconnected midway. Therefore, the synchronous HTTP interface limits the audio file size to 16MB or less, and for larger audio files, we use the asynchronous HTTP interface to separate the request and result retrieval.
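
To make this separation concrete, here is a rough Python sketch of the asynchronous flow: one request submits the job, and separate requests poll for the result. The endpoint URLs, the response field names (sessionid, status), and the way the polling request is authenticated are all assumptions for illustration; the actual job API is described in the reference under Resources.

    # Rough sketch of the asynchronous flow: submit a job, then poll for the
    # result in separate requests (Python + requests). Endpoint URLs and the
    # "sessionid" / "status" field names are assumptions for illustration.
    import time
    import requests

    APP_KEY = "YOUR_APP_KEY"
    BASE = "https://acp-api-async.amivoice.com/v1"             # assumed base URL

    # 1. Submit the recognition job; large files (over 16MB) are accepted here.
    with open("meeting_recording.wav", "rb") as audio:
        job = requests.post(
            f"{BASE}/recognitions",
            data={"u": APP_KEY, "d": "grammarFileNames=-a-general"},
            files={"a": audio},
        ).json()
    job_id = job["sessionid"]                                  # assumed job identifier

    # 2. Poll until the job finishes. Recognition takes roughly 0.5 to 1.5
    #    times the audio length, so poll at a modest interval.
    while True:
        result = requests.get(f"{BASE}/recognitions/{job_id}",
                              params={"u": APP_KEY}).json()    # assumed auth style
        if result.get("status") in ("completed", "error"):     # assumed status values
            break
        time.sleep(30)

    print(result)   # completed results include per-utterance start/end times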

Advantages

  • There is a possibility of better accuracy compared to other interfaces.
    note

    In various test sets conducted by our company, the asynchronous HTTP interface has shown an error improvement rate that is, on average, 5 percentage points higher than that of the other interfaces. This is due to the following reasons:

    • Since there is no need to return responses immediately, it can read further ahead into the audio and use more future context when estimating the speech content at a given point in time.
    • It is configured to use more computational resources than the synchronous HTTP and WebSocket interfaces.
  • In addition to time information for each token, the start and end times of each utterance can be obtained from the results for utterance segments. If the audio file contains multiple utterance segments, you can use the start and end times of each one.

Points to Note

  • There is a delay of several tens of seconds to several minutes before the job starts, so the relative impact of this delay is large when recognizing short audio. The accuracy advantage described above also becomes smaller for short audio.
  • It does not return intermediate recognition results (hypotheses before the speech recognition result is finalized). If you need intermediate recognition results, please use the WebSocket Interface.

Resources

WebSocket Interface

By establishing a WebSocket connection, bidirectional communication becomes possible. You can send audio from a streaming source to the server in small chunks and sequentially obtain speech recognition results. It is suitable for applications that require real-time performance.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to recognize audio from a streaming source and obtain and use the results sequentially
  • When you want to use intermediate results of speech recognition (hypotheses before the speech recognition result is finalized)
  • When you want to improve the accuracy of detecting the end of user speech

Advantages

  • You can obtain speech recognition results sequentially.
  • You can obtain intermediate recognition results (hypotheses before the speech recognition result is finalized). Since speech recognition results are finalized only after the end of an utterance is detected, it takes time before finalized results are available in real-time applications. By displaying intermediate results on the screen, you can give users quick feedback.
  • End-of-speech detection can be performed on the API side. For example, if you try to implement a dialogue application using the Synchronous HTTP Interface, you need to stop recording and create an audio file when the user finishes speaking. If the recording system has no end-of-speech detection function, or only a simple volume-based one, it may not work well when there is a lot of background noise or the user's voice is quiet. With the WebSocket interface, you can simply send audio data sequentially and let AmiVoice API's deep learning model detect the end of speech, which is more accurate than simple volume-based detection.
  • In addition to time information for each token, the start and end times of each utterance can be obtained from the results for utterance segments. If the audio contains multiple utterance segments, you can use the start and end times of each one.

Points to Note

  • After establishing a WebSocket connection, you need to communicate with the speech recognition server using a proprietary text-based protocol, which can make implementation complex (a rough sketch of a typical session is shown below).
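
The sketch below shows the general shape of such a session using the websocket-client Python library. The endpoint URL, the s / p / e command names, and the event formats for intermediate and final results are assumptions for illustration; the actual protocol is defined in the WebSocket interface reference under Resources.

    # Rough sketch of a WebSocket session (Python + websocket-client).
    # The endpoint URL, the "s"/"p"/"e" commands, and the server event format
    # are assumptions for illustration; see the protocol reference.
    import websocket                                  # pip install websocket-client

    APP_KEY = "YOUR_APP_KEY"
    ENDPOINT = "wss://acp-api.amivoice.com/v1/"       # assumed endpoint

    ws = websocket.create_connection(ENDPOINT)

    # Start a recognition session: audio format, engine, and authentication
    # (command syntax assumed for illustration).
    ws.send(f"s 16K -a-general authorization={APP_KEY}")
    print(ws.recv())                                  # acknowledgement from the server

    # Stream audio in small chunks, as a microphone capture would.
    with open("utterance.raw", "rb") as f:            # 16kHz/16bit PCM assumed
        while chunk := f.read(3200):                  # roughly 0.1s of audio per packet
            ws.send_binary(b"p" + chunk)

    ws.send("e")                                      # signal that no more audio follows

    # Intermediate and final results arrive as text events while audio is
    # still being sent; read until the server ends the session.
    while True:
        message = ws.recv()
        print(message)                                # intermediate / final result events
        if message.startswith("e"):                   # assumed end-of-session event
            break

    ws.close()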

Resources