
Appropriate Use of Interface

AmiVoice API provides three interfaces depending on the use case.

This section explains the differences between each interface and the criteria for choosing the appropriate one.

Synchronous HTTP Interface

This is the simplest method to implement. You send an audio file via HTTP POST with request parameters, and the response contains the transcribed text of the speech. This can be used only when the audio file size is 16MB or less.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to verify recognition accuracy
  • When you want to easily integrate into applications
  • When you want to recognize short utterances

Although recognition results cannot be obtained sequentially, this interface can be used in production if the utterances you want to recognize are sufficiently short (a few seconds) and real-time results are not strongly required.

Advantages

It can be easily implemented by simply sending audio and a few parameters via HTTP POST. You can also try it immediately using tools that can send HTTP requests, such as curl, Postman, or REST Client in Visual Studio Code.
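
As a concrete illustration, here is a minimal Python sketch of a synchronous request using the requests library. The endpoint URL, the field names (u for the app key, d for the engine parameters, a for the audio file), and the engine name -a-general are assumptions for illustration only; check the API reference for the actual values.

    # Minimal sketch of a synchronous recognition request (Python + requests).
    # The endpoint URL, the field names "u", "d", "a", and the engine name
    # "-a-general" are assumptions for illustration; see the API reference.
    import requests

    APP_KEY = "YOUR_APP_KEY"                                  # placeholder
    ENDPOINT = "https://acp-api.amivoice.com/v1/recognize"    # assumed endpoint

    with open("short_utterance.wav", "rb") as audio:          # must be 16MB or less
        response = requests.post(
            ENDPOINT,
            data={
                "u": APP_KEY,                                  # authentication app key
                "d": "grammarFileNames=-a-general",            # engine / connection parameters
            },
            files={"a": audio},                                # audio file to recognize
        )

    response.raise_for_status()
    print(response.json().get("text", ""))                    # transcribed text

The whole transcript is returned in a single JSON response once recognition finishes.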

Points to Note

  • An error will occur if the audio file exceeds 16MB.
  • It is possible to send the audio in chunks using chunked transfer encoding (see the sketch after this list). However, the results are still returned all at once.
  • Even if the sent audio data contains multiple utterance segments, it does not return time information for each utterance segment. If you need results for utterance segments, please use the Asynchronous HTTP Interface or WebSocket Interface.
  • It does not return intermediate recognition results (hypotheses before the speech recognition result is finalized). If you need intermediate recognition results, please use the WebSocket Interface.
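
For reference, the chunked transfer mentioned above can be sketched as follows with the requests library, which switches to Transfer-Encoding: chunked when the request body is a generator. Whether parameters may be passed in the query string when the audio is sent as the raw request body is an assumption here; see the API reference for the supported request formats.

    # Sketch of uploading audio with chunked transfer encoding (Python + requests).
    # Passing parameters in the query string alongside a raw audio body is an
    # assumption for illustration; the results are still returned all at once.
    import requests

    APP_KEY = "YOUR_APP_KEY"
    ENDPOINT = "https://acp-api.amivoice.com/v1/recognize"     # assumed endpoint

    def audio_chunks(path, chunk_size=32 * 1024):
        # A generator body makes requests use Transfer-Encoding: chunked.
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    response = requests.post(
        ENDPOINT,
        params={"u": APP_KEY, "d": "grammarFileNames=-a-general"},
        data=audio_chunks("short_utterance.wav"),
        headers={"Content-Type": "application/octet-stream"},
    )
    print(response.json())                                     # single, final result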

Resources

Asynchronous HTTP Interface

Speech recognition processing is executed asynchronously on the server side as a speech recognition job. You poll the job status until it completes, and then retrieve the recognition results.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to convert large audio files (larger than 16MB) to text
  • When you don't need to receive responses sequentially and want to improve accuracy even slightly

For example, it is intended for batch processing such as transcribing call center recording files or meeting recording files.

note

The time required for speech recognition is approximately 0.5 to 1.5 times the length of the submitted audio. For example, if you submit a one-hour audio file, it will take about 30 to 90 minutes to get the results. In such cases, if you perform the request and result retrieval in a single HTTP session, you need to maintain the session for a long time, and there is a possibility that you may not get the results if the session is disconnected midway. Therefore, the synchronous HTTP interface limits the audio file size to 16MB or less, and for larger audio files, we use the asynchronous HTTP interface to separate the request and result retrieval.
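
To make this separation concrete, here is a rough Python sketch of the asynchronous flow: one request submits the job, and separate requests poll for the result. The endpoint URLs, the response field names (sessionid, status), and the way the polling request is authenticated are all assumptions for illustration; the actual job API is described in the reference under Resources.

    # Rough sketch of the asynchronous flow: submit a job, then poll for the
    # result in separate requests (Python + requests). Endpoint URLs and the
    # "sessionid" / "status" field names are assumptions for illustration.
    import time
    import requests

    APP_KEY = "YOUR_APP_KEY"
    BASE = "https://acp-api-async.amivoice.com/v1"             # assumed base URL

    # 1. Submit the recognition job; large files (over 16MB) are accepted here.
    with open("meeting_recording.wav", "rb") as audio:
        job = requests.post(
            f"{BASE}/recognitions",
            data={"u": APP_KEY, "d": "grammarFileNames=-a-general"},
            files={"a": audio},
        ).json()
    job_id = job["sessionid"]                                  # assumed job identifier

    # 2. Poll until the job finishes. Recognition takes roughly 0.5 to 1.5
    #    times the audio length, so poll at a modest interval.
    while True:
        result = requests.get(f"{BASE}/recognitions/{job_id}",
                              params={"u": APP_KEY}).json()    # assumed auth style
        if result.get("status") in ("completed", "error"):     # assumed status values
            break
        time.sleep(30)

    print(result)   # completed results include per-utterance start/end times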

Advantages

  • There is a possibility of better accuracy compared to other interfaces.
    note

    In various test sets conducted by our company, the asynchronous HTTP interface has shown an error improvement rate that is, on average, 5 percentage points higher than that of the other interfaces. This is due to the following reasons:

    • Since there is no need to return responses immediately, it can read further ahead into the audio and use more future context when estimating the speech content at a given point in time.
    • It is configured to use more computational resources than the synchronous HTTP and WebSocket interfaces.
  • In addition to time information for each token, the start and end times of each utterance can be obtained from the results for utterance segments. If the audio file contains multiple utterance segments, you can use the start and end times of each one.

Points to Note

  • There is a delay of several tens of seconds to several minutes before the job starts, so the relative impact of this delay is large when recognizing short audio. The accuracy advantage described above also becomes smaller for short audio.
  • It does not return intermediate recognition results (hypotheses before the speech recognition result is finalized). If you need intermediate recognition results, please use the WebSocket Interface.

Resources

WebSocket Interface

By establishing a WebSocket connection, bidirectional communication becomes possible. You can send audio from a streaming source to the server in small chunks and sequentially obtain speech recognition results. It is suitable for applications that require real-time performance.

Figure. Overview of AmiVoice API

Use Cases

  • When you want to recognize audio from a streaming source and obtain and use the results sequentially
  • When you want to use intermediate results of speech recognition (hypotheses before the speech recognition result is finalized)
  • When you want to improve the accuracy of detecting the end of user speech

Advantages

  • You can obtain speech recognition results sequentially.
  • You can obtain intermediate recognition results (hypotheses before the speech recognition result is finalized). Since speech recognition results are finalized only after the end of an utterance is detected, it takes time before finalized results are available in real-time applications. By displaying intermediate results on the screen, you can give users quick feedback.
  • End-of-speech detection can be performed on the API side. For example, if you try to implement a dialogue application using the Synchronous HTTP Interface, you need to stop recording and create an audio file when the user finishes speaking. If the recording system has no end-of-speech detection function, or only a simple volume-based one, it may not work well when there is a lot of background noise or the user's voice is quiet. With the WebSocket interface, you can simply send audio data sequentially and let AmiVoice API's deep learning model detect the end of speech, which is more accurate than simple volume-based detection.
  • In addition to time information for each token, the start and end times of each utterance can be obtained from the results for utterance segments. If the audio contains multiple utterance segments, you can use the start and end times of each one.

Points to Note

  • After establishing a WebSocket connection, you need to communicate with the speech recognition server using a proprietary text-based protocol, which can make implementation complex (a rough sketch of a typical session is shown below).
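
The sketch below shows the general shape of such a session using the websocket-client Python library. The endpoint URL, the s / p / e command names, and the event formats for intermediate and final results are assumptions for illustration; the actual protocol is defined in the WebSocket interface reference under Resources.

    # Rough sketch of a WebSocket session (Python + websocket-client).
    # The endpoint URL, the "s"/"p"/"e" commands, and the server event format
    # are assumptions for illustration; see the protocol reference.
    import websocket                                  # pip install websocket-client

    APP_KEY = "YOUR_APP_KEY"
    ENDPOINT = "wss://acp-api.amivoice.com/v1/"       # assumed endpoint

    ws = websocket.create_connection(ENDPOINT)

    # Start a recognition session: audio format, engine, and authentication
    # (command syntax assumed for illustration).
    ws.send(f"s 16K -a-general authorization={APP_KEY}")
    print(ws.recv())                                  # acknowledgement from the server

    # Stream audio in small chunks, as a microphone capture would.
    with open("utterance.raw", "rb") as f:            # 16kHz/16bit PCM assumed
        while chunk := f.read(3200):                  # roughly 0.1s of audio per packet
            ws.send_binary(b"p" + chunk)

    ws.send("e")                                      # signal that no more audio follows

    # Intermediate and final results arrive as text events while audio is
    # still being sent; read until the server ends the session.
    while True:
        message = ws.recv()
        print(message)                                # intermediate / final result events
        if message.startswith("e"):                   # assumed end-of-session event
            break

    ws.close()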

Resources