Commands and Responses
This section explains the commands and responses when sending voice data as a stream using the WebSocket interface.
The client uses the commands in the table below to request the start and end of speech recognition from the API, and to send data. The client waits for the API's responses to the `s` and `e` commands before sending the next command. An error response is returned when an error occurs.
| Name | Description |
|---|---|
| `s` command | Command for requesting the start of speech recognition from the client |
| `p` command | Command for sending voice data from the client |
| `e` command | Command for requesting the end of speech recognition from the client |
The API returns the events in the table below according to the progress of processing.
| Name | Description |
|---|---|
| `S` event | Event notifying the start of speech detected by the voice activity detection process |
| `E` event | Event notifying the end of speech detected by the voice activity detection process |
| `C` event | Event notifying the start of speech recognition by the speech recognition process |
| `U` event | Event notifying an interim result of speech recognition by the speech recognition process |
| `A` event | Event notifying the completion and result of speech recognition by the speech recognition process |
| `G` event | Event notifying information generated by the server. This does not occur if the related features are not used. |
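As a rough illustration of how a client might route these one-letter events, here is a minimal Python sketch. The message format (an event letter, optionally followed by a space and a payload) matches the logs shown later in this section; the print statements are placeholders for real handling.

```python
# Minimal event dispatcher sketch. The event letters match the table
# above; the print statements are placeholders for real handling.
def handle_event(message: str) -> None:
    event, _, body = message.partition(" ")
    if event == "S":
        print(f"speech started at {int(body)} ms")   # e.g. "S 6200"
    elif event == "E":
        print(f"speech ended at {int(body)} ms")     # e.g. "E 7450"
    elif event == "C":
        print("speech recognition started")
    elif event == "U":
        print(f"interim result: {body}")             # body is a JSON string
    elif event == "A":
        print(f"final result: {body}")               # body is a JSON string
    elif event == "G":
        print(f"server-generated information: {body}")
```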
The general flow of commands and events in an application using voice streaming is as follows. `S` and `E` are obtained as results of voice activity detection, while `C`, `U`, and `A` are obtained as results of speech recognition. The order of events may differ depending on the length of the utterance period and the congestion of the underlying system. For example, in the figure above, `C` comes between `S` and `E`, but it may also come after `E`.
Here, we explain using an example in which voice activity detection finds three utterance periods in a given piece of audio data. Using logs, we trace the commands sent from the client and the event responses returned by the API while data from a streaming source is sent to the API at 1-second intervals over the WebSocket interface.
- Lines starting with `command>>>` are command transmissions from the client
- Lines starting with `message<<<` are event responses from the API
The overall log is as follows:

```
0:00:00.000 open> wss://acp-api.amivoice.com/v1/
0:00:00.161 open # WebSocket connection
0:00:00.161 command>>> s 16k -a-general segmenterProperties="useDiarizer=1" resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX
0:00:00.226 message<<< s # Response to s command for speech recognition request
0:00:00.226 command>>> p [..(32000 bytes)..] # Send 1st second of data
0:00:01.232 command>>> p [..(32000 bytes)..] # Send 2nd second of data
0:00:02.236 command>>> p [..(32000 bytes)..] # Send 3rd second of data
0:00:03.241 command>>> p [..(32000 bytes)..] # Send 4th second of data
0:00:04.245 command>>> p [..(32000 bytes)..] # Send 5th second of data
0:00:05.250 command>>> p [..(32000 bytes)..] # Send 6th second of data
0:00:06.254 command>>> p [..(32000 bytes)..] # Send 7th second of data
0:00:07.257 command>>> p [..(32000 bytes)..] # Send 8th second of data
0:00:07.291 message<<< S 6200 # [Utterance 1] Speech detected at 6.2 seconds
0:00:07.297 message<<< C # [Utterance 1] Speech recognition processing started
0:00:08.262 command>>> p [..(32000 bytes)..] # Send 9th second of data
0:00:08.315 message<<< E 7450 # [Utterance 1] Speech ended at 7.45 seconds
0:00:08.315 message<<< U {...} # [Utterance 1] Interim result
0:00:08.446 message<<< A {...} # [Utterance 1] Result for utterance period
0:00:09.267 command>>> p [..(32000 bytes)..] # Send 10th second of data
0:00:10.270 command>>> p [..(32000 bytes)..] # Send 11th second of data
0:00:10.309 message<<< S 8600 # [Utterance 2] Speech detected at 8.6 seconds
0:00:10.327 message<<< C # [Utterance 2] Speech recognition processing started
0:00:11.272 command>>> p [..(32000 bytes)..] # Send 12th second of data
0:00:11.337 message<<< U {...} # [Utterance 2] Interim result
0:00:12.274 command>>> p [..(32000 bytes)..] # Send 13th second of data
0:00:12.321 message<<< U {...} # [Utterance 2] Interim result
0:00:13.277 command>>> p [..(32000 bytes)..] # Send 14th second of data
0:00:13.301 message<<< E 11650 # [Utterance 2] Speech ended at 11.65 seconds
0:00:13.304 message<<< S 12000 # [Utterance 3] Speech detected at 12.00 seconds
0:00:13.311 message<<< U {...} # [Utterance 2] Interim result
0:00:13.343 message<<< A {...} # [Utterance 2] Result for utterance period
0:00:14.282 command>>> p [..(32000 bytes)..] # Send 15th second of data
0:00:14.336 message<<< C # [Utterance 3] Speech recognition processing started
0:00:15.287 command>>> p [..(32000 bytes)..] # Send 16th second of data
0:00:15.344 message<<< U {...} # [Utterance 3] Interim result
0:00:16.289 command>>> p [..(32000 bytes)..] # Send 17th second of data
0:00:16.337 message<<< U {...} # [Utterance 3] Interim result
0:00:17.291 command>>> p [..(22968 bytes)..] # Send 18th second of data
0:00:17.345 message<<< U {...} # [Utterance 3] Interim result
0:00:18.297 command>>> e # Send e command to end session after all voice data has been sent
0:00:18.341 message<<< U {...} # [Utterance 3] Interim result
0:00:18.347 message<<< E 17700 # [Utterance 3] Speech ended at 17.70 seconds
0:00:18.347 message<<< U {...} # [Utterance 3] Interim result
0:00:18.512 message<<< A {...} # [Utterance 3] Result for utterance period
0:00:18.574 message<<< e # Response to session end
0:00:18.574 close> # WebSocket close from client
0:00:18.595 close # Response to WebSocket close
```
Here's a step-by-step explanation:
First, connect to the AmiVoice API endpoint via WebSocket and make a speech recognition request with the `s` command. The API returns a response to the `s` command.

```
0:00:00.000 open> wss://acp-api.amivoice.com/v1/
0:00:00.161 open # WebSocket connection
0:00:00.161 command>>> s 16k -a-general segmenterProperties="useDiarizer=1" resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX
0:00:00.226 message<<< s # Response to s command for speech recognition request
```
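As a minimal sketch of this step in Python, using the third-party websocket-client package (the command string and placeholder app key mirror the log above):

```python
# Connect and issue the 's' command (pip install websocket-client).
import websocket

ws = websocket.create_connection("wss://acp-api.amivoice.com/v1/")
ws.send('s 16k -a-general segmenterProperties="useDiarizer=1" '
        'resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX')

reply = ws.recv()  # "s" on success, "s <error message>" on failure
if reply != "s":
    raise RuntimeError(f"speech recognition request failed: {reply}")
```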
After the request succeeds, the client sends voice data every second with the `p` command. After the 8th second of data has been sent, the API sends an `S` (speech detection) event.

```
0:00:00.226 command>>> p [..(32000 bytes)..] # Send 1st second of data
0:00:01.232 command>>> p [..(32000 bytes)..] # Send 2nd second of data
0:00:02.236 command>>> p [..(32000 bytes)..] # Send 3rd second of data
0:00:03.241 command>>> p [..(32000 bytes)..] # Send 4th second of data
0:00:04.245 command>>> p [..(32000 bytes)..] # Send 5th second of data
0:00:05.250 command>>> p [..(32000 bytes)..] # Send 6th second of data
0:00:06.254 command>>> p [..(32000 bytes)..] # Send 7th second of data
0:00:07.257 command>>> p [..(32000 bytes)..] # Send 8th second of data
0:00:07.291 message<<< S 6200 # [Utterance 1] Speech detected at 6.2 seconds
```
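Continuing the sketch, the loop below sends one second of audio at a time. Each 32,000-byte chunk corresponds to one second of 16 kHz, 16-bit mono PCM; the framing of the `p` command as a binary frame prefixed with the letter `p` is assumed here (see WebSocket Interface for the exact format), and the file stands in for a live streaming source.

```python
# Continuing the sketch above: send audio in 1-second chunks. The
# 'p'-prefixed binary frame is an assumption; see WebSocket Interface.
import time

CHUNK_BYTES = 32000  # 16000 samples/s * 2 bytes/sample * 1 second

with open("audio.pcm", "rb") as source:  # placeholder for a live source
    while chunk := source.read(CHUNK_BYTES):
        ws.send_binary(b"p" + chunk)
        time.sleep(1.0)  # pace the stream in real time, as in the log
```

While this loop runs, events keep arriving on the same connection, so a real client reads them on a separate thread; a sketch of such a reader follows a little further below.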
After that, events are received in the order of `C` (speech recognition start) and then `A` (result for the utterance period). Also, an `E` (end of speech detection) event is received, indicating that the speech ended at 7.45 seconds. Since `resultUpdatedInterval=1000` was specified in the `s` command when connecting, a `U` (interim result) event is received every second.

```
0:00:07.297 message<<< C # [Utterance 1] Speech recognition processing started
0:00:08.262 command>>> p [..(32000 bytes)..] # Send 9th second of data
0:00:08.315 message<<< E 7450 # [Utterance 1] Speech ended at 7.45 seconds
0:00:08.315 message<<< U {...} # [Utterance 1] Interim result
0:00:08.446 message<<< A {...} # [Utterance 1] Result for utterance period
```
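Since bursts of events like the one above arrive while `p` frames are still being sent, a client typically reads them on a separate thread. A minimal sketch, reusing `ws` and the hypothetical `handle_event` dispatcher from the earlier sketches:

```python
# Event reader running alongside the send loop. It stops once the 'e'
# response (or an 'e' error response) arrives.
import threading

def read_events() -> None:
    while True:
        message = ws.recv()
        if not isinstance(message, str) or not message:
            continue  # ignore non-text frames
        if message == "e" or message.startswith("e "):
            break     # session ended (or the e command failed)
        if message[0] in "SECUAG":
            handle_event(message)

reader = threading.Thread(target=read_events, daemon=True)
reader.start()
```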
Speech detection and speech recognition processing are repeated for the remaining two utterances.
After sending all the voice data, send the `e` command to end the session. The API completes all remaining speech detection and speech recognition processing, then returns an `e` event as the response that ends the session.

```
0:00:18.297 command>>> e # Send e command to end session after all voice data has been sent
0:00:18.341 message<<< U {...} # [Utterance 3] Interim result
0:00:18.347 message<<< E 17700 # [Utterance 3] Speech ended at 17.70 seconds
0:00:18.347 message<<< U {...} # [Utterance 3] Interim result
0:00:18.512 message<<< A {...} # [Utterance 3] Result for utterance period
0:00:18.574 message<<< e # Response to session end
0:00:18.574 close> # WebSocket close from client
0:00:18.595 close # Response to WebSocket close
```
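To finish the sketch, the client sends the `e` command as a text frame, waits for the reader thread from the earlier sketch to observe the `e` response, and closes the WebSocket, mirroring the last three lines of the log:

```python
# End the session: send 'e', wait for the reader thread to see the 'e'
# response, then close the WebSocket from the client side.
ws.send("e")
reader.join(timeout=30.0)  # remaining U/E/A events and 'e' arrive meanwhile
ws.close()
```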
For the entire 18-second session, please see the comments in the log mentioned earlier.
The `A` and `U` event responses contain results. For details, please see Speech Recognition Result Format. For WebSocket commands and events, please also see WebSocket Interface.
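As a small illustration, and assuming the result JSON carries a `text` field (see Speech Recognition Result Format for the authoritative field list), the recognized text could be pulled out of a `U` or `A` event like this:

```python
# Extract the recognized text from a U or A event. The "text" field is
# an assumption here; consult Speech Recognition Result Format.
import json

def result_text(message: str) -> str:
    body = message.split(" ", 1)[1]          # drop the leading "U " or "A "
    return json.loads(body).get("text", "")
```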
The sequence of commands and events in the above log is illustrated in the figure below; the `p` commands are omitted from the figure.
Error Responses
If the `s`, `p`, or `e` command sent by the client fails for some reason (for example, an error in the command transmission procedure, a violation of the limitations, or a problem on the server side), an error response containing an error message may be returned.
Successful response:

```
s
p
e
```

Error response:

```
s Error message
p Error message
e Error message
```
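In other words, a response is an error exactly when something follows the command letter. A minimal check along those lines (a sketch, assuming the text-frame formats shown above):

```python
# A successful response is the bare command letter; an error response
# carries a message after it.
def is_error_response(message: str) -> bool:
    return message[:1] in ("s", "p", "e") and len(message) > 1

# is_error_response("e")               -> False (success)
# is_error_response("e Error message") -> True  (error)
```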
In case of an error response, the session returns to the initial state, before the `s` command was sent. Please send the request again, starting from the `s` command. For details, please see the voice supply state transition diagram in Packet and State Transitions.
Please check the error message: in case of a client error, correct the cause and send the request again; in case of a server error on the AmiVoice API side, wait for a while and then send the request again. For errors due to limitations, please see Limitations. For details on the individual error messages, please see the reference documentation.
To handle server errors in response to the `s` command, or unexpected network transmission errors, it is effective to retry until the `s` command succeeds. In this case, you can create a more robust application by taking measures such as using a ring buffer to avoid losing voice from the streaming source.
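A minimal sketch of such a retry loop, with a bounded ring buffer built from `collections.deque`; the buffer size, backoff policy, and `connect_with_retry` helper are illustrative, not part of the API:

```python
# Retry the 's' command while buffering captured audio in a ring buffer,
# so that audio recorded during reconnection is not lost. deque(maxlen=N)
# silently drops the oldest chunk once the buffer is full.
import collections
import time
import websocket

ring = collections.deque(maxlen=30)  # up to 30 one-second audio chunks

def connect_with_retry(url: str, s_command: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            ws = websocket.create_connection(url)
            ws.send(s_command)
            if ws.recv() == "s":
                return ws            # success: drain 'ring' with 'p' frames
            ws.close()               # error response: retry from the 's' command
        except (OSError, websocket.WebSocketException):
            pass                     # network error: fall through and retry
        time.sleep(2 ** attempt)     # simple exponential backoff
    raise RuntimeError("the s command did not succeed after retries")
```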