Commands and Responses
This section explains the commands and responses when sending voice data as a stream using the WebSocket interface.
The client uses the commands in the table below to request the start and end of speech recognition from the API, and to send data. The client waits for the API's responses to the `s` and `e` commands before sending the next command. An error response is returned when an error occurs.
| Name | Description |
|---|---|
| `s` command | Command for requesting the start of speech recognition from the client |
| `p` command | Command for sending voice data from the client |
| `e` command | Command for requesting the end of speech recognition from the client |
The API returns the events in the table below according to the progress of processing.
| Name | Description |
|---|---|
| `S` event | Event notifying the start of speech detected by the voice activity detection process |
| `E` event | Event notifying the end of speech detected by the voice activity detection process |
| `C` event | Event notifying the start of speech recognition by the speech recognition process |
| `U` event | Event notifying an interim result of speech recognition by the speech recognition process |
| `A` event | Event notifying the completion and result of speech recognition by the speech recognition process |
| `G` event | Event notifying information generated by the server. This does not occur if the related features are not used. |
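As a rough illustration of how a client might route these one-letter events, here is a minimal Python sketch. The message format (an event letter, optionally followed by a space and a payload) matches the logs shown later in this section; the print statements are placeholders for real handling.

```python
# Minimal event dispatcher sketch. The event letters match the table
# above; the print statements are placeholders for real handling.
def handle_event(message: str) -> None:
    event, _, body = message.partition(" ")
    if event == "S":
        print(f"speech started at {int(body)} ms")   # e.g. "S 6200"
    elif event == "E":
        print(f"speech ended at {int(body)} ms")     # e.g. "E 7450"
    elif event == "C":
        print("speech recognition started")
    elif event == "U":
        print(f"interim result: {body}")             # body is a JSON string
    elif event == "A":
        print(f"final result: {body}")               # body is a JSON string
    elif event == "G":
        print(f"server-generated information: {body}")
```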
The general flow of commands and events in an application using voice streaming is as follows. `S` and `E` are obtained as results of voice activity detection, while `C`, `U`, and `A` are obtained as results of speech recognition. The order of events may differ depending on the length of the utterance period and the congestion of the underlying system. For example, in the figure above, `C` comes between `S` and `E`, but it may also come after `E`.
Here, we explain using an example in which voice activity detection finds three utterance periods in a given piece of audio data. Using logs, we trace the commands sent from the client and the event responses returned by the API while data from a streaming source is sent to the API at 1-second intervals over the WebSocket interface.
- Lines starting with `command>>>` are command transmissions from the client
- Lines starting with `message<<<` are event responses from the API
The overall log is as follows:

```
0:00:00.000 open> wss://acp-api.amivoice.com/v1/
0:00:00.161 open # WebSocket connection
0:00:00.161 command>>> s 16k -a-general segmenterProperties="useDiarizer=1" resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX
0:00:00.226 message<<< s # Response to s command for speech recognition request
0:00:00.226 command>>> p [..(32000 bytes)..] # Send 1st second of data
0:00:01.232 command>>> p [..(32000 bytes)..] # Send 2nd second of data
0:00:02.236 command>>> p [..(32000 bytes)..] # Send 3rd second of data
0:00:03.241 command>>> p [..(32000 bytes)..] # Send 4th second of data
0:00:04.245 command>>> p [..(32000 bytes)..] # Send 5th second of data
0:00:05.250 command>>> p [..(32000 bytes)..] # Send 6th second of data
0:00:06.254 command>>> p [..(32000 bytes)..] # Send 7th second of data
0:00:07.257 command>>> p [..(32000 bytes)..] # Send 8th second of data
0:00:07.291 message<<< S 6200 # [Utterance 1] Speech detected at 6.2 seconds
0:00:07.297 message<<< C # [Utterance 1] Speech recognition processing started
0:00:08.262 command>>> p [..(32000 bytes)..] # Send 9th second of data
0:00:08.315 message<<< E 7450 # [Utterance 1] Speech ended at 7.45 seconds
0:00:08.315 message<<< U {...} # [Utterance 1] Interim result
0:00:08.446 message<<< A {...} # [Utterance 1] Result for utterance period
0:00:09.267 command>>> p [..(32000 bytes)..] # Send 10th second of data
0:00:10.270 command>>> p [..(32000 bytes)..] # Send 11th second of data
0:00:10.309 message<<< S 8600 # [Utterance 2] Speech detected at 8.6 seconds
0:00:10.327 message<<< C # [Utterance 2] Speech recognition processing started
0:00:11.272 command>>> p [..(32000 bytes)..] # Send 12th second of data
0:00:11.337 message<<< U {...} # [Utterance 2] Interim result
0:00:12.274 command>>> p [..(32000 bytes)..] # Send 13th second of data
0:00:12.321 message<<< U {...} # [Utterance 2] Interim result
0:00:13.277 command>>> p [..(32000 bytes)..] # Send 14th second of data
0:00:13.301 message<<< E 11650 # [Utterance 2] Speech ended at 11.65 seconds
0:00:13.304 message<<< S 12000 # [Utterance 3] Speech detected at 12.00 seconds
0:00:13.311 message<<< U {...} # [Utterance 2] Interim result
0:00:13.343 message<<< A {...} # [Utterance 2] Result for utterance period
0:00:14.282 command>>> p [..(32000 bytes)..] # Send 15th second of data
0:00:14.336 message<<< C # [Utterance 3] Speech recognition processing started
0:00:15.287 command>>> p [..(32000 bytes)..] # Send 16th second of data
0:00:15.344 message<<< U {...} # [Utterance 3] Interim result
0:00:16.289 command>>> p [..(32000 bytes)..] # Send 17th second of data
0:00:16.337 message<<< U {...} # [Utterance 3] Interim result
0:00:17.291 command>>> p [..(22968 bytes)..] # Send 18th second of data
0:00:17.345 message<<< U {...} # [Utterance 3] Interim result
0:00:18.297 command>>> e # Send e command to end session after all voice data has been sent
0:00:18.341 message<<< U {...} # [Utterance 3] Interim result
0:00:18.347 message<<< E 17700 # [Utterance 3] Speech ended at 17.70 seconds
0:00:18.347 message<<< U {...} # [Utterance 3] Interim result
0:00:18.512 message<<< A {...} # [Utterance 3] Result for utterance period
0:00:18.574 message<<< e # Response to session end
0:00:18.574 close> # WebSocket close from client
0:00:18.595 close # Response to WebSocket close
```
Here's a step-by-step explanation:
First, connect to the AmiVoice API endpoint via WebSocket and make a speech recognition request with the `s` command. The API returns a response to the `s` command.

```
0:00:00.000 open> wss://acp-api.amivoice.com/v1/
0:00:00.161 open # WebSocket connection
0:00:00.161 command>>> s 16k -a-general segmenterProperties="useDiarizer=1" resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX
0:00:00.226 message<<< s # Response to s command for speech recognition request
```
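As a minimal sketch of this step in Python, using the third-party websocket-client package (the command string and placeholder app key mirror the log above):

```python
# Connect and issue the 's' command (pip install websocket-client).
import websocket

ws = websocket.create_connection("wss://acp-api.amivoice.com/v1/")
ws.send('s 16k -a-general segmenterProperties="useDiarizer=1" '
        'resultUpdatedInterval=1000 authorization=XXXXXXXXXXXXXXXX')

reply = ws.recv()  # "s" on success, "s <error message>" on failure
if reply != "s":
    raise RuntimeError(f"speech recognition request failed: {reply}")
```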
After the request succeeds, the client sends voice data every second with the `p` command. After the 8th second of data has been sent, the API sends an `S` (speech detection) event.

```
0:00:00.226 command>>> p [..(32000 bytes)..] # Send 1st second of data
0:00:01.232 command>>> p [..(32000 bytes)..] # Send 2nd second of data
0:00:02.236 command>>> p [..(32000 bytes)..] # Send 3rd second of data
0:00:03.241 command>>> p [..(32000 bytes)..] # Send 4th second of data
0:00:04.245 command>>> p [..(32000 bytes)..] # Send 5th second of data
0:00:05.250 command>>> p [..(32000 bytes)..] # Send 6th second of data
0:00:06.254 command>>> p [..(32000 bytes)..] # Send 7th second of data
0:00:07.257 command>>> p [..(32000 bytes)..] # Send 8th second of data
0:00:07.291 message<<< S 6200 # [Utterance 1] Speech detected at 6.2 seconds
```
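Continuing the sketch, the loop below sends one second of audio at a time. Each 32,000-byte chunk corresponds to one second of 16 kHz, 16-bit mono PCM; the framing of the `p` command as a binary frame prefixed with the letter `p` is assumed here (see WebSocket Interface for the exact format), and the file stands in for a live streaming source.

```python
# Continuing the sketch above: send audio in 1-second chunks. The
# 'p'-prefixed binary frame is an assumption; see WebSocket Interface.
import time

CHUNK_BYTES = 32000  # 16000 samples/s * 2 bytes/sample * 1 second

with open("audio.pcm", "rb") as source:  # placeholder for a live source
    while chunk := source.read(CHUNK_BYTES):
        ws.send_binary(b"p" + chunk)
        time.sleep(1.0)  # pace the stream in real time, as in the log
```

While this loop runs, events keep arriving on the same connection, so a real client reads them on a separate thread; a sketch of such a reader follows a little further below.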
After that, events are received in the order of `C` (speech recognition start) and then `A` (result for the utterance period). Also, an `E` (end of speech detection) event is received, indicating that the speech ended at 7.45 seconds. Since `resultUpdatedInterval=1000` was specified in the `s` command when connecting, a `U` (interim result) event is received every second.

```
0:00:07.297 message<<< C # [Utterance 1] Speech recognition processing started
0:00:08.262 command>>> p [..(32000 bytes)..] # Send 9th second of data
0:00:08.315 message<<< E 7450 # [Utterance 1] Speech ended at 7.45 seconds
0:00:08.315 message<<< U {...} # [Utterance 1] Interim result
0:00:08.446 message<<< A {...} # [Utterance 1] Result for utterance period
```
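Since bursts of events like the one above arrive while `p` frames are still being sent, a client typically reads them on a separate thread. A minimal sketch, reusing `ws` and the hypothetical `handle_event` dispatcher from the earlier sketches:

```python
# Event reader running alongside the send loop. It stops once the 'e'
# response (or an 'e' error response) arrives.
import threading

def read_events() -> None:
    while True:
        message = ws.recv()
        if not isinstance(message, str) or not message:
            continue  # ignore non-text frames
        if message == "e" or message.startswith("e "):
            break     # session ended (or the e command failed)
        if message[0] in "SECUAG":
            handle_event(message)

reader = threading.Thread(target=read_events, daemon=True)
reader.start()
```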
Speech detection and speech recognition processing are repeated for the remaining two utterances.
After sending all the voice data, send the `e` command to end the session. The API completes all remaining speech detection and speech recognition processing, then returns an `e` event as the response that ends the session.

```
0:00:18.297 command>>> e # Send e command to end session after all voice data has been sent
0:00:18.341 message<<< U {...} # [Utterance 3] Interim result
0:00:18.347 message<<< E 17700 # [Utterance 3] Speech ended at 17.70 seconds
0:00:18.347 message<<< U {...} # [Utterance 3] Interim result
0:00:18.512 message<<< A {...} # [Utterance 3] Result for utterance period
0:00:18.574 message<<< e # Response to session end
0:00:18.574 close> # WebSocket close from client
0:00:18.595 close # Response to WebSocket close
```
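To finish the sketch, the client sends the `e` command as a text frame, waits for the reader thread from the earlier sketch to observe the `e` response, and closes the WebSocket, mirroring the last three lines of the log:

```python
# End the session: send 'e', wait for the reader thread to see the 'e'
# response, then close the WebSocket from the client side.
ws.send("e")
reader.join(timeout=30.0)  # remaining U/E/A events and 'e' arrive meanwhile
ws.close()
```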
For the entire 18-second session, please see the comments in the log mentioned earlier.
The `A` and `U` event responses contain results. For details, please see Speech Recognition Result Format. For WebSocket commands and events, please also see WebSocket Interface.
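As a small illustration, and assuming the result JSON carries a `text` field (see Speech Recognition Result Format for the authoritative field list), the recognized text could be pulled out of a `U` or `A` event like this:

```python
# Extract the recognized text from a U or A event. The "text" field is
# an assumption here; consult Speech Recognition Result Format.
import json

def result_text(message: str) -> str:
    body = message.split(" ", 1)[1]          # drop the leading "U " or "A "
    return json.loads(body).get("text", "")
```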
The sequence of commands and events in the above log is illustrated in the figure below; the `p` commands are omitted from the figure.
Error Responses
If the `s`, `p`, or `e` command sent by the client fails for some reason (for example, an error in the command transmission procedure, a violation of the limitations, or a problem on the server side), an error response containing an error message may be returned.
Successful response:

```
s
p
e
```

Error response:

```
s Error message
p Error message
e Error message
```
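In other words, a response is an error exactly when something follows the command letter. A minimal check along those lines (a sketch, assuming the text-frame formats shown above):

```python
# A successful response is the bare command letter; an error response
# carries a message after it.
def is_error_response(message: str) -> bool:
    return message[:1] in ("s", "p", "e") and len(message) > 1

# is_error_response("e")               -> False (success)
# is_error_response("e Error message") -> True  (error)
```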
In case of an error response, the session returns to the initial state, before the `s` command was sent. Please send the request again, starting from the `s` command. For details, please see the voice supply state transition diagram in Packet and State Transitions.
Please check the error message: in case of a client error, correct the cause and send the request again; in case of a server error on the AmiVoice API side, wait for a while and then send the request again. For errors due to limitations, please see Limitations. For details on the individual error messages, please see the reference documentation.
To handle server errors in response to the `s` command, or unexpected network transmission errors, it is effective to retry until the `s` command succeeds. In this case, you can create a more robust application by taking measures such as using a ring buffer to avoid losing voice from the streaming source.
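A minimal sketch of such a retry loop, with a bounded ring buffer built from `collections.deque`; the buffer size, backoff policy, and `connect_with_retry` helper are illustrative, not part of the API:

```python
# Retry the 's' command while buffering captured audio in a ring buffer,
# so that audio recorded during reconnection is not lost. deque(maxlen=N)
# silently drops the oldest chunk once the buffer is full.
import collections
import time
import websocket

ring = collections.deque(maxlen=30)  # up to 30 one-second audio chunks

def connect_with_retry(url: str, s_command: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            ws = websocket.create_connection(url)
            ws.send(s_command)
            if ws.recv() == "s":
                return ws            # success: drain 'ring' with 'p' frames
            ws.close()               # error response: retry from the 's' command
        except (OSError, websocket.WebSocketException):
            pass                     # network error: fall through and retry
        time.sleep(2 ** attempt)     # simple exponential backoff
    raise RuntimeError("the s command did not succeed after retries")
```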