Synchronous HTTP Interface

The synchronous HTTP interface is a Web API that allows easy conversion of short audio files to text.

Endpoint

Speech recognition requests are sent to one of the following endpoints. The endpoint differs depending on whether logging is enabled. For details, please see Logging.

POST https://acp-api.amivoice.com/v1/recognize   (logging)
POST https://acp-api.amivoice.com/v1/nolog/recognize (no logging)

Request

Request Parameter List

| Parameter Name | Required | Description |
| --- | --- | --- |
| u | Yes | Specify the APPKEY displayed on the My Page, or a One-time APPKEY. |
| d | Yes | Set various parameters related to the speech recognition request. Please see d parameter. |
| a | Yes | Set the binary data of the audio. This data must be the final part of the HTTP multipart. For sendable audio data, please see Audio Format in the usage guide. |
| c | Only for RAW (PCM) data | Format name when sending RAW data (PCM). For settable values, please see Audio Format. |
note
  • Parameters other than audio data can be sent either as query parameters or multipart. Setting the d parameter as a query parameter may exceed the request line limit, so sending as multipart is recommended.
  • If the same parameter is specified in both query parameters and multipart, the value set in the query parameter takes precedence.
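
To make the request shape concrete, here is a minimal sketch in Python using the third-party requests library. The APPKEY value, the audio file name, and the choice of the -a-general engine are placeholder assumptions for illustration only.

```python
# Minimal sketch of a synchronous recognition request, assuming the
# third-party "requests" library. APPKEY and the file name are placeholders.
import requests

URL = "https://acp-api.amivoice.com/v1/recognize"  # logging endpoint
APPKEY = "YOUR_APPKEY"  # placeholder: the APPKEY from My Page

with open("test.wav", "rb") as audio:  # placeholder audio file
    response = requests.post(
        URL,
        # u and d are sent as multipart form fields (recommended over
        # query parameters to avoid the request line limit).
        data={"u": APPKEY, "d": "grammarFileNames=-a-general"},
        # a (the audio binary) goes last: requests emits data= fields
        # before files=, satisfying the "final part" requirement.
        files={"a": audio},
    )

print(response.json())
```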

d parameter

In the d parameter, parameters are specified as key-value pairs separated by half-width spaces. The format of the d parameter is as follows:

<key>=<value> <key>=<value> <key>=<value> ...

URL encode <value> if it contains spaces. In the following example, two parameters, grammarFileNames and profileWords, are specified. For profileWords, a word with notation "www" and reading "とりぷるだぶる" is set.

grammarFileNames=-a-general profileWords=www%20%E3%81%A8%E3%82%8A%E3%81%B7%E3%82%8B%E3%81%A0%E3%81%B6%E3%82%8B
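
For reference, here is a sketch of producing this value with Python's standard urllib.parse.quote; the engine name and the word are taken from the example above.

```python
# Sketch of building the d parameter shown above. quote() percent-encodes
# the space (%20) and the UTF-8 bytes of the reading.
from urllib.parse import quote

profile_words = quote("www とりぷるだぶる")  # {notation} {reading}
d = f"grammarFileNames=-a-general profileWords={profile_words}"
# d == "grammarFileNames=-a-general profileWords="
#      "www%20%E3%81%A8%E3%82%8A%E3%81%B7%E3%82%8B%E3%81%A0%E3%81%B6%E3%82%8B"
```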

The following can be specified in the d parameter. The connection engine name (grammarFileNames) is required.

| Parameter Name | Value | Description |
| --- | --- | --- |
| grammarFileNames | {Connection Engine Name} | Specify the connection engine name. The list of available connection engine names is displayed on the My Page. Please also see List of Speech Recognition Engines. |
| profileId | String | ID to specify registered words. For details, please see Word Registration. |
| profileWords | String | List of word registrations valid only during the session. Specify as {notation} {reading} or {notation} {reading} {class name}. For multiple words, concatenate with \|. For details, please see Word Registration. |
| keepFillerToken | 0\|1 | Specify filler word output. Set to 1 to keep fillers in the results. The default is 0, where filler words are automatically removed from recognition results. Please see Specifying Filler Word Output. |
caution
  • For profileId, strings consisting of alphanumeric characters, "-" (hyphen), and "_" (underscore) can be used. However, strings starting with "__" (two underscores) are reserved by the speech recognition engine, so do not specify them.
  • When specifying both profileId and profileWords, profileId must be specified first, as in the sketch below.
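
The following sketch combines these rules. The profile ID, the second word, and the choice to percent-encode each word separately while leaving the | separator literal are all illustrative assumptions, not values or conventions confirmed by this document.

```python
# Sketch of combining profileId and profileWords in one d parameter.
# All values are illustrative. profileId must precede profileWords.
from urllib.parse import quote

# Encode each {notation} {reading} pair, then join with "|".
# (Assumption: the "|" separator itself is left unencoded.)
words = "|".join([
    quote("www とりぷるだぶる"),   # {notation} {reading}
    quote("ACP えーしーぴー"),     # another illustrative pair
])
d = " ".join([
    "grammarFileNames=-a-general",  # required connection engine name
    "profileId=my-profile_01",      # must come before profileWords
    f"profileWords={words}",
])
```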

Response

Response Structure

<result> contains the following JSON:

| Field | Description |
| --- | --- |
| results | Array of "recognition results for utterance segments" *The array always contains only 1 element. |
| results[].confidence | Confidence (value from 0 to 1; 0: low confidence, 1: high confidence) |
| results[].starttime | Utterance start time (0 is the beginning of the audio data) |
| results[].endtime | Utterance end time (0 is the beginning of the audio data) |
| results[].tags | Unused (empty array) |
| results[].rulename | Unused (empty string) |
| results[].text | Recognition result text |
| results[].tokens | Array of morphemes of the recognition result text |
| results[].tokens[].written | Notation of the morpheme (word) |
| results[].tokens[].confidence | Confidence of the morpheme (likelihood of recognition result) |
| results[].tokens[].starttime | Start time of the morpheme (0 is the beginning of the audio data) |
| results[].tokens[].endtime | End time of the morpheme (0 is the beginning of the audio data) |
| results[].tokens[].spoken | Reading of the morpheme *3 |
| utteranceid | Recognition result information ID *1 |
| text | Overall recognition result text combining all "recognition results for utterance segments" |
| code | 1-character code representing the result *2 |
| message | String representing the error content *2 |

*1 For the WebSocket speech recognition protocol, the recognition result information ID is the ID assigned to the recognition result information for each utterance segment. For the HTTP speech recognition protocol, it is the ID assigned to the recognition result information for the entire audio data uploaded in one session (which may include multiple utterance segments).

*2 On successful recognition: body.code == "" and body.message == "" and body.text != "". On failed recognition: body.code != "" and body.message != "" and body.text == "".

*3 For the Japanese engines, spoken is in hiragana. For the English engines, spoken is not a reading (please ignore it). For the Chinese engines, spoken is pinyin.
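
Building on the request sketch earlier, here is a minimal sketch of reading these fields, applying the success/failure convention from *2; the field access assumes the structure in the table above.

```python
# Sketch of inspecting the parsed response per *2 above. "response" is
# the requests.Response object from the earlier request sketch.
body = response.json()

if body["code"] == "" and body["message"] == "":  # successful recognition
    print("text:", body["text"])
    for segment in body["results"]:               # always exactly one element
        for token in segment["tokens"]:
            print(token["written"], token["spoken"],
                  token["confidence"], token["starttime"], token["endtime"])
else:                                             # failed recognition
    print("error:", body["code"], body["message"])
```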

List of codes and messages

When values are set in code and message included in <result>, it indicates that the request has failed. The causes of failure are as follows:

| code | message | Description |
| --- | --- | --- |
| + | received unsupported audio format | Received audio data in an unsupported audio data format |
| - | received illegal service authorization | Received an invalid APPKEY (service authentication key string) |
| ! | failed to connect to recognizer server | Communication failure within the speech recognition server (failed to connect to DSRM or DSRS) |
| > | failed to send audio data to recognizer server | Communication failure within the speech recognition server (failed to send audio data to DSRS) |
| < | failed to receive recognition result from recognizer server | Communication failure within the speech recognition server (failed to receive recognition results from DSRS) |
| # | received invalid recognition result from recognizer server | Communication failure within the speech recognition server (invalid format of recognition results received from DSRS) |
| $ | timeout occurred while receiving audio data from client | A no-communication timeout occurred while receiving audio data from the client |
| % | received too large audio data from client | The number of bytes of audio data received from the client is too large (does not occur with the WebSocket interface) |
| o | recognition result is rejected because confidence is below the threshold | Recognition failed because the confidence of the entire recognition result fell below the confidence threshold *This error is also returned when no utterance could be detected in the received audio data, so no recognition result can be returned; possible causes include audio data loss or an incorrect audio data format specification |
| b | recognition result is rejected because recognizer server is busy | Recognition failed because the speech recognition server is busy |
| x | recognition result is rejected because grammar files are not loaded | Recognition failed because the dictionary is not loaded |
| c | recognition result is rejected because the recognition process is cancelled | Recognition failed because a request to interrupt the recognition process was made |
| ? | recognition result is rejected because fatal error occurred in recognizer server | Recognition failed because a fatal error occurred during recognition on the speech recognition server |
| ^ | invalid parameter (...) | An invalid parameter was specified. Only for the Asynchronous HTTP Interface. |