Synchronous HTTP Interface

The synchronous HTTP interface is a Web API that allows easy conversion of short audio files to text.

Endpoint

Speech recognition requests are sent to one of the following endpoints. The endpoint differs depending on whether logging is enabled. For details, please see Logging.

POST https://acp-api.amivoice.com/v1/recognize   (logging)
POST https://acp-api.amivoice.com/v1/nolog/recognize (no logging)

Request

Request Parameter List

| Parameter Name | Required | Description |
| --- | --- | --- |
| u | Yes | Specify the APPKEY displayed on the My Page, or a One-time APPKEY. |
| d | Yes | Set various parameters related to the speech recognition request. Please see d parameter. |
| a | Yes | Set the binary data of the audio. This data must be the final part of the HTTP multipart. For sendable audio data, please see Audio Format in the usage guide. |
| c | Only for RAW (PCM) data | Format name when sending RAW data (PCM). For settable values, please see Audio Format. |
note
  • Parameters other than audio data can be sent either as query parameters or multipart. Setting the d parameter as a query parameter may exceed the request line limit, so sending as multipart is recommended.
  • If the same parameter is specified in both query parameters and multipart, the value set in the query parameter takes precedence.
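
To make the request shape concrete, here is a minimal sketch in Python using the third-party requests library. The APPKEY value, the audio file name, and the choice of the -a-general engine are placeholder assumptions for illustration only.

```python
# Minimal sketch of a synchronous recognition request, assuming the
# third-party "requests" library. APPKEY and the file name are placeholders.
import requests

URL = "https://acp-api.amivoice.com/v1/recognize"  # logging endpoint
APPKEY = "YOUR_APPKEY"  # placeholder: the APPKEY from My Page

with open("test.wav", "rb") as audio:  # placeholder audio file
    response = requests.post(
        URL,
        # u and d are sent as multipart form fields (recommended over
        # query parameters to avoid the request line limit).
        data={"u": APPKEY, "d": "grammarFileNames=-a-general"},
        # a (the audio binary) goes last: requests emits data= fields
        # before files=, satisfying the "final part" requirement.
        files={"a": audio},
    )

print(response.json())
```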

d parameter

In the d parameter, parameters are specified as key-value pairs separated by half-width spaces. The format of the d parameter is as follows:

<key>=<value> <key>=<value> <key>=<value> ...

URL encode <value> if it contains spaces. In the following example, two parameters, grammarFileNames and profileWords, are specified. For profileWords, a word with notation "www" and reading "とりぷるだぶる" is set.

grammarFileNames=-a-general profileWords=www%20%E3%81%A8%E3%82%8A%E3%81%B7%E3%82%8B%E3%81%A0%E3%81%B6%E3%82%8B
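
For reference, here is a sketch of producing this value with Python's standard urllib.parse.quote; the engine name and the word are taken from the example above.

```python
# Sketch of building the d parameter shown above. quote() percent-encodes
# the space (%20) and the UTF-8 bytes of the reading.
from urllib.parse import quote

profile_words = quote("www とりぷるだぶる")  # {notation} {reading}
d = f"grammarFileNames=-a-general profileWords={profile_words}"
# d == "grammarFileNames=-a-general profileWords="
#      "www%20%E3%81%A8%E3%82%8A%E3%81%B7%E3%82%8B%E3%81%A0%E3%81%B6%E3%82%8B"
```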

The following can be specified in the d parameter. The connection engine name (grammarFileNames) is required.

| Parameter Name | Value | Description |
| --- | --- | --- |
| grammarFileNames | {Connection Engine Name} | Specify the connection engine name. The list of available connection engine names is displayed on the My Page. Please also see List of Speech Recognition Engines. |
| profileId | String | ID to specify registered words. For details, please see Word Registration. |
| profileWords | String | List of word registrations valid only during the session. Specify as {notation} {reading} or {notation} {reading} {class name}. For multiple words, concatenate with \|. For details, please see Word Registration. |
| keepFillerToken | 0\|1 | Specify filler word output. Set to 1 to keep fillers in the results. The default is 0, where filler words are automatically removed from recognition results. Please see Specifying Filler Word Output. |
caution
  • For profileId, strings consisting of alphanumeric characters, "-" (hyphen), and "_" (underscore) can be used. However, strings starting with "__" (two underscores) are reserved by the speech recognition engine, so do not specify them.
  • When specifying both profileId and profileWords, profileId must be specified first, as in the sketch below.
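
The following sketch combines these rules. The profile ID, the second word, and the choice to percent-encode each word separately while leaving the | separator literal are all illustrative assumptions, not values or conventions confirmed by this document.

```python
# Sketch of combining profileId and profileWords in one d parameter.
# All values are illustrative. profileId must precede profileWords.
from urllib.parse import quote

# Encode each {notation} {reading} pair, then join with "|".
# (Assumption: the "|" separator itself is left unencoded.)
words = "|".join([
    quote("www とりぷるだぶる"),   # {notation} {reading}
    quote("ACP えーしーぴー"),     # another illustrative pair
])
d = " ".join([
    "grammarFileNames=-a-general",  # required connection engine name
    "profileId=my-profile_01",      # must come before profileWords
    f"profileWords={words}",
])
```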

Response

Response Structure

<result> contains the following JSON:

| Field | Description |
| --- | --- |
| results | Array of "recognition results for utterance segments" *The array always contains only 1 element. |
| results[].confidence | Confidence (value from 0 to 1; 0: low confidence, 1: high confidence) |
| results[].starttime | Utterance start time (0 is the beginning of the audio data) |
| results[].endtime | Utterance end time (0 is the beginning of the audio data) |
| results[].tags | Unused (empty array) |
| results[].rulename | Unused (empty string) |
| results[].text | Recognition result text |
| results[].tokens | Array of morphemes of the recognition result text |
| results[].tokens[].written | Notation of the morpheme (word) |
| results[].tokens[].confidence | Confidence of the morpheme (likelihood of recognition result) |
| results[].tokens[].starttime | Start time of the morpheme (0 is the beginning of the audio data) |
| results[].tokens[].endtime | End time of the morpheme (0 is the beginning of the audio data) |
| results[].tokens[].spoken | Reading of the morpheme *3 |
| utteranceid | Recognition result information ID *1 |
| text | Overall recognition result text combining all "recognition results for utterance segments" |
| code | 1-character code representing the result *2 |
| message | String representing the error content *2 |

*1 For the WebSocket speech recognition protocol, the recognition result information ID is the ID assigned to the recognition result information for each utterance segment. For the HTTP speech recognition protocol, it is the ID assigned to the recognition result information for the entire audio data uploaded in one session (which may include multiple utterance segments).

*2 On successful recognition: body.code == "" and body.message == "" and body.text != "". On failed recognition: body.code != "" and body.message != "" and body.text == "".

*3 For the Japanese engines, spoken is in hiragana. For the English engines, spoken is not a reading (please ignore it). For the Chinese engines, spoken is pinyin.
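
Building on the request sketch earlier, here is a minimal sketch of reading these fields, applying the success/failure convention from *2; the field access assumes the structure in the table above.

```python
# Sketch of inspecting the parsed response per *2 above. "response" is
# the requests.Response object from the earlier request sketch.
body = response.json()

if body["code"] == "" and body["message"] == "":  # successful recognition
    print("text:", body["text"])
    for segment in body["results"]:               # always exactly one element
        for token in segment["tokens"]:
            print(token["written"], token["spoken"],
                  token["confidence"], token["starttime"], token["endtime"])
else:                                             # failed recognition
    print("error:", body["code"], body["message"])
```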

List of codes and messages

When values are set in code and message included in <result>, it indicates that the request has failed. The causes of failure are as follows:

| code | message | Description |
| --- | --- | --- |
| + | received unsupported audio format | Received audio data in an unsupported audio data format |
| - | received illegal service authorization | Received an invalid APPKEY (service authentication key string) |
| ! | failed to connect to recognizer server | Communication failure within the speech recognition server (failed to connect to DSRM or DSRS) |
| > | failed to send audio data to recognizer server | Communication failure within the speech recognition server (failed to send audio data to DSRS) |
| < | failed to receive recognition result from recognizer server | Communication failure within the speech recognition server (failed to receive recognition results from DSRS) |
| # | received invalid recognition result from recognizer server | Communication failure within the speech recognition server (invalid format of recognition results received from DSRS) |
| $ | timeout occurred while receiving audio data from client | A no-communication timeout occurred while receiving audio data from the client |
| % | received too large audio data from client | The number of bytes of audio data received from the client is too large (does not occur with the WebSocket interface) |
| o | recognition result is rejected because confidence is below the threshold | Recognition failed because the confidence of the entire recognition result fell below the confidence threshold *This error is also returned when no utterance could be detected in the received audio data, so no recognition result can be returned; possible causes include audio data loss or an incorrect audio data format specification |
| b | recognition result is rejected because recognizer server is busy | Recognition failed because the speech recognition server is busy |
| x | recognition result is rejected because grammar files are not loaded | Recognition failed because the dictionary is not loaded |
| c | recognition result is rejected because the recognition process is cancelled | Recognition failed because a request to interrupt the recognition process was made |
| ? | recognition result is rejected because fatal error occurred in recognizer server | Recognition failed because a fatal error occurred during recognition on the speech recognition server |
| ^ | invalid parameter (...) | An invalid parameter was specified. Only for the Asynchronous HTTP Interface. |