Request Parameters
This section explains the parameters to be set when requesting speech recognition with AmiVoice API. Although the transmission methods differ for HTTP and WebSocket interfaces, the parameters that can be set are the same.
List of Parameters
authorization (authentication information) and grammarFileNames (connection engine name) are required. Other parameters are optional. Some parameters are not supported by all interfaces, so please see the table below.
| Parameter Name | Description | Required | Sync HTTP | WebSocket | Async HTTP |
|---|---|---|---|---|---|
| authorization | Authentication information | ● | ● | ● | ● |
| grammarFileNames | Connection engine name | ● | ● | ● | ● |
| profileId | Profile ID | | ● | ● | ● |
| profileWords | Word registration list | | ● | ● | ● |
| keepFillerToken | Suppression of automatic filler word removal | | ● | ● | ● |
| segmenterProperties | Parameters for speech segment detection and speaker diarization | | ● | ● | ● (*1) |
| extension | Usage aggregation tag | | ● | ● | ● |
| maxDecodingTime | Maximum recognition processing time | | ● | ● | ● |
| maxResponseTime | Maximum response time | | ● | ● | ● |
| maxDecodingRate | Maximum RT | | ● | ● | ● |
| targetResponseTime | Target response time | | ● | ● | ● |
| targetDecodingRate | Target RT | | ● | ● | ● |
| recognitionTimeout | Recognition completion timeout | | ● | ● | ● |
| resultUpdatedInterval | Interval for recognition in progress events | | | ● | |
| noInputTimeout | Speech start timeout | | | ● | |
| loggingOptOut | Change logging or no logging settings | | | | ● |
| contentId | User-defined ID | | | | ● |
| compatibleWithSync | Result format compatibility | | | | ● |
| speakerDiarization | Speaker diarization enable option | | | | ● |
| diarizationMinSpeaker | Minimum estimated number of speakers for diarization | | | | ● |
| diarizationMaxSpeaker | Maximum estimated number of speakers for diarization | | | | ● |
| sentimentAnalysis | Sentiment analysis enable option | | | | ● |
(*1) In the Asynchronous HTTP interface, parameters related to speaker diarization cannot be used
For information on how to send these request parameters, please see the request section for each interface (Synchronous HTTP, Asynchronous HTTP, and WebSocket).
Parameter Details
The following sections explain the details of each parameter.
Required Parameters
authorization
Authentication information
Authentication information must be set to use the API. The authentication information is the APPKEY listed on MyPage, or a one-time APPKEY obtained from the One-time APPKEY Issuance API.
When connecting to the speech recognition server from a browser application, please use a one-time APPKEY to avoid writing the APPKEY in the HTML file. For details, please see One-time APPKEY.
grammarFileNames
Connection engine name
Specify the speech recognition engine you want to use for the session. Specify one for each session. The values that can be set are listed in the Connection Engine Name table or on MyPage. For details, please see Speech Recognition Engines.
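As a minimal sketch, a Synchronous HTTP request needs only these two required parameters. This reuses the endpoint and form fields from the curl example shown later under segmenterProperties; the authentication APPKEY is passed as the u field there, and the engine name -a-general and file name are illustrative:

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d=grammarFileNames=-a-general \
  -F a=@test.wav
```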
Optional Parameters
profileId
Profile ID
A profile is a user-specific data file that exists on the speech recognition server, where users can name and save registered words. The profile ID is an identifier to specify that data file. For details, please see Word Registration.
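For example, to recognize with a previously saved profile on the Synchronous HTTP interface (the profile ID my-profile is hypothetical; use an ID you created via word registration):

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general profileId=my-profile" \
  -F a=@test.wav
```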
profileWords
Word registration list
You can register words that are valid for the session. Each word is registered in the format "notation (single-byte space) pronunciation". To also specify a class name, use the format "notation (single-byte space) pronunciation (single-byte space) class name". When registering multiple words, separate them with a "|" (single-byte vertical bar). The value format is as follows (example without class names):
notation1 pronunciation1|notation2 pronunciation2|notation3 pronunciation3|notation4 pronunciation4
For details, please see Word Registration.
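A sketch of registering session words on the Synchronous HTTP interface. The words are illustrative, and note that spaces, vertical bars, and non-ASCII characters inside the d value must be URL-encoded when actually sent (for example, the space as %20 and "|" as %7C), as with the segmenterProperties example later in this section:

```
# Value shown unencoded for readability; URL-encode it before sending.
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general profileWords=AmiVoice あみぼいす|API えーぴーあい" \
  -F a=@test.wav
```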
keepFillerToken
Suppression of automatic filler word removal
Specify 1 or 0. The default is 0. Specify 1 if you do not want to automatically remove filler words (such as "あー" or "えー") included in the speech recognition results. Also see Automatic Filler Word Removal.
Filler words are surrounded by single-byte "%" before and after the word. Here are examples of filler words:
%あー%
%えー%
%おー%
%えっと%
Please also see the AmiVoice Tech Blog article How to choose whether to display or remove unnecessary words (fillers) with AmiVoice API.
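For example, to keep fillers in the results on the Synchronous HTTP interface (reusing the minimal request shown under grammarFileNames):

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general keepFillerToken=1" \
  -F a=@test.wav
```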
segmenterProperties
Parameters for speech segment detection
These are parameters to adjust the ease of detecting speech segments, etc. Please try with the default settings first, and then adjust as needed. Parameters available for configuration are as follows.
The default values are common to the Synchronous HTTP and WebSocket interfaces, while the Asynchronous HTTP interface uses different values for some parameters; those values are shown in parentheses below.
Default values may be changed without notice.
threshold
- This is the threshold score for determining whether a period is a speech segment or not. If the score is greater than or equal to this value, it is considered a speech segment. Lowering this value makes speech segments easier to detect and less likely to be interrupted or cut off at the end, but it also increases the likelihood of false detections. If false detections are noticeable in noisy environments, increase this value.
- The default is 5000 (8000).

preTime
- When a period considered as speech continues for a certain time, the detector transitions to a state of detecting a speech segment. This value specifies the length of this "certain time". If short speech segments are not detected or the beginnings of speech segments are easily cut off, decrease this value. If there are many false detections of short noises, increase this value.
- The unit is milliseconds, and the default is 100 (100). Please specify a multiple of 50.

postTime
- When a period considered as non-speech continues for a certain time at the end of a speech segment, the detector ends the state of detecting the speech segment. This value specifies the length of this "certain time". If speech segments are being divided in the middle, increase this value. If two speech segments are being connected, decrease this value.
- The unit is milliseconds, and the default is 550 (800).

preForcedTime
- When transitioning to a state of detecting a speech segment, this value specifies how far back from the first time considered as speech to set the start point of the speech segment. If the beginnings of speech segments are easily cut off, increase this value.
- The unit is milliseconds, and the default is 350 (350).

postForcedTime
- When ending the state of detecting a speech segment, this value specifies how much time after the last time considered as non-speech to include in the speech segment. If the ends of speech segments are easily cut off, increase this value. If the real-time responsiveness of responses is poor, decrease this value.
- The unit is milliseconds, and the default is 350 (350).

powerThreshold
- This is the threshold score when volume (power) is taken into account in determining whether a period is a speech segment or not. It needs to be set separately from threshold, and is disabled if set to 0 or less. If the score is below this threshold, the period is considered non-speech. If quiet background sounds are detected too easily, increase this value.
- The default is 100 (100).

decayTime
- The value of postTime can be monotonically decreased after a certain time has passed, to make speech segments easier to cut off. This value specifies this "certain time". If the detected speech segments are too long, decrease this value.
- The unit is milliseconds, and the default is 5000 (15000).
Parameters related to speaker diarization
These are parameters related to speaker diarization. They can only be set in the Synchronous HTTP and WebSocket interfaces. The following parameters can be set:

useDiarizer
- Setting this to 1 enables speaker diarization in the Synchronous HTTP and WebSocket interfaces. It is disabled by default. For details, please see Speaker Diarization.

diarizerAlpha
- This parameter controls the ease of appearance of new speakers in speaker diarization for the Synchronous HTTP and WebSocket interfaces. Larger values make new speakers more likely to appear, while smaller values make new speakers less likely to appear. diarizerAlpha=0 is special and is treated as if 1e-30 was specified. However, for engines that support 8kHz audio (for example, when using the general-purpose engine (-a-general) and sending 8kHz audio), it is treated as if 1e-10 was specified. If nothing is set, it is treated as if diarizerAlpha=0 was specified.

diarizerTransitionBias
- This parameter controls the ease of speaker transitions in speaker diarization for the Synchronous HTTP and WebSocket interfaces. Larger values make speaker transitions more likely, while smaller values make speaker transitions less likely. diarizerTransitionBias=0 is special and is treated as if 1e-20 was specified. If nothing is set, it is treated as if diarizerTransitionBias=0 was specified.
How to Set Parameters
Write the parameter settings following segmenterProperties=. When setting multiple parameters simultaneously, separate each parameter with a single-byte space.
Setting example for Synchronous HTTP interface using curl command
URL-encode the single-byte spaces that separate multiple parameters as %20.
```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general segmenterProperties=threshold=4000%20postTime=600" \
  -F a=@test.wav
```
Setting example for WebSocket interface
Enclose the entire value of segmenterProperties in double quotes, like "...".
```
s 16K -a-general authorization={APPKEY} segmenterProperties="preTime=200 useDiarizer=1 diarizerAlpha=1e-20"
```
extension
Usage aggregation tag
When sharing the same AmiVoice API account across multiple systems, environments, end-users, etc., you can set an aggregation tag to obtain usage for arbitrary attributes. For details, please see Usage Aggregation Tag.
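For example, to tag requests from a particular system or end-user for usage aggregation on the Synchronous HTTP interface (the tag value customer-0001 is illustrative):

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general extension=customer-0001" \
  -F a=@test.wav
```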
maxDecodingTime
Maximum recognition processing time
A parameter that forcibly interrupts speech recognition processing when the following condition is met:

Recognition processing time > maxDecodingTime

The unit is milliseconds. The default is 0, and when it is 0, this function is disabled.
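For example, to cap recognition processing at 60 seconds on the Synchronous HTTP interface (the limit is illustrative):

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general maxDecodingTime=60000" \
  -F a=@test.wav
```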
maxResponseTime
Maximum response time
A parameter that forcibly interrupts speech recognition processing when the following condition is met:

Response time > maxResponseTime

The unit is milliseconds. Note that this parameter only takes effect after all audio data has been received.
The default is 0, and when it's 0, this function is disabled.
maxDecodingRate
Maximum RT
A parameter that forcibly interrupts speech recognition processing when the following condition is met:

RT > maxDecodingRate

Note that this parameter only takes effect after all audio data has been received.
The default is -1, and when it's a negative value, this function is disabled.
RT (Real Time factor) is a value that represents the processing speed of speech recognition, calculated as RT = Recognition processing time ÷ Duration of the input audio.
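For example, if recognizing 60 seconds of audio takes 30 seconds of processing, RT = 30 ÷ 60 = 0.5. With maxDecodingRate=1.5, processing would be interrupted once recognition has taken more than 1.5 times the duration of the input audio (checked after all audio data has been received).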
targetResponseTime
Target Response Time
Calculate "Provisional RT" as , and
This function dynamically adjusts the balance between speech recognition processing speed and accuracy during recognition. The unit is milliseconds. This function is not activated when the "Duration of input audio so far" is less than 1 second, or when the "Recognition processing time so far" is less than the "Duration of input audio so far" (i.e., when "Provisional RT" is not 1 or greater).
The default value is 0, and when set to 0, this function is disabled.
targetDecodingRate
Target RT
Calculate "Provisional RT" as , and
This function dynamically adjusts the balance between speech recognition processing speed and accuracy during recognition. This function is not activated when the "Duration of input audio so far" is less than 1 second, or when the "Recognition processing time so far" is less than the "Duration of input audio so far" (i.e., when "Provisional RT" is not 1 or greater).
The default value is -1, and when set to a negative value, this function is disabled.
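targetResponseTime and targetDecodingRate are set like any other d parameter; for example, on the Synchronous HTTP interface (the value 1.5 is illustrative):

```
curl https://acp-api.amivoice.com/v1/recognize \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general targetDecodingRate=1.5" \
  -F a=@test.wav
```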
recognitionTimeout
Recognition Completion Timeout
This function forcibly interrupts the processing if speech recognition does not complete within a specified time (= recognitionTimeout). The unit is milliseconds. When this function is enabled, if speech recognition succeeds for one utterance segment, no further recognition processing will be performed for subsequent utterance segments, even within the time set by recognitionTimeout.
For Synchronous HTTP interface and Asynchronous HTTP interface, the recognitionTimeout count starts from the point when the server begins receiving audio from the client. For WebSocket interface, it starts from the point when the beginning of the first utterance segment is detected. Note that the duration of recognitionTimeout is the time taken for recognition processing, not the duration of the audio data.
The default value is 0, and when set to 0, this function is disabled.
This function is effective in cases where it is important to find an utterance segment containing valid recognition results as quickly as possible, such as in a question-and-answer voicebot, and where recognition processing is not necessary for subsequent utterance segments. If you enable this function and find that the utterance segments are interrupted mid-speech, resulting in incomplete recognition results, adjust the postTime in segmenterProperties.
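A sketch of the voicebot case on the WebSocket interface, following the s command format shown under segmenterProperties (the 10-second timeout is illustrative):

```
s 16K -a-general authorization={APPKEY} recognitionTimeout=10000
```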
WebSocket API Specific Parameters
resultUpdatedInterval
Interval for recognition in progress events
Specifies the interval in milliseconds at which recognition in progress events are issued.
- Setting it to 0 will not issue recognition in progress events.
- Setting a positive value issues a recognition-in-progress event each time the specified interval elapses. If the interim recognition result has not been updated during that interval, the previous interim result is resent with a "." appended to the end. Values that are not multiples of 100 are treated as if rounded up to the next multiple of 100.
noInputTimeout
Speech Start Timeout
This function forcibly ends the speech recognition session and cancels all subsequent recognition processing if no utterance segment is detected for a certain period (= noInputTimeout). The unit is milliseconds. The length of this time is the time taken for speech recognition processing, not the duration of the audio data.
When the session ends, the client receives an "e timeout occurred" error message.
The default value is 0, and when set to 0, this function is disabled.
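A sketch combining both WebSocket-specific parameters in an s command (the values are illustrative):

```
s 16K -a-general authorization={APPKEY} resultUpdatedInterval=500 noInputTimeout=5000
```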
Asynchronous HTTP Interface Specific Parameters
loggingOptOut
Change logging or no logging settings
loggingOptOut=<True|False>
Specifies whether to disable logging. When set to True, the system will not retain logs for the session. The default is False.
contentId
User-defined ID
contentId=<arbitrary string>
You can specify an arbitrary string defined on the user side. It will be included in the status and result responses during that session. The default is None.
compatibleWithSync
Result format compatibility
compatibleWithSync=<True|False>
Formats the results in a way compatible with the Synchronous HTTP interface. The default is False.
speakerDiarization
Speaker diarization enable option
speakerDiarization=<True|False>
Enables speaker diarization. The default is False. For details, please see Speaker Diarization.
diarizationMinSpeaker
Minimum estimated number of speakers for diarization
diarizationMinSpeaker=<int>
Valid only when speaker diarization is enabled. Specifies the minimum number of speakers expected in the audio. It must be set to 1 or higher. The default is 1. For details, please see Speaker Diarization.
diarizationMaxSpeaker
Maximum estimated number of speakers for diarization
diarizationMaxSpeaker=<int>
Valid only when speaker diarization is enabled. Specifies the maximum number of speakers expected in the audio. It must be set to a value equal to or greater than diarizationMinSpeaker. The default is 10. For details, please see Speaker Diarization.
sentimentAnalysis
Sentiment analysis enable option
sentimentAnalysis=<True|False>
Enables sentiment analysis. The default is False.
For details, please see Sentiment Analysis.
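Putting several of these together: a sketch of an Asynchronous HTTP request. The endpoint URL is an assumption (it does not appear in this section; check the Asynchronous HTTP interface documentation for the actual URL), and the parameter values are illustrative:

```
# Assumed async endpoint; verify against the interface documentation.
curl https://acp-api-async.amivoice.com/v1/recognitions \
  -F u={APP_KEY} \
  -F d="grammarFileNames=-a-general speakerDiarization=True diarizationMinSpeaker=1 diarizationMaxSpeaker=4 sentimentAnalysis=True contentId=meeting-001" \
  -F a=@test.wav
```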