Audio Formats
This document explains the audio data formats that can be handled by the AmiVoice API, and how to configure them in the request parameters.
Supported Audio Formats
This section describes the audio formats supported by the AmiVoice API.
Encoding
- Signed 16-bit PCM (little-endian, big-endian)
- A-law (8-bit)
- mu-law (8-bit)
Sampling Rate
Sampling rates of 8kHz, 11.025kHz, 16kHz, 22.05kHz, 32kHz, 44.1kHz, and 48kHz are supported. A-law and mu-law formats are only supported at 8kHz.
In this document, 11.025kHz and 22.05kHz may also be referred to as 11kHz and 22kHz respectively.
The speech recognition engines used for speech recognition processing in the AmiVoice API are available in two types, supporting sampling rates of 8kHz and 16kHz. The 8kHz engine is mainly used for telephone audio, while the 16kHz engine is prepared for audio widely used in other applications. The sampling rates corresponding to each speech recognition engine are shown in the following table.
Speech Recognition Engine | Supported Sampling Rates |
---|---|
Speech recognition engine supporting 8kHz | 8kHz, 11.025kHz |
Speech recognition engine supporting 16kHz | 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48kHz |
Only some speech recognition engines support 8kHz. For details, please see List of Speech Recognition Engines.
Unlike songs or musical instrument performances, speech recognition generally does not benefit from the extra audio content captured by sampling rates above 16kHz. Even if you send audio sampled at a higher rate than 16kHz, it will be downsampled to 16kHz before processing, so there is no need to set the sampling rate higher than 16kHz. We recommend sending audio data at an appropriate sampling rate to conserve network bandwidth and reduce transmission time.
Similarly, when using a speech recognition engine that supports 8kHz audio, 11kHz audio is downsampled to 8kHz before processing.
Please also see the AmiVoice TechBlog article "What sampling rate is necessary for speech recognition?" for more information.
Number of Channels
1 or 2 channels are supported.
2 channels (stereo) are only supported for audio files whose headers contain audio format information, such as Wave, Ogg, and FLAC. However, for stereo audio, only the first channel is used for speech recognition.
To perform speech recognition on both channels of a stereo source, make a separate speech recognition request for each channel.
Please also see the AmiVoice TechBlog article "How to convert stereo audio files into two mono audio files" for more information.
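As a minimal sketch of the channel-splitting step above, the following uses only Python's standard library `wave` module to de-interleave a stereo PCM WAV file into two mono files that can each be sent as a separate request. The function name and file paths are illustrative, not part of the AmiVoice API.

```python
# Split a stereo PCM WAV into two mono WAV files (stdlib only).
import wave

def split_stereo(src_path: str, left_path: str, right_path: str) -> None:
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2, "expected a stereo file"
        sampwidth = src.getsampwidth()   # bytes per sample (2 for 16-bit PCM)
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Frames are interleaved: [L0, R0, L1, R1, ...]; de-interleave by sample width.
    step = sampwidth * 2
    left = b"".join(frames[i:i + sampwidth] for i in range(0, len(frames), step))
    right = b"".join(frames[i + sampwidth:i + step] for i in range(0, len(frames), step))

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(sampwidth)
            out.setframerate(framerate)
            out.writeframes(data)
```

Note that the `wave` module handles uncompressed PCM WAV only; for Ogg or FLAC sources, use an external tool or library to split channels.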
Audio Compression
Speex, Opus, MP3, and FLAC are supported.
Compression strong enough that the audio becomes difficult even for a human listener to understand can degrade recognition accuracy. The following are guideline settings for the supported compression methods:
Compression Method | Guideline |
---|---|
Speex | quality 7 or higher |
Opus | Compression ratio of about 1/10 |
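As a rough worked example of the 1/10 guideline (our own arithmetic, not an official formula): uncompressed 16kHz, 16-bit mono PCM is 256 kbit/s, so an Opus bitrate in the neighborhood of 25 kbit/s keeps the compression ratio near 1/10.

```python
# Illustrative arithmetic behind the "about 1/10" Opus guideline.
def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

raw = pcm_bitrate(16000, 16, 1)   # 256_000 bit/s for 16kHz 16-bit mono
target = raw // 10                # about 25_600 bit/s keeps the ratio near 1/10
```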
File Formats
Wave (WAV), Ogg, MP3, and FLAC are supported. In some cases, specifying the audio format in the request parameters may not be necessary because the audio format is described in the file header. For details, please see the next section on How to Set Audio Format.
How to Set Audio Format
When making a speech recognition request, you need to specify the audio format of the audio data being sent. Container files such as Wave (WAV), Ogg, and FLAC describe the audio format in the file header. This section explains how to send such "header-included" audio files, and how to send "headerless" audio data that carries no audio format information in a file header.
Please specify the correct audio format. Incorrect settings may result in no results at all, or decreased accuracy in speech recognition and speaker diarization.
The audio format names mentioned below are case-insensitive: `LSB16K` and `lsb16k` are the same.
Audio Formats with Headers
When the audio format is described in the file header, set the audio format as follows:
Interface | How to Set Audio Format |
---|---|
Synchronous/Asynchronous HTTP | You can omit specifying the audio format |
WebSocket | As the first argument of the s command, set `8K` if the sampling rate of the audio data is 8kHz/11kHz, and `16K` if it is 16kHz or higher. |
The supported audio formats and how to set them are summarized in the table below.
File Format | Encoding/Audio Compression | Sampling Frequency | Number of Channels | Audio Format Name |
---|---|---|---|---|
Wave | PCM Signed 16-bit little-endian (* formatTag: 0x0001) | 8kHz, 11kHz | 1 or 2 | 8K |
Wave | PCM Signed 16-bit little-endian (* formatTag: 0x0001) | 16kHz or higher | 1 or 2 | 16K |
Wave | mu-Law 8-bit (* formatTag: 0x0007) | 8kHz | 1 or 2 | 8K |
Wave | A-Law 8-bit (* formatTag: 0x0006) | 8kHz | 1 or 2 | 8K |
Ogg | Speex | 8kHz, 11kHz | 1 or 2 | 8K |
Ogg | Speex | 16kHz or higher | 1 or 2 | 16K |
Ogg | Opus | 8kHz, 11kHz | 1 or 2 | 8K |
Ogg | Opus | 16kHz or higher | 1 or 2 | 16K |
MP3 | MP3 | 8kHz, 11kHz | 1 or 2 | 8K |
MP3 | MP3 | 16kHz or higher | 1 or 2 | 16K |
FLAC | FLAC | 8kHz, 11kHz | 1 or 2 | 8K |
FLAC | FLAC | 16kHz or higher | 1 or 2 | 16K |
For audio data with 2 channels, only the first channel is used for speech recognition.
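For the WebSocket interface, all you need from a header-included file is whether the sampling rate falls in the 8kHz band (8kHz/11kHz) or the 16kHz band (16kHz or higher). A minimal sketch of reading that from a standard PCM WAV header with Python's stdlib `wave` module; the helper name is our own, and the `wave` module only reads uncompressed PCM WAV files:

```python
# Pick the WebSocket s-command format name ("8K" or "16K") from a WAV header.
import wave

def ws_audio_format(path: str) -> str:
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
    # 8kHz and 11.025kHz map to the 8kHz engine band; 16kHz and above to 16kHz.
    return "8K" if rate < 16000 else "16K"
```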
Example for Synchronous HTTP Interface
WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
Example for Asynchronous HTTP Interface
WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
Except for the endpoint, this is the same as the synchronous HTTP interface.
Example for WebSocket Interface
When using a 16kHz engine, set the first parameter of the `s` command as follows:
s 16K -a-general
For 8kHz audio, when using an engine that also supports 8kHz, such as "会話_汎用 (`-a-general`)", set it as follows:
s 8K -a-general
When sending audio data with headers using the WebSocket interface, do not specify any audio format other than `8K` or `16K`. If you set a string such as `LSB16K`, one of the headerless audio format names used for raw PCM data (described later), the audio will be processed according to that format instead of the header. In most cases, this results in no speech being detected, or in results completely different from the actual speech.
Headerless Audio Formats
When sending "headerless" audio data such as raw PCM data, set the audio format as follows:
Interface | How to Set Audio Format |
---|---|
Synchronous/Asynchronous HTTP | Specify an audio format name like LSB16K in the c parameter. |
WebSocket | Specify an audio format name like LSB16K as the first argument of the s command. |
The audio format names corresponding to encoding and sampling rates are as follows:
Encoding | Sampling Frequency | Number of Channels | Audio Format Name |
---|---|---|---|
PCM Signed 16-bit little-endian | 8kHz | 1 | LSB8K |
PCM Signed 16-bit little-endian | 11kHz | 1 | LSB11K |
PCM Signed 16-bit little-endian | 16kHz | 1 | LSB16K |
PCM Signed 16-bit little-endian | 22kHz | 1 | LSB22K |
PCM Signed 16-bit little-endian | 32kHz | 1 | LSB32K |
PCM Signed 16-bit little-endian | 44.1kHz | 1 | LSB44K |
PCM Signed 16-bit little-endian | 48kHz | 1 | LSB48K |
PCM Signed 16-bit big-endian | 8kHz | 1 | MSB8K |
PCM Signed 16-bit big-endian | 11kHz | 1 | MSB11K |
PCM Signed 16-bit big-endian | 16kHz | 1 | MSB16K |
PCM Signed 16-bit big-endian | 22kHz | 1 | MSB22K |
PCM Signed 16-bit big-endian | 32kHz | 1 | MSB32K |
PCM Signed 16-bit big-endian | 44.1kHz | 1 | MSB44K |
PCM Signed 16-bit big-endian | 48kHz | 1 | MSB48K |
mu-Law 8-bit | 8kHz | 1 | MULAW |
A-Law 8-bit | 8kHz | 1 | ALAW |
Example for Synchronous HTTP Interface
Here's an example using curl to send PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `c` parameter.
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm
Example for Asynchronous HTTP Interface
Here's an example using curl to send PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `c` parameter.
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm
Except for the endpoint, this is the same as the synchronous HTTP interface.
Example for WebSocket Interface
Here we're sending PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `s` command.
s LSB16K -a-general
The `s` command in the WebSocket interface also sets other parameters. For details, please see Starting a Recognition Request and the reference for the s Command Response Packet.
If you specify an audio format such as `8K` or `16K` (which are for audio files with headers) for headerless audio, the API will attempt to read a header and fail. When header reading fails, the data is treated as `LSB16K`. When sending headerless audio data, always specify one of the strings listed in the Headerless Audio Formats section.
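If a capture source produces big-endian PCM (`MSB16K`-style data) and you prefer to submit little-endian (`LSB16K`), or vice versa, a 16-bit byte swap is all that is needed. A sketch using the stdlib `array` module (the function name is our own):

```python
# Swap the byte order of 16-bit PCM samples (MSB16K <-> LSB16K).
from array import array

def swap16(pcm: bytes) -> bytes:
    samples = array("h")     # signed 16-bit samples
    samples.frombytes(pcm)
    samples.byteswap()       # reverse the two bytes of every sample
    return samples.tobytes()
```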