
Audio Formats

This document explains the audio data formats that can be handled by the AmiVoice API, and how to configure them in the request parameters.

Supported Audio Formats

This section describes the audio formats supported by the AmiVoice API.

Encoding

  • Signed 16-bit PCM (little-endian, big-endian)
  • A-law (8-bit)
  • mu-law (8-bit)

Sampling Rate

Sampling rates of 8kHz, 11.025kHz, 16kHz, 22.05kHz, 32kHz, 44.1kHz, and 48kHz are supported. A-law and mu-law formats are only supported at 8kHz.

note

In this document, 11.025kHz and 22.05kHz may also be referred to as 11kHz and 22kHz respectively.

The AmiVoice API provides two types of speech recognition engines: those supporting 8kHz audio and those supporting 16kHz audio. The 8kHz engines are mainly used for telephone audio, while the 16kHz engines cover the audio used in most other applications. The sampling rates accepted by each type of engine are shown in the following table.

Speech Recognition Engine                  | Supported Sampling Rates
Speech recognition engine supporting 8kHz  | 8kHz, 11.025kHz
Speech recognition engine supporting 16kHz | 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48kHz

Only some speech recognition engines support 8kHz. For details, please see List of Speech Recognition Engines.
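
If you are not sure which engine type matches your audio, you can check the sampling rate and channel count of a file before sending it. A minimal sketch using the ffprobe tool that ships with FFmpeg (an assumption; FFmpeg is not part of the AmiVoice API, and input.wav is an example file name):

# Print the sample rate and channel count of an audio file
ffprobe -v error -show_entries stream=sample_rate,channels \
  -of default=noprint_wrappers=1 input.wav

For a 16kHz mono file, this should print something like sample_rate=16000 and channels=1.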

tip

Unlike songs or musical instrument performances, speech recognition generally does not need the information contained in frequency bands that require sampling rates above 16kHz. Even if you send audio sampled at a higher rate, it will be downsampled to 16kHz before processing, so there is no need to use a sampling rate higher than 16kHz. We recommend sending audio data at an appropriate sampling rate to conserve network bandwidth and reduce transmission time.

Similarly, when using a speech recognition engine that supports 8kHz audio, 11kHz audio is downsampled to 8kHz before processing.

Please also see the AmiVoice TechBlog article "What sampling rate is necessary for speech recognition?" for more information.

Number of Channels

1 or 2 channels are supported.

2 channels (stereo) are supported only for audio files whose headers contain audio format information, such as Wave, Ogg, and FLAC. However, even for stereo audio, only the first channel is used for speech recognition.

tip

If the audio source is stereo and you want to perform speech recognition on both channels, please make a separate speech recognition request for each channel.

Please also see the AmiVoice TechBlog article "How to convert stereo audio files into two mono audio files" for more information.

Audio Compression

Speex, Opus, MP3, and FLAC are supported.

tip

Compression strong enough to make the audio difficult even for human ears to understand can affect recognition accuracy. The following are guideline settings for each compression method:

Compression Method | Guideline
Speex              | quality 7 or higher
Opus               | compression ratio of about 1/10
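
For illustration, the commands below encode a 16kHz mono WAV file within these guidelines. This is a rough sketch assuming the speexenc tool and an FFmpeg build with libopus are installed (neither is part of the AmiVoice API, and the file names are examples). For 16kHz 16-bit mono PCM (about 256kbps), a compression ratio of about 1/10 corresponds to a bitrate of roughly 25kbps:

# Ogg Speex at quality 7
speexenc --quality 7 test-16k.wav test-16k.spx
# Ogg Opus at roughly 1/10 of the raw PCM bitrate
ffmpeg -i test-16k.wav -c:a libopus -b:a 24k test-16k-opus.ogg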

File Formats

Wave (WAV), Ogg, MP3, and FLAC are supported. In some cases, specifying the audio format in the request parameters may not be necessary because the audio format is described in the file header. For details, please see the next section on How to Set Audio Format.

How to Set Audio Format

When making a speech recognition request, you need to specify the audio format of the audio data being sent. Container files such as Wave (WAV), Ogg, and FLAC have the audio format described in the file header. This section explains how to send such "header-included" audio files, and how to send "headerless" audio data that does not include audio format information in a file header.

danger

Please specify the correct audio format. Incorrect settings may result in no results at all, or decreased accuracy in speech recognition and speaker diarization.

tip

The audio format names mentioned below are case-insensitive. LSB16K and lsb16k are the same.

Audio Formats with Headers

When the audio format is described in the file header, set the audio format as follows:

Interface                     | How to Set Audio Format
Synchronous/Asynchronous HTTP | You can omit specifying the audio format.
WebSocket                     | For the first argument of the s command, set 8K if the sampling rate of the audio data is 8kHz/11kHz, and 16K if it is 16kHz or higher.

The supported audio formats and how to set them are summarized in the table below.

File Format | Encoding/Audio Compression                           | Sampling Frequency | Number of Channels | Audio Format Name
Wave        | PCM Signed 16-bit little-endian (formatTag: 0x0001)  | 8kHz, 11kHz        | 1 or 2             | 8K
Wave        | PCM Signed 16-bit little-endian (formatTag: 0x0001)  | 16kHz or higher    | 1 or 2             | 16K
Wave        | mu-Law 8-bit (formatTag: 0x0007)                     | 8kHz               | 1 or 2             | 8K
Wave        | A-Law 8-bit (formatTag: 0x0006)                      | 8kHz               | 1 or 2             | 8K
Ogg         | Speex                                                | 8kHz, 11kHz        | 1 or 2             | 8K
Ogg         | Speex                                                | 16kHz or higher    | 1 or 2             | 16K
Ogg         | Opus                                                 | 8kHz, 11kHz        | 1 or 2             | 8K
Ogg         | Opus                                                 | 16kHz or higher    | 1 or 2             | 16K
MP3         | MP3                                                  | 8kHz, 11kHz        | 1 or 2             | 8K
MP3         | MP3                                                  | 16kHz or higher    | 1 or 2             | 16K
FLAC        | FLAC                                                 | 8kHz, 11kHz        | 1 or 2             | 8K
FLAC        | FLAC                                                 | 16kHz or higher    | 1 or 2             | 16K
note

For audio data with 2 channels, only the first channel is used for speech recognition.

Example for Synchronous HTTP Interface

WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:

curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
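
Other header-included formats work the same way; for example, a FLAC file can be sent simply by changing the file passed to the a parameter (test-16k.flac is an example file name):

curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.flac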

Example for Asynchronous HTTP Interface

WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:

curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
note

Except for the endpoint, this is the same as the synchronous HTTP interface.

Example for WebSocket Interface

When using a 16kHz engine, set the first parameter of the s command as follows:

s 16K -a-general

When the sampling rate is 8kHz and you are using an engine that also supports 8kHz, such as the general-purpose conversation engine 会話_汎用 (-a-general), set it as follows:

s 8K -a-general
caution

When sending audio data with headers using the WebSocket interface, do not specify any audio format other than 8K or 16K. If you instead specify one of the headerless audio format names used for raw PCM data, such as LSB16K (described later), the data will be processed according to that setting. In most cases, this will result in either no speech being detected or results that are completely different from the actual speech.

Headerless Audio Formats

When sending "headerless" audio data such as raw PCM data, set the audio format as follows:

Interface                     | How to Set Audio Format
Synchronous/Asynchronous HTTP | Specify an audio format name such as LSB16K in the c parameter.
WebSocket                     | Specify an audio format name such as LSB16K as the first argument of the s command.

The audio format names corresponding to encoding and sampling rates are as follows:

Encoding                        | Sampling Frequency | Number of Channels | Audio Format Name
PCM Signed 16-bit little-endian | 8kHz               | 1                  | LSB8K
PCM Signed 16-bit little-endian | 11kHz              | 1                  | LSB11K
PCM Signed 16-bit little-endian | 16kHz              | 1                  | LSB16K
PCM Signed 16-bit little-endian | 22kHz              | 1                  | LSB22K
PCM Signed 16-bit little-endian | 32kHz              | 1                  | LSB32K
PCM Signed 16-bit little-endian | 44.1kHz            | 1                  | LSB44K
PCM Signed 16-bit little-endian | 48kHz              | 1                  | LSB48K
PCM Signed 16-bit big-endian    | 8kHz               | 1                  | MSB8K
PCM Signed 16-bit big-endian    | 11kHz              | 1                  | MSB11K
PCM Signed 16-bit big-endian    | 16kHz              | 1                  | MSB16K
PCM Signed 16-bit big-endian    | 22kHz              | 1                  | MSB22K
PCM Signed 16-bit big-endian    | 32kHz              | 1                  | MSB32K
PCM Signed 16-bit big-endian    | 44.1kHz            | 1                  | MSB44K
PCM Signed 16-bit big-endian    | 48kHz              | 1                  | MSB48K
mu-Law 8-bit                    | 8kHz               | 1                  | MULAW
A-Law 8-bit                     | 8kHz               | 1                  | ALAW
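
Headerless PCM data of this kind can be produced from an existing audio file before sending. A minimal sketch using FFmpeg (an assumption; it is not part of the AmiVoice API, and the file names are examples), writing 16kHz 16-bit mono little-endian raw PCM and 8kHz mu-Law data:

# 16kHz, 16-bit, mono, little-endian raw PCM -> send with audio format LSB16K
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 test-16k.pcm
# 8kHz mu-Law raw data -> send with audio format MULAW
ffmpeg -i input.wav -f mulaw -ar 8000 -ac 1 test-8k.ulaw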

Example for Synchronous HTTP Interface

Here's an example using curl to send PCM data with a sampling rate of 16kHz, 16-bit quantization, mono, and little-endian. In this case, specify LSB16K in the c parameter.

curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm

Example for Asynchronous HTTP Interface

Here's an example using curl to send PCM data with a sampling rate of 16kHz, 16-bit quantization, mono, and little-endian. In this case, specify LSB16K in the c parameter.

curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm
note

Except for the endpoint, this is the same as the synchronous HTTP interface.

Example for WebSocket Interface

Here we're sending PCM data with a sampling rate of 16kHz, 16-bit quantization, mono, and little-endian. In this case, specify LSB16K in the s command.

s LSB16K -a-general
note

The s command in the WebSocket interface also sets other parameters. For details, please see Starting a Recognition Request and the reference for s Command Response Packet.

caution

If you specify an audio format such as 8K or 16K (which are for audio files with headers) for headerless audio, the server will attempt to read a header and fail. When header reading fails, the data is treated as LSB16K. When sending headerless audio data, always specify one of the format names listed in the Headerless Audio Formats section.