Audio Formats
This document explains the audio data formats that can be handled by the AmiVoice API, and how to configure them in the request parameters.
Supported Audio Formats
This section describes the audio formats supported by the AmiVoice API.
Encoding
- Signed 16-bit PCM (little-endian, big-endian)
- A-law (8-bit)
- mu-law (8-bit)
Sampling Rate
Sampling rates of 8kHz, 11.025kHz, 16kHz, 22.05kHz, 32kHz, 44.1kHz, and 48kHz are supported. A-law and mu-law formats are only supported at 8kHz.
In this document, 11.025kHz and 22.05kHz may also be referred to as 11kHz and 22kHz respectively.
The speech recognition engines used for speech recognition processing in the AmiVoice API are available in two types, supporting sampling rates of 8kHz and 16kHz. The 8kHz engine is mainly used for telephone audio, while the 16kHz engine is prepared for audio widely used in other applications. The sampling rates corresponding to each speech recognition engine are shown in the following table.
Speech Recognition Engine | Supported Sampling Rates |
---|---|
Speech recognition engine supporting 8kHz | 8kHz, 11.025kHz |
Speech recognition engine supporting 16kHz | 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48kHz |
Only some speech recognition engines support 8kHz. For details, please see List of Speech Recognition Engines.
Unlike songs or musical instrument performances, speech recognition generally does not benefit from the extra audio content captured by sampling rates above 16kHz. Even if you send audio sampled at a higher rate than 16kHz, it will be downsampled to 16kHz before processing, so there is no need to set the sampling rate higher than 16kHz. We recommend sending audio data at an appropriate sampling rate to conserve network bandwidth and reduce transmission time.
Similarly, when using a speech recognition engine that supports 8kHz audio, 11kHz audio is downsampled to 8kHz before processing.
Please also see the AmiVoice TechBlog article "What sampling rate is necessary for speech recognition?" for more information.
Number of Channels
1 or 2 channels are supported.
2 channels (stereo) are only supported for audio files whose headers contain audio format information, such as Wave, Ogg, and FLAC. However, for stereo audio, only the first channel is used for speech recognition.
To perform speech recognition on both channels of a stereo source, make a separate speech recognition request for each channel.
Please also see the AmiVoice TechBlog article "How to convert stereo audio files into two mono audio files" for more information.
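As a minimal sketch of the channel-splitting step above, the following uses only Python's standard library `wave` module to de-interleave a stereo PCM WAV file into two mono files that can each be sent as a separate request. The function name and file paths are illustrative, not part of the AmiVoice API.

```python
# Split a stereo PCM WAV into two mono WAV files (stdlib only).
import wave

def split_stereo(src_path: str, left_path: str, right_path: str) -> None:
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2, "expected a stereo file"
        sampwidth = src.getsampwidth()   # bytes per sample (2 for 16-bit PCM)
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Frames are interleaved: [L0, R0, L1, R1, ...]; de-interleave by sample width.
    step = sampwidth * 2
    left = b"".join(frames[i:i + sampwidth] for i in range(0, len(frames), step))
    right = b"".join(frames[i + sampwidth:i + step] for i in range(0, len(frames), step))

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(sampwidth)
            out.setframerate(framerate)
            out.writeframes(data)
```

Note that the `wave` module handles uncompressed PCM WAV only; for Ogg or FLAC sources, use an external tool or library to split channels.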
Audio Compression
Speex, Opus, MP3, and FLAC are supported.
Compression strong enough that the audio becomes difficult even for a human listener to understand can degrade recognition accuracy. The following are guideline settings for the supported compression methods:
Compression Method | Guideline |
---|---|
Speex | quality 7 or higher |
Opus | Compression ratio of about 1/10 |
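As a rough worked example of the 1/10 guideline (our own arithmetic, not an official formula): uncompressed 16kHz, 16-bit mono PCM is 256 kbit/s, so an Opus bitrate in the neighborhood of 25 kbit/s keeps the compression ratio near 1/10.

```python
# Illustrative arithmetic behind the "about 1/10" Opus guideline.
def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

raw = pcm_bitrate(16000, 16, 1)   # 256_000 bit/s for 16kHz 16-bit mono
target = raw // 10                # about 25_600 bit/s keeps the ratio near 1/10
```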
File Formats
Wave (WAV), Ogg, MP3, and FLAC are supported. In some cases, specifying the audio format in the request parameters may not be necessary because the audio format is described in the file header. For details, please see the next section on How to Set Audio Format.
How to Set Audio Format
When making a speech recognition request, you need to specify the audio format of the audio data being sent. Container files such as Wave (WAV), Ogg, and FLAC describe the audio format in the file header. This section explains how to send such "header-included" audio files, and how to send "headerless" audio data that carries no audio format information in a file header.
Please specify the correct audio format. Incorrect settings may result in no results at all, or decreased accuracy in speech recognition and speaker diarization.
The audio format names mentioned below are case-insensitive: `LSB16K` and `lsb16k` are the same.
Audio Formats with Headers
When the audio format is described in the file header, set the audio format as follows:
Interface | How to Set Audio Format |
---|---|
Synchronous/Asynchronous HTTP | You can omit specifying the audio format |
WebSocket | As the first argument of the s command, set `8K` if the sampling rate of the audio data is 8kHz/11kHz, and `16K` if it is 16kHz or higher. |
The supported audio formats and how to set them are summarized in the table below.
File Format | Encoding/Audio Compression | Sampling Frequency | Number of Channels | Audio Format Name |
---|---|---|---|---|
Wave | PCM Signed 16-bit little-endian (* formatTag: 0x0001) | 8kHz, 11kHz | 1 or 2 | 8K |
Wave | PCM Signed 16-bit little-endian (* formatTag: 0x0001) | 16kHz or higher | 1 or 2 | 16K |
Wave | mu-Law 8-bit (* formatTag: 0x0007) | 8kHz | 1 or 2 | 8K |
Wave | A-Law 8-bit (* formatTag: 0x0006) | 8kHz | 1 or 2 | 8K |
Ogg | Speex | 8kHz, 11kHz | 1 or 2 | 8K |
Ogg | Speex | 16kHz or higher | 1 or 2 | 16K |
Ogg | Opus | 8kHz, 11kHz | 1 or 2 | 8K |
Ogg | Opus | 16kHz or higher | 1 or 2 | 16K |
MP3 | MP3 | 8kHz, 11kHz | 1 or 2 | 8K |
MP3 | MP3 | 16kHz or higher | 1 or 2 | 16K |
FLAC | FLAC | 8kHz, 11kHz | 1 or 2 | 8K |
FLAC | FLAC | 16kHz or higher | 1 or 2 | 16K |
For audio data with 2 channels, only the first channel is used for speech recognition.
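For the WebSocket interface, all you need from a header-included file is whether the sampling rate falls in the 8kHz band (8kHz/11kHz) or the 16kHz band (16kHz or higher). A minimal sketch of reading that from a standard PCM WAV header with Python's stdlib `wave` module; the helper name is our own, and the `wave` module only reads uncompressed PCM WAV files:

```python
# Pick the WebSocket s-command format name ("8K" or "16K") from a WAV header.
import wave

def ws_audio_format(path: str) -> str:
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
    # 8kHz and 11.025kHz map to the 8kHz engine band; 16kHz and above to 16kHz.
    return "8K" if rate < 16000 else "16K"
```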
Example for Synchronous HTTP Interface
WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
Example for Asynchronous HTTP Interface
WAV files contain audio format information, so the audio format is automatically detected and doesn't need to be specified. Here's an example using the curl command:
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F a=@test-16k.wav
Except for the endpoint, this is the same as the synchronous HTTP interface.
Example for WebSocket Interface
When using a 16kHz engine, set the first parameter of the `s` command as follows:
s 16K -a-general
For 8kHz audio, when using an engine that also supports 8kHz, such as "会話_汎用 (`-a-general`)", set it as follows:
s 8K -a-general
When sending audio data with headers using the WebSocket interface, do not specify any audio format other than `8K` or `16K`. If you set a string such as `LSB16K`, one of the headerless audio format names used for raw PCM data (described later), the audio will be processed according to that format instead of the header. In most cases, this results in no speech being detected, or in results completely different from the actual speech.
Headerless Audio Formats
When sending "headerless" audio data such as raw PCM data, set the audio format as follows:
Interface | How to Set Audio Format |
---|---|
Synchronous/Asynchronous HTTP | Specify an audio format name like LSB16K in the c parameter. |
WebSocket | Specify an audio format name like LSB16K as the first argument of the s command. |
The audio format names corresponding to encoding and sampling rates are as follows:
Encoding | Sampling Frequency | Number of Channels | Audio Format Name |
---|---|---|---|
PCM Signed 16-bit little-endian | 8kHz | 1 | LSB8K |
PCM Signed 16-bit little-endian | 11kHz | 1 | LSB11K |
PCM Signed 16-bit little-endian | 16kHz | 1 | LSB16K |
PCM Signed 16-bit little-endian | 22kHz | 1 | LSB22K |
PCM Signed 16-bit little-endian | 32kHz | 1 | LSB32K |
PCM Signed 16-bit little-endian | 44.1kHz | 1 | LSB44K |
PCM Signed 16-bit little-endian | 48kHz | 1 | LSB48K |
PCM Signed 16-bit big-endian | 8kHz | 1 | MSB8K |
PCM Signed 16-bit big-endian | 11kHz | 1 | MSB11K |
PCM Signed 16-bit big-endian | 16kHz | 1 | MSB16K |
PCM Signed 16-bit big-endian | 22kHz | 1 | MSB22K |
PCM Signed 16-bit big-endian | 32kHz | 1 | MSB32K |
PCM Signed 16-bit big-endian | 44.1kHz | 1 | MSB44K |
PCM Signed 16-bit big-endian | 48kHz | 1 | MSB48K |
mu-Law 8-bit | 8kHz | 1 | MULAW |
A-Law 8-bit | 8kHz | 1 | ALAW |
Example for Synchronous HTTP Interface
Here's an example using curl to send PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `c` parameter.
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm
Example for Asynchronous HTTP Interface
Here's an example using curl to send PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `c` parameter.
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d=grammarFileNames=-a-general \
-F c=LSB16K \
-F a=@test-16k.pcm
Except for the endpoint, this is the same as the synchronous HTTP interface.
Example for WebSocket Interface
Here we're sending PCM data sampled at 16kHz, 16-bit, mono, little-endian. In this case, specify `LSB16K` in the `s` command.
s LSB16K -a-general
The `s` command in the WebSocket interface also sets other parameters. For details, please see Starting a Recognition Request and the reference for the s Command Response Packet.
If you specify an audio format such as `8K` or `16K` (which are for audio files with headers) for headerless audio, the API will attempt to read a header and fail. When header reading fails, the data is treated as `LSB16K`. When sending headerless audio data, always specify one of the strings listed in the Headerless Audio Formats section.
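If a capture source produces big-endian PCM (`MSB16K`-style data) and you prefer to submit little-endian (`LSB16K`), or vice versa, a 16-bit byte swap is all that is needed. A sketch using the stdlib `array` module (the function name is our own):

```python
# Swap the byte order of 16-bit PCM samples (MSB16K <-> LSB16K).
from array import array

def swap16(pcm: bytes) -> bytes:
    samples = array("h")     # signed 16-bit samples
    samples.frombytes(pcm)
    samples.byteswap()       # reverse the two bytes of every sample
    return samples.tobytes()
```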