Speech Recognition Engines
AmiVoice API provides multiple speech recognition engines tailored for various languages and purposes. By selecting the most suitable speech recognition engine for the audio you want to recognize, you can improve accuracy. This section explains the languages supported by the speech recognition engines, types of engines, and key points for choosing the appropriate one.
List of Speech Recognition Engines
AmiVoice API offers various speech recognition engines. Please also see Difference between End to End and Hybrid.
End to End
This is a new generation of speech recognition engines.
Language | Engine Name | Supported Sampling Rates | Connection Engine Name |
---|---|---|---|
Japanese | 日本語E2E_汎用 | 8k / 16k | -a2-ja-general |
Chinese | 中国語E2E_汎用 | 8k / 16k | -a2-zh-general |
Multilingual | 多言語E2E_汎用 | 8k / 16k | -a2-multi-general |
Japanese | 日本語E2E_汎用バッチ | 8k / 16k | -a2b-ja-general |
Chinese | 中国語E2E_汎用バッチ | 8k / 16k | -a2b-zh-general |
Multilingual | 多言語E2E_汎用バッチ | 8k / 16k | -a2b-multi-general |
- The multilingual engine can transcribe audio containing multiple languages, each in its respective language. The supported languages are Japanese, English, and Chinese.
- Batch engines are optimized for batch processing where response speed is not critical. Use these engines when you prioritize accuracy. In particular, specify batch engines for asynchronous HTTP interfaces.
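As a minimal sketch of how the End to End table above could be used in client code, the lookup below maps a language and a batch flag to a connection engine name. The keys "ja", "zh", and "multi" and the function name `e2e_engine` are hypothetical labels chosen for this example; only the engine names themselves come from the table.

```python
# Connection engine names copied from the End to End table above.
# The ("ja" / "zh" / "multi", batch) keys are illustrative labels only.
E2E_ENGINES = {
    ("ja", False): "-a2-ja-general",
    ("zh", False): "-a2-zh-general",
    ("multi", False): "-a2-multi-general",
    ("ja", True): "-a2b-ja-general",
    ("zh", True): "-a2b-zh-general",
    ("multi", True): "-a2b-multi-general",
}


def e2e_engine(language: str, batch: bool = False) -> str:
    """Return the End to End connection engine name for a language key."""
    try:
        return E2E_ENGINES[(language, batch)]
    except KeyError:
        raise ValueError(f"no End to End engine for language {language!r}") from None


print(e2e_engine("ja"))                 # -a2-ja-general
print(e2e_engine("multi", batch=True))  # -a2b-multi-general
```

Passing `batch=True` selects the batch variant, which the notes above recommend when accuracy matters more than response speed.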
Hybrid
These are speech recognition engines optimized for various domains.
Language | Engine Name | Language Model | Supported Sampling Rates | Connection Engine Name |
---|---|---|---|---|
Japanese | 会話_汎用 | General | 8k / 16k | -a-general |
Japanese | 会話_医療 | Medical Conference | 16k | -a-medical |
Japanese | 会話_金融 | Finance | 16k | -a-bizfinance |
Japanese | 会話_保険 | Insurance | 16k | -a-bizinsurance |
Japanese | 音声入力_汎用 | Large-scale General | 16k | -a-general-input |
Japanese | 音声入力_医療 | Medical General | 16k | -a-medical-input |
Japanese | 音声入力_保険 | Insurance | 16k | -a-bizinsurance-input |
Japanese | 音声入力_金融 | Finance | 16k | -a-bizfinance-input |
English | 英語_汎用 | General | 8k / 16k | -a-general-en |
Chinese | 中国語_汎用 | General | 8k / 16k | -a-general-zh |
Korean | 韓国語_汎用 | General | 8k / 16k | -a-general-ko |
Japanese | 音声入力_氏名 | Name | 8k / 16k | -a-name-input-private (*1) |
Japanese | 音声入力_住所 | Address | 8k / 16k | -a-address-input-private (*1) |
Japanese | 音声入力_ルール | None | 16k | -a-rule-input-private (*1) (*2) |
- (*1) These engines are available in AmiVoice API Private.
- (*2) The "音声入力_ルール" engine is not available for asynchronous HTTP interfaces.
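The sampling rate column and the footnotes of the Hybrid table can be encoded as simple sets so that a client can check a combination before sending audio. This is a sketch only: the function names are hypothetical, and the sets simply restate the table above.

```python
# Hybrid engines listed with "8k / 16k" in the table above.
HYBRID_8K_ENGINES = {
    "-a-general",
    "-a-general-en",
    "-a-general-zh",
    "-a-general-ko",
    "-a-name-input-private",
    "-a-address-input-private",
}

# Hybrid engines listed with "16k" only.
HYBRID_16K_ONLY_ENGINES = {
    "-a-medical",
    "-a-bizfinance",
    "-a-bizinsurance",
    "-a-general-input",
    "-a-medical-input",
    "-a-bizinsurance-input",
    "-a-bizfinance-input",
    "-a-rule-input-private",
}

# Footnote (*2): not available for asynchronous HTTP interfaces.
ASYNC_UNAVAILABLE = {"-a-rule-input-private"}


def supports_rate(engine: str, rate_hz: int) -> bool:
    """Return True if the hybrid engine accepts audio at rate_hz."""
    if engine in HYBRID_8K_ENGINES:
        return rate_hz in (8000, 16000)
    if engine in HYBRID_16K_ONLY_ENGINES:
        return rate_hz == 16000
    raise ValueError(f"unknown hybrid engine: {engine!r}")


def usable_with_async(engine: str) -> bool:
    """Return False for engines excluded from asynchronous HTTP by (*2)."""
    return engine not in ASYNC_UNAVAILABLE


print(supports_rate("-a-general", 8000))           # True
print(supports_rate("-a-medical", 8000))           # False
print(usable_with_async("-a-rule-input-private"))  # False
```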
Elements of Speech Recognition Engines
Understanding the components and characteristics of speech recognition engines can help in selecting the appropriate engine and using the API effectively.
End to End/Hybrid
Here are the elements common to both End to End and Hybrid engines.
Supported Sampling Rates
All speech recognition engines support 16kHz. Some engines also support 8kHz sampling rate, which is commonly used for telephone audio. For more information about sampling rates, please see Sampling Rate in the audio format section.
- When recording audio yourself, record at 16kHz sampling rate and use a 16kHz engine.
- For telephone audio, use an 8kHz engine.
Connection Engine Name
For the connection engine name (grammarFileNames) in the request parameters, specify the string in the "Connection Engine Name" column of the table. For engine names published in AmiVoice API Private, please see MyPage.
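As a small illustration, the helper below places a connection engine name under the grammarFileNames key. Only the parameter name and the engine names come from this page; the helper name is hypothetical, and how the parameters are actually transmitted (HTTP or WebSocket) is outside the scope of this sketch.

```python
def recognition_params(connection_engine_name: str) -> dict:
    """Return request parameters that select a speech recognition engine."""
    name = connection_engine_name.strip()
    if not name:
        raise ValueError("connection engine name must not be empty")
    # grammarFileNames carries the value from the "Connection Engine Name" column.
    return {"grammarFileNames": name}


print(recognition_params("-a2-ja-general"))
# {'grammarFileNames': '-a2-ja-general'}
```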
Hybrid
For Japanese, multiple hybrid speech recognition engines are provided, each combining a purpose (acoustic model) with a language model. The following explains these two elements.
Purpose
There are "会話" engines optimized for transcribing natural conversations between people, and "音声入力" engines optimized for when people speak to machines. Each uses acoustic models trained on different datasets. However, the purpose is not just about the difference in acoustic models, but also includes optimizations for each specific use case.
Characteristics and Points to Note
"会話" engines are designed to easily remove unnecessary words like "えーっと" or "あのー". In the standard settings, these unnecessary words are recognized and automatically removed. You can also configure settings to deliberately display these unnecessary words. Please see Specifying Filler Word Output. When using "音声入力" engines, these words are often not judged as unnecessary and not removed, or may be misrecognized as other words.
Use Cases
- Use "会話" engines for transcribing audio from meetings, phone calls, etc.
- Use "音声入力" engines for dictation of electronic medical records, reports, emails, short messages, or for dialogues with robots or voice chatbots.
- If you can't narrow down the use case, use "会話" engines.
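The three rules above can be written as a small decision function. This is a sketch: the use-case labels such as "meeting" and "voice_chatbot" are hypothetical names invented for the example, and anything not recognized as a voice input use case falls back to "会話", following the last rule.

```python
from enum import Enum


class Purpose(Enum):
    CONVERSATION = "会話"
    VOICE_INPUT = "音声入力"


# Hypothetical use-case labels, grouped according to the list above.
VOICE_INPUT_USES = {
    "medical_record",
    "report",
    "email",
    "short_message",
    "robot_dialogue",
    "voice_chatbot",
}


def choose_purpose(use_case: str) -> Purpose:
    """Pick the engine purpose for a use case.

    Meetings, phone calls, and any use case that cannot be narrowed down
    all resolve to the conversation engines.
    """
    if use_case in VOICE_INPUT_USES:
        return Purpose.VOICE_INPUT
    return Purpose.CONVERSATION


print(choose_purpose("meeting").value)        # 会話
print(choose_purpose("voice_chatbot").value)  # 音声入力
print(choose_purpose("unknown").value)        # 会話
```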
Language Model
Each "domain" such as medical, pharmaceutical, finance, insurance has its own frequently used vocabulary and expressions. We have prepared "domain-specific" language models optimized for each of these domains.
Here's a list of Japanese language models. As they are provided as engines for each purpose, use cases for each are also explained.
Language Model | Language Model Description and Engines for Each Purpose |
---|---|
General | Can be used for transcribing speech content without limiting the purpose. Exclusively for "会話". 会話_汎用 (-a-general): For transcribing meetings, videos, and cases where input is not limited. |
Large-scale General | Can be used for dictation without limiting the purpose or transcribing voice dialogues. It has a significantly larger vocabulary than the general model. Rich in vocabulary including rarely spoken words and names of landmarks, places, facilities such as shrines, temples, castles, bridges, hot springs, zoos, aquariums, art museums, museums, dams, tunnels, etc. Exclusively for "音声入力". 音声入力_汎用 (-a-general-input): For dictation in various scenarios, voice dialogue applications, etc. |
Finance | In addition to the "General" language model, terms and expressions used in the finance industry are added. 会話_金融 (-a-bizfinance): For transcribing conversations during face-to-face sales, etc. 音声入力_金融 (-a-bizfinance-input): For voice input of daily reports, email creation, etc. |
Insurance | In addition to the "General" language model, terms and expressions used in the insurance industry are added. 会話_保険 (-a-bizinsurance): For transcribing conversations during face-to-face sales, etc. 音声入力_保険 (-a-bizinsurance-input): For voice input of daily reports, email creation, etc. |
Medical Conference | In addition to the "General" language model, various medical specialties, medical terms, and expressions used in medical industry meetings are added. It covers many disease names, drug names, hospital names, surgery names, place names, etc. Exclusively for "会話". 会話_医療 (-a-medical): For transcribing medical industry meetings, doctor-patient conversations during consultations, medical-related videos, face-to-face sales conversations, MR sales daily reports, etc. |
Medical General | Specialized for dictation of various medical documents such as electronic medical record findings, medical certificates, medical information provision documents, referral letters, care records, pharmacist medication guidance documents, etc. Exclusively for "音声入力". 音声入力_医療 (-a-medical-input): For dictation by various medical specialists including doctors and pharmacists. |
Name | Specialized for recognizing person names (full name, surname only, given name only). Speech recognition results are all output in katakana. Exclusively for "音声入力". 音声入力_氏名 (-a-name-input-private): For voice automated response systems, etc. |
Address | Specialized for recognizing addresses. Recognizes city, ward, town, and village names and street numbers nationwide. Exclusively for "音声入力". 音声入力_住所 (-a-address-input-private): For voice automated response systems, etc. |
Rule Grammar | Allows recognition of only set phrases and words that you specify using rule grammar (*3). Exclusively for "音声入力". 音声入力_ルール (-a-rule-input-private): Data entry in manufacturing and inspection/maintenance, robot operation, etc. |
- (*3) Rule grammar can use industry-standard formats such as JSGF (JSpeech Grammar Format) or SRGS (Speech Recognition Grammar Specification).
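The language model table above pairs each language model with one or two purposes. A minimal sketch of that pairing as a lookup is shown below; the dictionary restates the table, while the function name and error messages are hypothetical.

```python
# (language model, purpose) -> connection engine name, from the table above.
JAPANESE_HYBRID_ENGINES = {
    ("General", "会話"): "-a-general",
    ("Large-scale General", "音声入力"): "-a-general-input",
    ("Finance", "会話"): "-a-bizfinance",
    ("Finance", "音声入力"): "-a-bizfinance-input",
    ("Insurance", "会話"): "-a-bizinsurance",
    ("Insurance", "音声入力"): "-a-bizinsurance-input",
    ("Medical Conference", "会話"): "-a-medical",
    ("Medical General", "音声入力"): "-a-medical-input",
    ("Name", "音声入力"): "-a-name-input-private",
    ("Address", "音声入力"): "-a-address-input-private",
    ("Rule Grammar", "音声入力"): "-a-rule-input-private",
}


def engine_for(language_model: str, purpose: str) -> str:
    """Return the connection engine name for a language model and purpose."""
    key = (language_model, purpose)
    if key in JAPANESE_HYBRID_ENGINES:
        return JAPANESE_HYBRID_ENGINES[key]
    # Explain whether the model is unknown or just not offered for this purpose.
    offered = sorted(p for (m, p) in JAPANESE_HYBRID_ENGINES if m == language_model)
    if not offered:
        raise ValueError(f"unknown language model: {language_model!r}")
    raise ValueError(f"{language_model!r} is only offered for: {', '.join(offered)}")


print(engine_for("Finance", "音声入力"))       # -a-bizfinance-input
print(engine_for("Medical Conference", "会話"))  # -a-medical
```

Requesting a combination that the table does not list, such as "General" with "音声入力", raises an error that names the purposes actually offered for that language model.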
List of Class Names for Japanese Language Models
Here's a list of class names defined in Japanese speech recognition engines. Classes are used when registering words. For details, please see Word Registration. API users cannot add new classes.
Class Name | General | Large-scale General | Finance | Insurance | Medical Conference | Medical General | Name | Address | Notes |
---|---|---|---|---|---|---|---|---|---|
固有名詞 | ● | ● | ● | ● | ● | ||||
名前 | ● | ● | ● | ● | ● | Represents surname | |||
名前(名) | ● | ● | ● | ● | ● | Represents given name | |||
名前 | ● | Represents full name (*4) | |||||||
駅名 | ● | ● | ● | ● | ● | ||||
地名 | ● | ● | ● | ● | |||||
会社名 | ● | ● | ● | ● | ● | ||||
部署名 | ● | ● | ● | ● | ● | ||||
役職名 | ● | ● | ● | ● | ● | ||||
記号 | ● | ● | ● | ● | ● | ||||
括弧開き | ● | ● | ● | ● | ● | ||||
括弧閉じ | ● | ● | ● | ● | ● | ||||
元号 | ● | ● | ● | ● | ● | ● | |||
病名 | ● | ● | |||||||
薬品名 | ● | ● | |||||||
病院名 | ● | ● | |||||||
手術名 | ● | ● | |||||||
地名_区町村 | ● | ● | |||||||
地名_支庁市郡 | ● | ● | |||||||
フィラー(文頭) | ● | ● | |||||||
フィラー(文末) | ● | ● |
- (*4) The "名前" class represents the full name in the Medical General language model, but represents the surname in other language models.
- For the Rule Grammar engine, as words to be recognized are set within the rule grammar, the word registration feature is not available.
List of Class Names for Chinese Language Model
This is a list of class names defined in the Chinese speech recognition engine.
Class Name | General |
---|---|
固有名词一般 | ● |
姓 | ● |
名 | ● |
List of Class Names for Korean Language Model
This is a list of class names defined in the Korean speech recognition engine.
Class Name | General |
---|---|
固有名詞 | ● |
地名 | ● |
駅名 | ● |
会社名 | ● |
名前(姓) | ● |
名前(名) | ● |
Difference between End to End and Hybrid
Hybrid engines are traditional speech recognition engines using statistical models. Continue to use them when you need to support large industry-specific vocabularies, such as in medical fields, or when you need features that only hybrid engines provide. End to End engines are a new generation of AmiVoice speech recognition engines. For general purposes, they often achieve higher accuracy, so if you're new to AmiVoice API, try the End to End engines first. If you're already using AmiVoice API, confirm that switching to End to End will not affect your application, then consider migrating.
Characteristics of Hybrid Engines
- They perform speech recognition by combining acoustic and language models. You can use speech recognition engines with language models optimized for various domains.
- You can use classes in word registration.
- You can get accurate time information for each word in the starttime and endtime of word-level results.
Characteristics of End to End Engines
- Only engines with general vocabulary are provided.
- Classes cannot be used in word registration.
- Pronunciation information for each word cannot be obtained in word-level results.
- The time information obtained from starttime and endtime in word-level results is less accurate compared to hybrid engines.
- As an error pattern that doesn't occur in hybrid engines, some results may be repeated. This tends to happen more often with audio of significantly poor quality or excessively long audio (over 20 seconds).
- Suppression of automatic filler word deletion is not supported. All fillers are automatically deleted.
- As of 2025-03-25, word registration is not possible with End to End engines. If you need word registration, please use a hybrid engine.
- As of 2025-03-25, automatic filler deletion is not supported for Chinese End to End engines.
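The characteristics listed above can be summarized as a checklist: if an application depends on any feature that End to End engines lack, a hybrid engine is the safer choice. The sketch below is illustrative only; the `Requirements` fields and the function name are hypothetical, and the word registration entry reflects the situation as of 2025-03-25 stated above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Requirements:
    """Features an application needs; each one currently requires Hybrid."""
    domain_vocabulary: bool = False      # e.g. medical, finance, insurance
    word_registration: bool = False      # not possible with E2E as of 2025-03-25
    word_pronunciation: bool = False     # pronunciation in word-level results
    accurate_word_timing: bool = False   # precise starttime / endtime
    keep_filler_words: bool = False      # suppress automatic filler deletion


def choose_engine_family(req: Requirements) -> Tuple[str, List[str]]:
    """Return ("Hybrid" or "End to End", list of requirements forcing Hybrid)."""
    reasons = [name for name, needed in vars(req).items() if needed]
    return ("Hybrid" if reasons else "End to End"), reasons


print(choose_engine_family(Requirements()))
# ('End to End', [])
print(choose_engine_family(Requirements(domain_vocabulary=True, keep_filler_words=True)))
# ('Hybrid', ['domain_vocabulary', 'keep_filler_words'])
```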
Costs
Costs vary depending on the engine. For details, please see AmiVoice API Pricing.
About Recognition Accuracy
Words that are not in the vocabulary of the speech recognition engine will not be output. If a word not in the vocabulary is spoken, it will be recognized as a word with a similar pronunciation, a combination of shorter words with similar pronunciations, or simply as an incorrect word. Due to computational resource and time constraints, each speech recognition engine has a fixed vocabulary. General-purpose engines such as "会話_汎用" and "音声入力_汎用" have many vocabulary words registered to be usable in various scenarios, but they do not include words specific to particular industries or uses.
For specialized terms commonly used in industries such as medical, finance, and insurance, using an engine specialized for that industry can achieve higher recognition rates. For words that are specific to a particular organization, use word registration.
We have compared and reported on the difference in recognition rates between general-purpose engines and domain-specific engines on the AmiVoice Tech Blog. Please see Comparing the Speech Recognition Accuracy of AmiVoice's Domain-Specific Engines (General vs. Electronic Medical Records) and Comparison of Recognition Results between Voice Input Engine and Conversation Engine with the Same Utterance.