Speech Recognition Engines
AmiVoice API provides multiple speech recognition engines tailored for various languages and purposes. By selecting the most suitable speech recognition engine for the audio you want to recognize, you can improve accuracy. This section explains the languages supported by the speech recognition engines, types of engines, and key points for choosing the appropriate one.
We plan to consolidate and reorganize the medical engines on February 1, 2025. The engine lineup after the consolidation is described in the notes in each section below.
List of Speech Recognition Engines
Here is a list of speech recognition engines provided by AmiVoice API.
Language | Engine Name | Language Model | Supported Sampling Rates | Connection Engine Name |
---|---|---|---|---|
Japanese | 会話_汎用 | General | 8k / 16k | -a-general |
Japanese | 会話_医療 | Medical | 16k | -a-medgeneral (planned to change to -a-medical) (*2) |
Japanese | 会話_製薬 | Pharmaceutical | 16k | -a-bizmrreport |
Japanese | 会話_金融 | Finance | 16k | -a-bizfinance |
Japanese | 会話_保険 | Insurance | 16k | -a-bizinsurance |
Japanese | 音声入力_汎用 | Large-scale General | 16k | -a-general-input |
Japanese | 音声入力_医療 | Medical | 16k | -a-medgeneral-input (planned to change to -a-medical-input) (*2) |
Japanese | 音声入力_製薬 | Pharmaceutical | 16k | -a-bizmrreport-input (planned to be integrated into -a-medical-input) (*2) |
Japanese | 音声入力_保険 | Insurance | 16k | -a-bizinsurance-input |
Japanese | 音声入力_金融 | Finance | 16k | -a-bizfinance-input |
Japanese | 音声入力_電子カルテ | Electronic Medical Records | 16k | -a-medkarte-input (planned to be integrated into -a-medical-input) (*2) |
English | 英語_汎用 | General | 8k(*3) / 16k | -a-general-en |
Chinese | 中国語_汎用 | General | 8k(*3) / 16k | -a-general-zh |
Korean(*1) | 韓国語_汎用 | General | 8k(*3) / 16k | -a-general-ko |
- (*1) Korean is not supported by the asynchronous API. Support is planned for the future.
- (*2) Planned to change on February 1, 2025. Users who used the old engine names on or before October 30, 2024 can continue to use the old connection engine names after the change. Existing applications are not affected, but we recommend switching to the new engine names. The new engines will be available from November 1.
The correspondence between current speech recognition engines and new engines is as shown in the table below.
Current Engine Name | Current Connection Engine Name | New Engine Name | New Connection Engine Name |
---|---|---|---|
会話_医療 | -a-medgeneral | 会話_医療 | -a-medical |
会話_製薬 | -a-bizmrreport | 会話_医療 | -a-medical |
音声入力_医療 | -a-medgeneral-input | 音声入力_医療 | -a-medical-input |
音声入力_製薬 | -a-bizmrreport-input | 音声入力_医療 | -a-medical-input |
音声入力_電子カルテ | -a-medkarte-input | 音声入力_医療 | -a-medical-input |
- (*3) The 8k engines for English, Chinese, and Korean are not supported by the asynchronous API. Support is planned for the future.
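For applications that store connection engine names in configuration, the correspondence table above can be expressed as a small lookup. This is a minimal sketch, not part of the AmiVoice API: `OLD_TO_NEW_ENGINE` and `migrate_engine_name` are hypothetical names, and the entry for -a-bizmrreport assumes that the merged cell in the table means 会話_製薬 maps to -a-medical.

```python
# Old-to-new connection engine names, copied from the correspondence table.
# Assumption: -a-bizmrreport shares the "-a-medical" cell with -a-medgeneral.
OLD_TO_NEW_ENGINE = {
    "-a-medgeneral": "-a-medical",
    "-a-bizmrreport": "-a-medical",
    "-a-medgeneral-input": "-a-medical-input",
    "-a-bizmrreport-input": "-a-medical-input",
    "-a-medkarte-input": "-a-medical-input",
}


def migrate_engine_name(name: str) -> str:
    """Return the post-consolidation name, or the input unchanged if unaffected."""
    return OLD_TO_NEW_ENGINE.get(name, name)


if __name__ == "__main__":
    for old in ("-a-medkarte-input", "-a-general"):
        print(old, "->", migrate_engine_name(old))
```

Names that are not part of the consolidation, such as -a-general, pass through unchanged, so the helper can be applied to every configured engine name.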
Engine Name
For Japanese, multiple speech recognition engines are provided, each combining a purpose (acoustic model) with a language model.
Purpose
There are "会話" engines, optimized for transcribing natural conversations between people, and "音声入力" engines, optimized for speech directed at machines. Each uses acoustic models trained on different datasets. The purpose covers more than the acoustic model, however: each engine is also optimized for its specific use case.
Characteristics and Points to Note
"会話" engines are designed so that filler words such as "えーっと" or "あのー" can be removed easily. With the default settings, these words are recognized and automatically removed from the output. You can also configure the engine to output them deliberately; see Specifying Filler Word Output. "音声入力" engines often do not treat these words as fillers, so they may remain in the output or be misrecognized as other words.
Use Cases
- Use "会話" engines for transcribing audio from meetings, phone calls, etc.
- Use "音声入力" engines for dictation of electronic medical records, reports, emails, short messages, or for dialogues with robots or voice chatbots.
- If you can't narrow down the use case, use "会話" engines.
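The guidance in the list above can be sketched as a lookup with a default. The use-case labels and the function name are hypothetical and chosen only for illustration; the one rule taken from the text is that an unknown or unspecified use case falls back to "会話".

```python
from typing import Optional

# Hypothetical use-case labels mapped to the purpose the list recommends.
PURPOSE_BY_USE_CASE = {
    "meeting": "会話",
    "phone_call": "会話",
    "dictation": "音声入力",
    "voice_chatbot": "音声入力",
    "robot_dialogue": "音声入力",
}


def recommend_purpose(use_case: Optional[str] = None) -> str:
    """Return "会話" or "音声入力"; fall back to "会話" when the use case is unclear."""
    return PURPOSE_BY_USE_CASE.get(use_case, "会話")
```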
Language Model
Each "domain" such as medical, pharmaceutical, finance, insurance has its own frequently used vocabulary and expressions. We have prepared "domain-specific" language models optimized for each of these domains.
Here is a list of the Japanese language models. Because each language model is provided as one or more purpose-specific engines, the table also gives the use cases for each engine.
Language Model | Language Model Description and Engines for Each Purpose |
---|---|
General | Can be used for transcribing speech content without limiting the purpose. Exclusively for "会話". 会話_汎用( -a-general ): For transcribing meetings, videos, and cases where input is not limited. |
Large-scale General | Can be used for dictation without limiting the purpose or transcribing voice dialogues. It has a significantly larger vocabulary than the general model. Rich in vocabulary including rarely spoken words and names of landmarks, places, facilities such as shrines, temples, castles, bridges, hot springs, zoos, aquariums, art museums, museums, dams, tunnels, etc. Exclusively for "音声入力". 音声入力_汎用( -a-general-input ): For dictation in various scenarios, voice dialogue applications, etc. |
Finance | In addition to the "General" language model, terms and expressions used in the finance industry are added. 会話_金融( -a-bizfinance ): For transcribing conversations during face-to-face sales, etc. 音声入力_金融( -a-bizfinance-input ): For voice input of daily reports, email creation, etc. |
Insurance | In addition to the "General" language model, terms and expressions used in the insurance industry are added. 会話_保険( -a-bizinsurance ): For transcribing conversations during face-to-face sales, etc. 音声入力_保険( -a-bizinsurance-input ): For voice input of daily reports, email creation, etc. |
Medical | In addition to the "General" language model, various medical specialties, medical terms, and expressions used in medical industry meetings are added. It covers many disease names, drug names, hospital names, surgery names, place names, etc. 会話_医療( -a-medgeneral ): For transcribing medical industry meetings, doctor-patient conversations during consultations, medical-related videos, etc. 音声入力_医療( -a-medgeneral-input ): For voice input of care records, medical-related information, etc. |
Pharmaceutical | In addition to the "Medical" language model, many pharmaceutical industry terms and expressions are added. It covers many disease names, drug names, hospital names, etc. 会話_製薬( -a-bizmrreport ): For transcribing conversations during face-to-face sales, etc. 音声入力_製薬( -a-bizmrreport-input ): For creating pharmacist medication guidance documents, voice input of MR sales daily reports, etc. |
Electronic Medical Records | Specialized for dictation of various medical documents such as electronic medical record findings, medical certificates, medical information provision documents, referral letters, etc. Exclusively for "音声入力". 音声入力_電子カルテ( -a-medkarte-input ): For dictation of electronic medical records in various medical specialties. |
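Because not every language model is offered for both purposes, selecting an engine amounts to looking up a (purpose, language model) pair. The sketch below encodes the current table, before the planned consolidation. The `ENGINES` dictionary and `connection_engine_name` function are hypothetical helpers, not part of the API.

```python
# (purpose, language model) -> connection engine name, from the table above.
ENGINES = {
    ("会話", "General"): "-a-general",
    ("音声入力", "Large-scale General"): "-a-general-input",
    ("会話", "Finance"): "-a-bizfinance",
    ("音声入力", "Finance"): "-a-bizfinance-input",
    ("会話", "Insurance"): "-a-bizinsurance",
    ("音声入力", "Insurance"): "-a-bizinsurance-input",
    ("会話", "Medical"): "-a-medgeneral",
    ("音声入力", "Medical"): "-a-medgeneral-input",
    ("会話", "Pharmaceutical"): "-a-bizmrreport",
    ("音声入力", "Pharmaceutical"): "-a-bizmrreport-input",
    ("音声入力", "Electronic Medical Records"): "-a-medkarte-input",
}


def connection_engine_name(purpose: str, language_model: str) -> str:
    """Look up the engine for a purpose and language model, or raise ValueError."""
    try:
        return ENGINES[(purpose, language_model)]
    except KeyError:
        # List the purposes that do exist for this language model, if any.
        available = sorted(p for (p, m) in ENGINES if m == language_model)
        raise ValueError(
            f"no {purpose!r} engine for {language_model!r}; available purposes: {available}"
        ) from None
```

Raising an error for combinations such as ("会話", "Electronic Medical Records") makes the "exclusively for" restrictions in the table visible at configuration time rather than at request time.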
We plan to consolidate and reorganize medical engines on February 1, 2025. After the consolidation, the list of language models will be as follows:
- New "Medical Conference" and "Medical General" language models have been added
- The "Pharmaceutical" language model will be integrated into the "Medical Conference" language model
Language Model | Language Model Description and Engines for Each Purpose |
---|---|
General | Can be used for transcribing speech content without limiting the purpose. Exclusively for "会話". 会話_汎用( -a-general ): For transcribing meetings, videos, and cases where input is not limited. |
Large-scale General | Can be used for dictation without limiting the purpose or transcribing voice dialogues. It has a significantly larger vocabulary than the general model. Rich in vocabulary including rarely spoken words and names of landmarks, places, facilities such as shrines, temples, castles, bridges, hot springs, zoos, aquariums, art museums, museums, dams, tunnels, etc. Exclusively for "音声入力". 音声入力_汎用( -a-general-input ): For dictation in various scenarios, voice dialogue applications, etc. |
Finance | In addition to the "General" language model, terms and expressions used in the finance industry are added. 会話_金融( -a-bizfinance ): For transcribing conversations during face-to-face sales, etc. 音声入力_金融( -a-bizfinance-input ): For voice input of daily reports, email creation, etc. |
Insurance | In addition to the "General" language model, terms and expressions used in the insurance industry are added. 会話_保険( -a-bizinsurance ): For transcribing conversations during face-to-face sales, etc. 音声入力_保険( -a-bizinsurance-input ): For voice input of daily reports, email creation, etc. |
Medical Conference | In addition to the "General" language model, various medical specialties, medical terms, and expressions used in medical industry meetings are added. It covers many disease names, drug names, hospital names, surgery names, place names, etc. Exclusively for "会話". 会話_医療( -a-medical ): For transcribing medical industry meetings, doctor-patient conversations during consultations, medical-related videos, face-to-face sales conversations, MR sales daily reports, etc. |
Medical General | Specialized for dictation of various medical documents such as electronic medical record findings, medical certificates, medical information provision documents, referral letters, care records, pharmacist medication guidance documents, etc. Exclusively for "音声入力". 音声入力_医療( -a-medical-input ): For dictation by various medical specialists including doctors and pharmacists. |
List of Class Names for Japanese Language Models
Here's a list of class names defined in Japanese speech recognition engines. Classes are used when registering words. For details, please see Word Registration. API users cannot add new classes.
Class Name | General | Large-scale General | Finance | Insurance | Pharmaceutical | Medical | Electronic Medical Records | Notes |
---|---|---|---|---|---|---|---|---|
固有名詞 | ● | ● | ● | ● | ● | ● | ||
名前 | ● | ● | ● | ● | ● | ● | Represents surname | |
名前(名) | ● | ● | ● | ● | ● | ● | Represents given name | |
名前 | ● | Represents full name *1 | ||||||
駅名 | ● | ● | ● | ● | ● | ● | ||
地名 | ● | ● | ● | ● | ● | |||
会社名 | ● | ● | ● | ● | ● | ● | ||
部署名 | ● | ● | ● | ● | ● | ● | ||
役職名 | ● | ● | ● | ● | ● | ● | ||
記号 | ● | ● | ● | ● | ● | ● | ||
括弧開き | ● | ● | ● | ● | ● | ● | ||
括弧閉じ | ● | ● | ● | ● | ● | ● | ||
元号 | ● | ● | ● | ● | ● | ● | ● | |
病名 | ● | ● | ● | |||||
薬品名 | ● | ● | ● | |||||
病院名 | ● | ● | ● | |||||
手術名 | ● | ● | ||||||
地名_区町村 | ● | ● | ||||||
地名_支庁市郡 | ● | ● |
- (*1) The 名前 class represents a full name in the Electronic Medical Records language model, but a surname in the other language models.
We plan to consolidate and reorganize medical engines on February 1, 2025. After the consolidation, the list of class names will be as follows:
- New "Medical Conference" and "Medical General" language models have been added
- The "Pharmaceutical" language model will be integrated into the "Medical Conference" language model
Class Name | General | Large-scale General | Finance | Insurance | Medical Conference | Medical General | Notes |
---|---|---|---|---|---|---|---|
固有名詞 | ● | ● | ● | ● | ● | ||
名前 | ● | ● | ● | ● | ● | Represents surname | |
名前(名) | ● | ● | ● | ● | ● | Represents given name | |
名前 | ● | Represents full name *1 | |||||
駅名 | ● | ● | ● | ● | ● | ||
地名 | ● | ● | ● | ● | |||
会社名 | ● | ● | ● | ● | ● | ||
部署名 | ● | ● | ● | ● | ● | ||
役職名 | ● | ● | ● | ● | ● | ||
記号 | ● | ● | ● | ● | ● | ||
括弧開き | ● | ● | ● | ● | ● | ||
括弧閉じ | ● | ● | ● | ● | ● | ||
元号 | ● | ● | ● | ● | ● | ● | |
病名 | ● | ● | |||||
薬品名 | ● | ● | |||||
病院名 | ● | ● | |||||
手術名 | ● | ● | |||||
地名_区町村 | ● | ● | |||||
地名_支庁市郡 | ● | ● |
List of Class Names for Chinese Language Model
This is a list of class names defined in the Chinese speech recognition engine.
Class Name | General |
---|---|
固有名词一般 | ● |
姓 | ● |
名 | ● |
List of Class Names for Korean Language Model
This is a list of class names defined in the Korean speech recognition engine.
Class Name | General |
---|---|
固有名詞 | ● |
地名 | ● |
駅名 | ● |
会社名 | ● |
名前(姓) | ● |
名前(名) | ● |
Supported Sampling Rates
All speech recognition engines support 16kHz. Some engines support 8kHz sampling rate, which is commonly used in telephone communications. For information on sampling rates, please see Sampling Rate in the audio format section.
- When recording audio yourself, record at 16kHz sampling rate and use a 16kHz engine.
- For telephone audio, use an 8kHz engine.
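The sampling-rate guidance and the footnotes of the engine list can be checked before sending audio. The following is a minimal sketch under stated assumptions: the engine sets are copied from the "Supported Sampling Rates" column and footnotes (*1) and (*3), the function names are hypothetical, and only the 8kHz case is checked because this section does not say how other rates are handled.

```python
import wave
from typing import List

# Connection engine names whose "Supported Sampling Rates" column lists 8k.
ENGINES_WITH_8K = {"-a-general", "-a-general-en", "-a-general-zh", "-a-general-ko"}
# 8k engines that footnote (*3) excludes from the asynchronous API.
NO_ASYNC_8K = {"-a-general-en", "-a-general-zh", "-a-general-ko"}


def wav_sample_rate(source) -> int:
    """Read the sampling rate in Hz from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def check_engine(engine: str, sample_rate_hz: int, asynchronous: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if sample_rate_hz == 8000:
        if engine not in ENGINES_WITH_8K:
            problems.append(f"{engine} does not list 8k as a supported sampling rate")
        elif asynchronous and engine in NO_ASYNC_8K:
            problems.append(f"8k audio with {engine} is not supported by the asynchronous API (*3)")
    if asynchronous and engine == "-a-general-ko":
        problems.append("Korean is not supported by the asynchronous API (*1)")
    return problems
```

For example, `check_engine("-a-bizfinance", wav_sample_rate("call.wav"))` would report a problem for an 8kHz telephone recording, where `call.wav` is a placeholder file name.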
Connection Engine Names
For the Connection Engine Name (grammarFileNames) in the Request Parameters, specify the string shown in the "Connection Engine Name" column of the engine list table. For the names of engines published in AmiVoice API Private, see My Page.
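As an illustration of where the connection engine name goes, the sketch below builds a request-parameter string containing grammarFileNames. It assumes that parameters are passed as space-separated key=value pairs with URL-encoded values; that format, the function name, and the `**options` mechanism are assumptions to verify against the Request Parameters reference, and no network call is made.

```python
from urllib.parse import quote


def build_request_parameters(engine: str, **options: str) -> str:
    """Join grammarFileNames and any extra parameters into one string.

    Assumption: key=value pairs separated by spaces, values URL-encoded.
    """
    if not engine:
        raise ValueError("a connection engine name is required")
    pairs = {"grammarFileNames": engine, **options}
    return " ".join(f"{key}={quote(value, safe='')}" for key, value in pairs.items())


if __name__ == "__main__":
    print(build_request_parameters("-a-general"))
```

Connection engine names such as -a-general contain only unreserved characters, so URL-encoding leaves them unchanged.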
Costs
Costs vary depending on the engine. For details, please see AmiVoice API Pricing.
About Recognition Accuracy
Words that are not in the vocabulary of the speech recognition engine will not be output. If a word not in the vocabulary is spoken, it will be recognized as a word with a similar pronunciation, a combination of shorter words with similar pronunciations, or simply as an incorrect word. Due to computational resource and time constraints, each speech recognition engine has a fixed vocabulary. General-purpose engines such as "会話_汎用" and "音声入力_汎用" have many vocabulary words registered to be usable in various scenarios, but they do not include words specific to particular industries or uses.
For specialized terms used in industries such as medical, finance, and insurance, an engine specialized for that industry can achieve higher recognition rates for the words common in that field. For words specific to a particular organization, use word registration.
We have compared and reported on the difference in recognition rates between general-purpose engines and domain-specific engines on the AmiVoice Tech Blog. Please see Comparing the Speech Recognition Accuracy of AmiVoice's Domain-Specific Engines (General vs. Electronic Medical Records) and Comparison of Recognition Results between Voice Input Engine and Conversation Engine with the Same Utterance.