Features
AmiVoice API has the following features:
- Receives audio data and returns the spoken content converted to text.
- Supports both a file-based interface and a streaming interface that returns results in real time. Please see Appropriate Use of Interface.
- Uses text-based protocols over HTTP or WebSocket, so the client environment needs only TCP/IP connectivity and no special libraries.
- Ensures secure communication through HTTPS and WSS encryption.
- Detects human speech in the transmitted audio and recognizes it, charging only for the duration of detected speech. Please see AmiVoice API Pricing.
- Returns speech recognition results in JSON format, including not only the estimated spoken text but also speech start and end times, token-level timing information, confidence scores, and more.
- Supports various languages. Please see Supported Languages.
- Automatically inserts punctuation.
- Automatically removes filler words such as "えーっと" and "あのー" (Japanese equivalents of "um" and "uh"). These can be retained if needed, for example, when analyzing employee speech patterns in call centers.
- Offers multiple speech recognition engines (combinations of language models and acoustic models) to select the optimal engine for various languages, domains, and use cases.
- Allows users to add unrecognized words through custom vocabulary registration.
- When the speaker diarization feature is enabled, estimates who is speaking and when in audio containing multiple speakers.
- When the sentiment analysis feature is enabled, analyzes the speaker's sentiment alongside speech recognition.
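As an illustration of the JSON results described above, the sketch below parses a hypothetical recognition response. The sample payload and its field names (`results`, `tokens`, `written`, `confidence`, `starttime`, `endtime`, `text`) are assumptions modeled on the fields the feature list mentions (spoken text, start/end times, token-level timing, confidence scores); consult the API reference for the authoritative schema.

```python
import json

# Hypothetical response payload -- the field names below are illustrative
# assumptions based on the documented result contents, not a guaranteed schema.
sample_response = """
{
  "results": [
    {
      "tokens": [
        {"written": "Hello", "confidence": 0.98, "starttime": 120, "endtime": 480},
        {"written": "world", "confidence": 0.95, "starttime": 520, "endtime": 900}
      ],
      "starttime": 120,
      "endtime": 900,
      "text": "Hello world"
    }
  ],
  "text": "Hello world"
}
"""

response = json.loads(sample_response)

# Overall transcript of the audio.
print(response["text"])

# Token-level details: surface form, confidence score, and timing
# (assumed here to be in milliseconds from the start of the audio).
for result in response["results"]:
    for token in result["tokens"]:
        print(token["written"], token["confidence"],
              token["starttime"], token["endtime"])
```

Because the response is plain JSON over HTTP or WebSocket, any language with a JSON parser can consume it the same way.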