Features

AmiVoice API has the following features:

  • Receives audio data and returns the spoken content converted to text.
  • Supports both a file-based interface and a streaming interface that returns results in real time. Please see Appropriate Use of Interface.
  • Uses text-based protocols via HTTP or WebSocket, requiring only TCP/IP availability in the client environment without the need for special libraries.
  • Ensures secure communication through HTTPS and WSS encryption.
  • Detects and recognizes human speech in the transmitted audio data, charging only for the recognized speech duration. Please see AmiVoice API Pricing.
  • Returns speech recognition results in JSON format, including not only the estimated spoken text but also speech start and end times, token-level timing information, confidence scores, and more.
  • Supports various languages. Please see Supported Languages.
  • Automatically inserts punctuation.
  • Automatically removes filler words such as "えーっと" and "あのー". These can be retained if needed, for example, when analyzing employee speech patterns in call centers.
  • Offers multiple speech recognition engines (combinations of language models and acoustic models) to select the optimal engine for various languages, domains, and use cases.
  • Allows users to add unrecognized words through custom vocabulary registration.
  • Estimates who is speaking and when in audio containing multiple speakers, when the speaker diarization feature is enabled.
  • Performs sentiment analysis on the speech at the same time as recognition, when the sentiment analysis feature is enabled.
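As noted above, recognition results arrive as JSON containing the estimated text along with timing and confidence information. The sketch below shows how a client might pull the transcript and token-level details out of such a payload; the field names used here ("results", "tokens", "written", "starttime", and so on) are illustrative assumptions, not the documented response schema.

```python
import json

# A hypothetical recognition result in the shape described above:
# an overall transcript plus per-token timing (ms) and confidence.
# Field names are illustrative assumptions, not the actual schema.
sample_response = json.dumps({
    "results": [
        {
            "text": "hello world",
            "starttime": 120,
            "endtime": 1480,
            "tokens": [
                {"written": "hello", "confidence": 0.98,
                 "starttime": 120, "endtime": 600},
                {"written": "world", "confidence": 0.95,
                 "starttime": 640, "endtime": 1480},
            ],
        }
    ],
    "text": "hello world",
})


def summarize(response_text: str):
    """Extract the transcript and (word, confidence, start, end) tuples."""
    data = json.loads(response_text)
    transcript = data.get("text", "")
    tokens = [
        (t["written"], t["confidence"], t["starttime"], t["endtime"])
        for segment in data.get("results", [])
        for t in segment.get("tokens", [])
    ]
    return transcript, tokens


transcript, tokens = summarize(sample_response)
print(transcript)
for written, confidence, start, end in tokens:
    print(f"{written}: confidence {confidence:.2f}, {start}-{end} ms")
```

In a real client, `sample_response` would be the body returned by the HTTP or WebSocket interface; the parsing logic is the same either way.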