Features

AmiVoice API has the following features:

  • Receives audio data and returns the spoken content converted to text.
  • Supports both a file-based interface and a streaming interface that returns results in real time. Please see Appropriate Use of Interface.
  • Uses text-based protocols via HTTP or WebSocket, requiring only TCP/IP availability in the client environment without the need for special libraries.
  • Ensures secure communication through HTTPS and WSS encryption.
  • Detects and recognizes human speech in the transmitted audio data, charging only for the recognized speech duration. Please see AmiVoice API Pricing.
  • Returns speech recognition results in JSON format, including not only the estimated spoken text but also speech start and end times, token-level timing information, confidence scores, and more.
  • Supports various languages. Please see Supported Languages.
  • Automatically inserts punctuation.
  • Automatically removes filler words such as "えーっと" and "あのー". These can be retained if needed, for example, when analyzing employee speech patterns in call centers.
  • Offers multiple speech recognition engines (combinations of language models and acoustic models) to select the optimal engine for various languages, domains, and use cases.
  • Allows users to add unrecognized words through custom vocabulary registration.
  • Estimates who is speaking and when in audio containing multiple speakers, when the speaker diarization feature is enabled.
  • Performs sentiment analysis on the speech at the same time as recognition, when the sentiment analysis feature is enabled.
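As noted above, recognition results arrive as JSON containing the estimated text along with timing and confidence information. The sketch below shows how a client might pull the transcript and token-level details out of such a payload; the field names used here ("results", "tokens", "written", "starttime", and so on) are illustrative assumptions, not the documented response schema.

```python
import json

# A hypothetical recognition result in the shape described above:
# an overall transcript plus per-token timing (ms) and confidence.
# Field names are illustrative assumptions, not the actual schema.
sample_response = json.dumps({
    "results": [
        {
            "text": "hello world",
            "starttime": 120,
            "endtime": 1480,
            "tokens": [
                {"written": "hello", "confidence": 0.98,
                 "starttime": 120, "endtime": 600},
                {"written": "world", "confidence": 0.95,
                 "starttime": 640, "endtime": 1480},
            ],
        }
    ],
    "text": "hello world",
})


def summarize(response_text: str):
    """Extract the transcript and (word, confidence, start, end) tuples."""
    data = json.loads(response_text)
    transcript = data.get("text", "")
    tokens = [
        (t["written"], t["confidence"], t["starttime"], t["endtime"])
        for segment in data.get("results", [])
        for t in segment.get("tokens", [])
    ]
    return transcript, tokens


transcript, tokens = summarize(sample_response)
print(transcript)
for written, confidence, start, end in tokens:
    print(f"{written}: confidence {confidence:.2f}, {start}-{end} ms")
```

In a real client, `sample_response` would be the body returned by the HTTP or WebSocket interface; the parsing logic is the same either way.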