Introduction
AmiVoice API is a speech recognition API that converts speech to text. When you send audio, it returns the spoken content as text. You can create speech-enabled applications such as transcription of meetings or voice dialogue systems.
Documentation Structure
Please see the "Introduction and Operation Guide" for security and operational information before implementation, the "Development Guide" for implementation details, the "Reference" for API specifications, and the "Help" section if you encounter any issues.
📄️ Introduction and Operation Guide
Summarizes information necessary for security, compliance, and operation.
📄️ Development Guide
Explains detailed information necessary for development, such as how to use the API according to your purpose, requests, and responses.
📄️ Reference
API Reference
📄️ Help
Troubleshooting and how to make inquiries
Quick Start
Obtain an APPKEY
Register from the User Registration Page and note down the APPKEY displayed under [Connection Information] on your MyPage. Set it as an environment variable with the command for your environment.
- macOS / Linux
- Windows (PowerShell)
- Windows (Command Prompt)
export APPKEY=your_appkey_here
$env:APPKEY = "your_appkey_here"
set APPKEY=your_appkey_here
The AmiVoice Tech Blog provides step-by-step instructions on how to register as a user and convert an audio file to text with AmiVoice API; please refer to it for more details.
Prepare an Audio File
Prepare the audio file you want to transcribe. You can use the following sample audio (test.wav) as is.
For information on supported audio file formats, please see Audio Formats.
Execute Speech Recognition
Execute the following. Replace test.wav with the path to your audio file.
- curl (macOS / Linux)
- curl (Windows PowerShell)
- curl (Windows Command Prompt)
- Python
curl https://acp-api.amivoice.com/v1/recognize \
-F d=-a-general \
-F u=$APPKEY \
-F a=@test.wav | jq
- If the curl command is not installed, download the package for your OS from https://curl.se/ or install curl with a package manager.
- The result text is Unicode-escaped. In the above command, jq is used to format the response for better readability. If jq is not installed, try executing the command without the | jq part. The jq command can be downloaded for your OS from https://stedolan.github.io/jq/ or installed with a package manager.
curl.exe https://acp-api.amivoice.com/v1/recognize `
-F d=-a-general `
-F u=$env:APPKEY `
-F a=@test.wav | jq
- In PowerShell, curl is an alias for Invoke-WebRequest, so specify curl.exe explicitly. Windows 10 version 1803 and later includes curl.exe by default; if it's not included, please install it from https://curl.se/.
- The result text is Unicode-escaped. In the above command, jq is used to format the response for better readability. If jq is not installed, try executing the command without the | jq part. The jq command can be downloaded for your OS from https://stedolan.github.io/jq/ or installed with a package manager.
curl https://acp-api.amivoice.com/v1/recognize ^
-F d=-a-general ^
-F u=%APPKEY% ^
-F a=@test.wav
- Windows 10 version 1803 and later includes curl by default; if it's not included, please install it from https://curl.se/.
- The result text is Unicode-escaped. The above command does not pipe the output through jq, so the escapes are shown as is; if jq is installed, append | jq to format the response for better readability. The jq command can be downloaded for your OS from https://stedolan.github.io/jq/ or installed with a package manager.
import os
import requests

with open("test.wav", "rb") as f:
    response = requests.post(
        "https://acp-api.amivoice.com/v1/recognize",
        data={"d": "-a-general", "u": os.environ["APPKEY"]},
        files={"a": f},
    )

data = response.json()  # the JSON parser automatically converts Unicode escapes to Japanese
print(data)
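As noted in the curl tabs above, the raw response body is Unicode-escaped; any JSON parser restores the original characters, which is why jq (or Python's json module) displays readable Japanese. A minimal sketch, using a made-up fragment standing in for the response body:

```python
import json

# Made-up fragment standing in for the raw, Unicode-escaped response body.
raw = '{"text": "\\u30c6\\u30b9\\u30c8"}'

# json.loads decodes the \uXXXX escapes back into the original characters.
data = json.loads(raw)
print(data["text"])  # → テスト
```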
Check the Results
If successful, JSON like the following is returned. The transcription result is included in the text field.
{
"results": [
{
"tokens": [ ... ],
"confidence": 0.998,
"starttime": 250,
"endtime": 8794,
"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
}
],
"utteranceid": "20220602/14/018122d637320a301bc194c9_20220602_141433",
"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。",
"code": "",
"message": ""
}
For details on the response content, please see Speech Recognition Result Format.
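Once the response is parsed (for example by response.json() in the Python tab), the fields shown above can be read directly. A minimal sketch of pulling out the transcription, assuming, based on the empty code and message fields in the successful sample, that a non-empty code indicates a failure:

```python
def extract_text(data: dict) -> str:
    """Return the full transcription text from a parsed recognition response.

    Assumes a non-empty "code" signals an error, as suggested by the empty
    "code"/"message" fields in the successful sample above.
    """
    if data.get("code"):
        raise RuntimeError(f"recognition failed: {data['code']} {data.get('message', '')}")
    return data["text"]

# Usage with a trimmed version of the sample response above:
sample = {
    "results": [{"confidence": 0.998, "starttime": 250, "endtime": 8794,
                 "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"}],
    "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。",
    "code": "",
    "message": "",
}
print(extract_text(sample))
```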
Next Steps
The Quick Start used the Synchronous HTTP interface. If you want to handle real-time audio sources, you can use the WebSocket interface, and if you want to process large audio files exceeding 15MB, you can use the Asynchronous HTTP interface. For each use case and points to consider when choosing an interface, please see Interface Type and Usage.
📄️ Synchronous HTTP Interface
Simple implementation, optimal for short audio files
📄️ WebSocket Interface
Streaming
📄️ Asynchronous HTTP Interface
Large files and batch processing
We also provide client libraries and sample programs to support development.
📄️ Client Libraries
📄️ Sample Programs
For improving speech recognition accuracy, you can utilize the following features:
📄️ Speech Recognition Engines
You can change the speech recognition engine based on the domain
📄️ User Dictionary
You can pre-register technical terms and proper nouns
📄️ Rule Grammar
Improves accuracy by limiting patterns
We also provide additional features such as speaker diarization and sentiment analysis. Please use them according to your purpose.
📄️ Speaker Diarization
Separates audio containing multiple speakers by speaker and identifies who spoke when.
📄️ Sentiment Analysis
Analyzes emotions from speech and can identify the speaker's emotional state.
We also provide features to support the operation of your built services.
📄️ Usage Aggregation Tags
You can aggregate usage time for each tag set during requests.
Please also see the comprehensive Development Guide.