Transcribing Short Audio Files
You can convert short audio files (under 16MB) that you have on hand to text by sending them to the endpoint of the AmiVoice API's HTTP interface. Instead of writing a program, this tutorial explains how to use the API with the curl and jq commands. For long audio files, please see the next tutorial, "Transcribing Long Audio Files".
Preparation
To execute this tutorial, you'll need the following:
- curl
- jq
- Register for AmiVoice API and obtain an APPKEY
- Prepare an audio file you want to transcribe
We use the jq command to format the results for better readability. jq is optional: you can still transcribe audio in this tutorial without installing it.
curl
Please check whether the curl command is installed on your system.
curl -V
If the version is not displayed, please download the package for your OS from https://curl.se/ or use a package manager to install curl.
jq
Please check whether the jq command is installed on your system.
jq -V
If the version is not displayed, please download the package for your OS from https://stedolan.github.io/jq/ or use a package manager to install jq.
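The two checks above can also be run in one step. The following is a minimal sketch, assuming a POSIX shell; `check_tool` is a hypothetical helper name introduced only for this example.

```shell
#!/bin/sh
# check_tool (hypothetical helper): report whether a command is on PATH.
# Returns 0 when the command is found and 1 otherwise.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found" >&2
    return 1
  fi
}

# Check both tools used in this tutorial; jq is optional, so a missing
# tool is only reported and does not stop the script.
for tool in curl jq; do
  check_tool "$tool" || true
done
```

If curl is reported as not found, install it before continuing. A missing jq only affects the formatting steps.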
Obtaining an APPKEY
- Register for AmiVoice API.
- Log in to your account page, and record the APPKEY listed under [Common Connection Information] in the [Connection Information] tab.
The AmiVoice Tech Blog details the steps to obtain an APPKEY. Please also see Let's Try Using AmiVoice API.
Audio File
Prepare an audio file you want to transcribe. Here, we'll use the audio file (test.wav) included with the sample program in the client library.
- When preparing an audio file, please pay attention to the supported audio file formats. For supported formats, please see About Audio Formats.
- If you want to send an audio file larger than 16MB, please see the next tutorial, Transcribing Long Audio Files.
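Before sending a file, you may want to confirm that it is under the size limit. The following is a minimal sketch; `is_short_audio` and `MAX_BYTES` are hypothetical names, and reading "16MB" as 16 × 1024 × 1024 bytes is an assumption, so check the official documentation for the exact threshold.

```shell
#!/bin/sh
# ASSUMPTION: "16MB" is interpreted as 16 * 1024 * 1024 bytes.
MAX_BYTES=$((16 * 1024 * 1024))

# is_short_audio (hypothetical helper): succeeds when the file exists and
# is smaller than MAX_BYTES. Returns 2 when the file does not exist.
is_short_audio() {
  [ -f "$1" ] || return 2
  # wc -c prints the byte count; tr removes padding some systems add.
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -lt "$MAX_BYTES" ]
}

# Usage:
#   is_short_audio test.wav && echo "OK for this tutorial"
```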
Execution
Launch a terminal, then copy and run the following command. Replace test.wav with the path to your prepared audio file, and replace {APP_KEY} with your own APPKEY.
curl https://acp-api.amivoice.com/v1/recognize \
-F d=-a-general \
-F u={APP_KEY} \
-F a=@test.wav
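If you run this command often, you may prefer not to paste the key into the command each time. The following is a minimal sketch that wraps the same request in a shell function. `recognize` and the `AMIVOICE_APPKEY` environment variable are names chosen for this example, not part of the AmiVoice API.

```shell
#!/bin/sh
# recognize (hypothetical helper): send one audio file to the endpoint used
# in this tutorial, reading the APPKEY from the AMIVOICE_APPKEY environment
# variable (a name assumed for this sketch).
recognize() {
  audio_file=$1
  if [ -z "${AMIVOICE_APPKEY:-}" ]; then
    echo "AMIVOICE_APPKEY is not set" >&2
    return 1
  fi
  # Same form fields as the command above: engine, APPKEY, audio file.
  curl https://acp-api.amivoice.com/v1/recognize \
    -F d=-a-general \
    -F u="$AMIVOICE_APPKEY" \
    -F a=@"$audio_file"
}

# Usage:
#   export AMIVOICE_APPKEY='your key'
#   recognize test.wav
```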
Results
If the execution is successful, you'll get results in JSON format like this:
{"results":[{"tokens":[{"written":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2","confidence":1.00,"starttime":522,"endtime":1578,"spoken":"\u3042\u3069\u3070\u3093\u3059\u3068\u3081\u3067\u3043\u3042"},{"written":"\u306f","confidence":1.00,"starttime":1578,"endtime":1866,"spoken":"\u306f"},{"written":"\u3001","confidence":0.72,"starttime":1866,"endtime":2026,"spoken":"_"},{"written":"\u4eba","confidence":1.00,"starttime":2026,"endtime":2314,"spoken":"\u3072\u3068"},{"written":"\u3068","confidence":1.00,"starttime":2314,"endtime":2426,"spoken":"\u3068"},{"written":"\u6a5f\u68b0","confidence":1.00,"starttime":2426,"endtime":2826,"spoken":"\u304d\u304b\u3044"},{"written":"\u3068","confidence":1.00,"starttime":2826,"endtime":2938,"spoken":"\u3068"},{"written":"\u306e","confidence":1.00,"starttime":2938,"endtime":3082,"spoken":"\u306e"},{"written":"\u81ea\u7136","confidence":1.00,"starttime":3082,"endtime":3434,"spoken":"\u3057\u305c\u3093"},{"written":"\u306a","confidence":1.00,"starttime":3434,"endtime":3530,"spoken":"\u306a"},{"written":"\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3","confidence":1.00,"starttime":3530,"endtime":4378,"spoken":"\u3053\u307f\u3085\u306b\u3051\u30fc\u3057\u3087\u3093"},{"written":"\u3092","confidence":1.00,"starttime":4378,"endtime":4442,"spoken":"\u3092"},{"written":"\u5b9f\u73fe","confidence":1.00,"starttime":4442,"endtime":4922,"spoken":"\u3058\u3064\u3052\u3093"},{"written":"\u3057","confidence":1.00,"starttime":4922,"endtime":5434,"spoken":"\u3057"},{"written":"\u3001","confidence":0.45,"starttime":5434,"endtime":5562,"spoken":"_"},{"written":"\u8c4a\u304b","confidence":1.00,"starttime":5562,"endtime":5994,"spoken":"\u3086\u305f\u304b"},{"written":"\u306a","confidence":1.00,"starttime":5994,"endtime":6090,"spoken":"\u306a"},{"written":"\u672a\u6765","confidence":1.00,"starttime":6090,"endtime":6490,"spoken":"\u307f\u3089\u3044"},{"written":"\u3092","confidence":1.00,"starttime":6490,"endtime":6554,"spoken":"\u3092"},{"written":"\u5275\u9020","confidence":0.93,"starttime":6554,"endtime":7050,"spoken":"\u305d\u3046\u305e\u3046"},{"written":"\u3057\u3066","confidence":0.99,"starttime":7050,"endtime":7210,"spoken":"\u3057\u3066"},{"written":"\u3044\u304f","confidence":1.00,"starttime":7210,"endtime":7418,"spoken":"\u3044\u304f"},{"written":"\u3053\u3068","confidence":1.00,"starttime":7418,"endtime":7690,"spoken":"\u3053\u3068"},{"written":"\u3092","confidence":1.00,"starttime":7690,"endtime":7722,"spoken":"\u3092"},{"written":"\u76ee\u6307\u3057","confidence":0.76,"starttime":7722,"endtime":8090,"spoken":"\u3081\u3056\u3057"},{"written":"\u307e\u3059","confidence":0.76,"starttime":8090,"endtime":8506,"spoken":"\u307e\u3059"},{"written":"\u3002","confidence":0.82,"starttime":8506,"endtime":8794,"spoken":"_"}],"confidence":0.998,"starttime":250,"endtime":8794,"tags":[],"rulename":"","text":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2\u306f\u3001\u4eba\u3068\u6a5f\u68b0\u3068\u306e\u81ea\u7136\u306a\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u3092\u5b9f\u73fe\u3057\u3001\u8c4a\u304b\u306a\u672a\u6765\u3092\u5275\u9020\u3057\u3066\u3044\u304f\u3053\u3068\u3092\u76ee\u6307\u3057\u307e\u3059\u3002"}],"utteranceid":"20220602/14/018122d637320a301bc194c9_20220602_141433","text":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2\u306f\u3001\u4eba\u3068\u6a5f\u68b0\u3068\u306e\u81ea\u7136\u306a\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u3092\u5b9f\u73fe\u3057\u3001\u8c4a\u304b\u306a\u672a\u6765\u3092\u5275\u9020\u3057\u3066\u3044\u304f\u3053\u3068\u3092\u76ee\u6307\u3057\u307e\u3059\u3002","code":"","message":""}
The Japanese text in the recognition results is written as Unicode escape sequences (\uXXXX). Any JSON parser available in your programming language can decode it. Here, we'll use the jq command to convert it.
curl -F a=@test.wav "https://acp-api.amivoice.com/v1/recognize?d=-a-general&u={APP_KEY}" | jq
This time, the recognition results should be displayed as readable Japanese with indentation. Look for text in the results; it contains the transcription of the audio.
"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
Below is a complete example of the response. You can obtain not only the transcribed result but also information such as word-by-word results, audio timing, and confidence levels. For details, please see Speech Recognition Results.
Response
{
"results": [
{
"tokens": [
{
"written": "アドバンスト・メディア",
"confidence": 1,
"starttime": 522,
"endtime": 1578,
"spoken": "あどばんすとめでぃあ"
},
{
"written": "は",
"confidence": 1,
"starttime": 1578,
"endtime": 1866,
"spoken": "は"
},
{
"written": "、",
"confidence": 0.72,
"starttime": 1866,
"endtime": 2026,
"spoken": "_"
},
{
"written": "人",
"confidence": 1,
"starttime": 2026,
"endtime": 2314,
"spoken": "ひと"
},
{
"written": "と",
"confidence": 1,
"starttime": 2314,
"endtime": 2426,
"spoken": "と"
},
{
"written": "機械",
"confidence": 1,
"starttime": 2426,
"endtime": 2826,
"spoken": "きかい"
},
{
"written": "と",
"confidence": 1,
"starttime": 2826,
"endtime": 2938,
"spoken": "と"
},
{
"written": "の",
"confidence": 1,
"starttime": 2938,
"endtime": 3082,
"spoken": "の"
},
{
"written": "自然",
"confidence": 1,
"starttime": 3082,
"endtime": 3434,
"spoken": "しぜん"
},
{
"written": "な",
"confidence": 1,
"starttime": 3434,
"endtime": 3530,
"spoken": "な"
},
{
"written": "コミュニケーション",
"confidence": 1,
"starttime": 3530,
"endtime": 4378,
"spoken": "こみゅにけーしょん"
},
{
"written": "を",
"confidence": 1,
"starttime": 4378,
"endtime": 4442,
"spoken": "を"
},
{
"written": "実現",
"confidence": 1,
"starttime": 4442,
"endtime": 4922,
"spoken": "じつげん"
},
{
"written": "し",
"confidence": 1,
"starttime": 4922,
"endtime": 5434,
"spoken": "し"
},
{
"written": "、",
"confidence": 0.45,
"starttime": 5434,
"endtime": 5562,
"spoken": "_"
},
{
"written": "豊か",
"confidence": 1,
"starttime": 5562,
"endtime": 5994,
"spoken": "ゆたか"
},
{
"written": "な",
"confidence": 1,
"starttime": 5994,
"endtime": 6090,
"spoken": "な"
},
{
"written": "未来",
"confidence": 1,
"starttime": 6090,
"endtime": 6490,
"spoken": "みらい"
},
{
"written": "を",
"confidence": 1,
"starttime": 6490,
"endtime": 6554,
"spoken": "を"
},
{
"written": "創造",
"confidence": 0.93,
"starttime": 6554,
"endtime": 7050,
"spoken": "そうぞう"
},
{
"written": "して",
"confidence": 0.99,
"starttime": 7050,
"endtime": 7210,
"spoken": "して"
},
{
"written": "いく",
"confidence": 1,
"starttime": 7210,
"endtime": 7418,
"spoken": "いく"
},
{
"written": "こと",
"confidence": 1,
"starttime": 7418,
"endtime": 7690,
"spoken": "こと"
},
{
"written": "を",
"confidence": 1,
"starttime": 7690,
"endtime": 7722,
"spoken": "を"
},
{
"written": "目指し",
"confidence": 0.76,
"starttime": 7722,
"endtime": 8090,
"spoken": "めざし"
},
{
"written": "ます",
"confidence": 0.76,
"starttime": 8090,
"endtime": 8506,
"spoken": "ます"
},
{
"written": "。",
"confidence": 0.82,
"starttime": 8506,
"endtime": 8794,
"spoken": "_"
}
],
"confidence": 0.998,
"starttime": 250,
"endtime": 8794,
"tags": [],
"rulename": "",
"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
}
],
"utteranceid": "20220602/14/018122d65d370a30116494c8_20220602_141442",
"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。",
"code": "",
"message": ""
}
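The fields in this response can be pulled out with jq filters instead of being read by eye. The following is a minimal sketch, assuming jq 1.5 or later; `extract_text` and `low_confidence_tokens` are hypothetical helper names, and both read a response JSON from standard input.

```shell
#!/bin/sh
# extract_text (hypothetical helper): print only the top-level "text" field.
# -r prints the raw string, so Unicode escapes are decoded and the
# surrounding JSON quotes are removed.
extract_text() {
  jq -r '.text'
}

# low_confidence_tokens (hypothetical helper): print the written form,
# confidence, start time and end time (tab-separated) of every token whose
# confidence is below the threshold given as the first argument.
low_confidence_tokens() {
  jq -r --argjson min "$1" '
    .results[].tokens[]
    | select(.confidence < $min)
    | [.written, .confidence, .starttime, .endtime]
    | @tsv
  '
}

# Usage, with the response saved to a file named response.json:
#   extract_text < response.json
#   low_confidence_tokens 0.8 < response.json
```

With the sample response above and a threshold of 0.8, the second helper selects the four tokens with confidence 0.72, 0.45, 0.76, and 0.76.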
Changing the Speech Recognition Engine
The d=-a-general part of the request specifies the speech recognition engine. -a-general selects the 会話_汎用 (Conversation_General) engine.
For example, if you change it to -a-medgeneral, you can transcribe using the 会話_医療汎用 (Conversation_Medical General) engine, which handles medical terminology well. For a list of available engines, please see Speech Recognition Engines.
curl https://acp-api.amivoice.com/v1/recognize \
-F d=-a-medgeneral \
-F u={APP_KEY} \
-F a=@test.wav
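To see how the choice of engine affects the result, you can send the same file to several engines in turn. The following is a minimal sketch; `compare_engines` and the `AMIVOICE_APPKEY` environment variable are names assumed for this example, and each engine you pass must be available to your account.

```shell
#!/bin/sh
# compare_engines (hypothetical helper): send one audio file to each engine
# named after it and print each raw response under a heading.
# The APPKEY is read from AMIVOICE_APPKEY (a name assumed for this sketch).
compare_engines() {
  audio_file=$1
  shift
  if [ -z "${AMIVOICE_APPKEY:-}" ]; then
    echo "AMIVOICE_APPKEY is not set" >&2
    return 1
  fi
  for engine in "$@"; do
    echo "== $engine =="
    # -s hides curl's progress meter so only the response is printed.
    curl -s https://acp-api.amivoice.com/v1/recognize \
      -F d="$engine" \
      -F u="$AMIVOICE_APPKEY" \
      -F a=@"$audio_file"
    echo
  done
}

# Usage:
#   compare_engines test.wav -a-general -a-medgeneral
```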
Next Steps
- The User Guide explains the synchronous HTTP interface of AmiVoice API used here, along with the methods for transcribing speech using AmiVoice API.
- Within the User Guide, please see Request Parameters for the parameters that can be set in a request, Speech Recognition Results for response details, and Synchronous HTTP Interface for details of the interface used in this tutorial.
- For the API reference, please also see Synchronous HTTP Interface.
- We provide a client library (Hrp) that encapsulates the communication processing and procedures for using the HTTP interface, so you can create speech recognition applications by implementing only the necessary interfaces. First, try running the Sample Program HrpTester. For the interface specifications of the Hrp client library, please see Hrp (HTTP Interface Client) in the client library documentation.