Transcribing Long Audio Files
This tutorial explains, step by step, how to use the API to transcribe long audio files such as speeches from meetings and lectures, or recorded calls from call centers. Instead of writing a program, this tutorial uses the curl and jq commands for the explanation.
Preparation
To execute this tutorial, you will need the following:
- curl
- jq
- Register for AmiVoice API and obtain an APPKEY
- Prepare an audio file you want to transcribe
We use the jq command to format the results for better readability. If you don't have jq installed, you can still follow the tutorial and transcribe audio without it.
curl
Please check if the curl command is installed on your system.
curl -V
If the version is not displayed, please download the package for your OS from https://curl.se/ or use a package manager to install curl.
jq
Please check if the jq command is installed on your system.
jq --version
If the version is not displayed, please download the package for your OS from https://stedolan.github.io/jq/ or use a package manager to install jq.
Obtaining an APPKEY
- Register for AmiVoice API.
- Log in to your account page and record the APPKEY listed under [Common Connection Information] in the [Connection Information] tab.
The AmiVoice Tech Blog explains the steps to obtain an APPKEY in detail; please see Let's Try Using AmiVoice API.
Audio File
Prepare the audio file you want to transcribe.
Here, we will use the audio file test.wav included with the sample programs in the client library.
- When preparing an audio file, please be aware of the supported audio file formats (a quick way to check a local file is shown after this list). For supported formats, please see About Audio Formats.
- There are limitations on the length of audio files that can be accepted. Please see Limitations.
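If you are unsure what format a local file actually is, the file command (where available) gives a quick summary before you upload; ffprobe from FFmpeg works as well. This is only a convenience check and is not part of the API.
file test.wav
# Prints the detected container and encoding, for example
# "RIFF (little-endian) data, WAVE audio ..." for a PCM WAV file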
Execution
1. Speech Recognition Request
The parameters specified when making a speech recognition request are exactly the same as those for the synchronous HTTP interface.
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F d=-a-general \
-F u={APPKEY} \
-F a=@test.wav
If the request is successful, you will receive a response like the following. The request will be queued as a job and processed in order.
{"sessionid":"017c25ec12c00a304474a999","text":"..."}
2. Retrieving Job Status and Results
You can use the sessionid obtained from the request to get the job status (status) and results. You will need to repeat this request until the speech recognition results are available. Please specify the {APPKEY} in the Authorization header.
curl -H "Authorization: Bearer {APPKEY}" \
https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999
Immediately after sending the request, the status will be in the queued state.
{"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"queued"}
When the job is taken from the queue, the status will change to the started state.
{"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"started"}
When the actual speech recognition processing begins, the status will change to the processing state. You can use the size and MD5 checksum of the audio received by the API to verify that the audio you sent is being processed correctly. The time spent in the processing state depends on the length of the audio.
{"audio_md5":"40f59fe5fc7745c33b33af44be43f6ad","audio_size":306980,"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"processing"}
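The audio_size and audio_md5 fields can be compared against the local file, for example with wc -c < test.wav and md5sum test.wav, to confirm that the uploaded audio arrived intact. Because the job moves through these states one after another, the status request has to be repeated until it reports completed. Below is a minimal polling sketch for bash, assuming jq is installed and that {APPKEY} and the session ID are replaced with your own values; the 10-second interval is arbitrary, and the check for an error status assumes the API reports failed jobs that way.
SESSION_ID=017c25ec12c00a304474a999
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer {APPKEY}" \
    https://acp-api-async.amivoice.com/v1/recognitions/${SESSION_ID} | jq -r .status)
  echo "status: ${STATUS}"
  # Stop polling once the job reaches a terminal state
  if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "error" ]; then
    break
  fi
  sleep 10
done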
Results
When speech recognition is complete, the status will change to the completed state. At this point, you can obtain the speech recognition results in results. To format the results for better readability, we use the jq command.
curl -H "Authorization: Bearer {APPKEY}" \
https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999 | jq
Below is a complete example of the response. You can obtain not only the transcribed text but also word-level results, audio timing, and confidence scores. For details, please see Speech Recognition Results.
Response
{
"audio_md5": "40f59fe5fc7745c33b33af44be43f6ad",
"audio_size": 306980,
"results": {
"code": "",
"message": "",
"segments": [
{
"code": "",
"message": "",
"results": [
{
"confidence": 1.0,
"endtime": 8778,
"rulename": "",
"starttime": 250,
"tags": [],
"text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。",
"tokens": [
{
"confidence": 1.0,
"endtime": 1578,
"spoken": "あどばんすとめでぃあ",
"starttime": 570,
"written": "アドバンスト・メディア"
},
{
"confidence": 1.0,
"endtime": 1850,
"spoken": "は",
"starttime": 1578,
"written": "は"
},
{
"confidence": 0.77,
"endtime": 2010,
"spoken": "_",
"starttime": 1850,
"written": "、"
},
{
"confidence": 1.0,
"endtime": 2314,
"spoken": "ひと",
"starttime": 2010,
"written": "人"
},
{
"confidence": 1.0,
"endtime": 2426,
"spoken": "と",
"starttime": 2314,
"written": "と"
},
{
"confidence": 1.0,
"endtime": 2826,
"spoken": "きかい",
"starttime": 2426,
"written": "機械"
},
{
"confidence": 0.76,
"endtime": 2922,
"spoken": "とう",
"starttime": 2826,
"written": "等"
},
{
"confidence": 1.0,
"endtime": 3082,
"spoken": "の",
"starttime": 2922,
"written": "の"
},
{
"confidence": 1.0,
"endtime": 3434,
"spoken": "しぜん",
"starttime": 3082,
"written": "自然"
},
{
"confidence": 1.0,
"endtime": 3530,
"spoken": "な",
"starttime": 3434,
"written": "な"
},
{
"confidence": 1.0,
"endtime": 4362,
"spoken": "こみゅにけーしょん",
"starttime": 3530,
"written": "コミュニケーション"
},
{
"confidence": 1.0,
"endtime": 4442,
"spoken": "を",
"starttime": 4362,
"written": "を"
},
{
"confidence": 1.0,
"endtime": 4906,
"spoken": "じつげん",
"starttime": 4442,
"written": "実現"
},
{
"confidence": 1.0,
"endtime": 5242,
"spoken": "し",
"starttime": 4906,
"written": "し"
},
{
"confidence": 0.83,
"endtime": 5642,
"spoken": "_",
"starttime": 5242,
"written": "、"
},
{
"confidence": 1.0,
"endtime": 5978,
"spoken": "ゆたか",
"starttime": 5642,
"written": "豊か"
},
{
"confidence": 1.0,
"endtime": 6090,
"spoken": "な",
"starttime": 5978,
"written": "な"
},
{
"confidence": 1.0,
"endtime": 6490,
"spoken": "みらい",
"starttime": 6090,
"written": "未来"
},
{
"confidence": 1.0,
"endtime": 6554,
"spoken": "を",
"starttime": 6490,
"written": "を"
},
{
"confidence": 0.92,
"endtime": 7034,
"spoken": "そうぞう",
"starttime": 6554,
"written": "創造"
},
{
"confidence": 1.0,
"endtime": 7210,
"spoken": "して",
"starttime": 7034,
"written": "して"
},
{
"confidence": 1.0,
"endtime": 7402,
"spoken": "いく",
"starttime": 7210,
"written": "いく"
},
{
"confidence": 0.8,
"endtime": 7674,
"spoken": "こと",
"starttime": 7402,
"written": "こと"
},
{
"confidence": 1.0,
"endtime": 7706,
"spoken": "を",
"starttime": 7674,
"written": "を"
},
{
"confidence": 0.78,
"endtime": 7962,
"spoken": "めざ",
"starttime": 7706,
"written": "めざ"
},
{
"confidence": 0.78,
"endtime": 8490,
"spoken": "します",
"starttime": 7962,
"written": "します"
},
{
"confidence": 0.83,
"endtime": 8778,
"spoken": "_",
"starttime": 8490,
"written": "。"
}
]
}
],
"text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。"
}
],
"text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。",
"utteranceid": "20210927/06/017c25ed38cc0a30425239d0_20210927_062436[nolog]"
},
"service_id": "{YOUR_SERVICE_ID}",
"session_id": "017c25ec12c00a304474a999",
"status": "completed"
}
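As an example of working with this response, jq can also be used to extract individual fields. The sketch below saves the response to a file (the name result.json is arbitrary) and then pulls out the full transcribed text and each token's written form with its confidence score; the filters use the field names shown in the response above.
curl -s -H "Authorization: Bearer {APPKEY}" \
https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999 > result.json
# Full transcribed text
jq -r '.results.text' result.json
# Word-level results: written form and confidence for each token
jq '.results.segments[].results[].tokens[] | {written, confidence}' result.json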
Next Steps
- The methods for transcribing audio using AmiVoice API, including the asynchronous HTTP interface used here, are explained in the User Guide.
- Within the User Guide, please see Request Parameters for details on parameters that can be set during requests, Speech Recognition Results for response details, and Asynchronous HTTP Interface for information about AmiVoice API's asynchronous HTTP interface.
- For the API reference, please see Asynchronous HTTP Interface.