Transcribing Long Audio Files

This tutorial explains, step by step, how to use the API to transcribe long audio files such as speeches from meetings and lectures or recorded calls from call centers. Instead of writing a program, it walks through the process with the curl and jq commands.

Preparation

To execute this tutorial, you will need the following:

  • curl
  • jq
  • An APPKEY obtained by registering for AmiVoice API
  • An audio file you want to transcribe
note

We use the jq command to format the results for better readability. If you don't have jq installed, you can still follow the tutorial and transcribe audio; the output will simply not be pretty-printed.

curl

Please check if the curl command is installed on your system.

curl -V

If the version is not displayed, please download the package for your OS from https://curl.se/ or use a package manager to install curl.

jq

Please check if the jq command is installed on your system.

jq --version

If the version is not displayed, please download the package for your OS from https://stedolan.github.io/jq/ or use a package manager to install jq.

Obtaining an APPKEY

  1. Register for AmiVoice API.
  2. Log in to your account page and record the APPKEY listed under [Common Connection Information] in the [Connection Information] tab.
tip

The AmiVoice Tech Blog article Let's Try Using AmiVoice API explains the steps to obtain an APPKEY in detail.

Audio File

Prepare the audio file you want to transcribe. Here, we will use the audio file test.wav included with the sample programs in the client library.
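
If your own recording is not in a format the API accepts (see the note below), a tool such as ffmpeg can usually convert it. The command below is only a sketch: it assumes a hypothetical input file recording.mp3 and a 16 kHz, 16-bit, mono WAV target, so check About Audio Formats for the formats that are actually supported.

# Convert recording.mp3 (hypothetical input) to 16 kHz, 16-bit, mono PCM WAV
ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le converted.wav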

note
  • When preparing an audio file, please be aware of the supported audio file formats. For supported formats, please see About Audio Formats.
  • There are limitations on the length of audio files that can be accepted. Please see Limitations.

Execution

1. Speech Recognition Request

The parameters specified when making a speech recognition request are exactly the same as those for the synchronous HTTP interface.

curl https://acp-api-async.amivoice.com/v1/recognitions \
-F d=-a-general \
-F u={APPKEY} \
-F a=@test.wav

If the request is successful, you will receive a response like the following. The request will be queued as a job and processed in order.

{"sessionid":"017c25ec12c00a304474a999","text":"..."}

2. Retrieving Job Status and Results

You can use the sessionid obtained from the request to get the job status (status) and results. You will need to repeat this request until the speech recognition results are available. Please specify your {APPKEY} in the Authorization header as a Bearer token.

curl -H "Authorization: Bearer {APPKEY}" \
https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999
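
Because the job runs asynchronously, you will normally poll this endpoint until the status becomes completed. Below is a minimal polling sketch in shell, assuming jq is installed and the session ID is stored in the SESSION_ID variable from the previous sketch (or substitute the value directly):

while true; do
  # Fetch the job and extract only the status field
  STATUS=$(curl -s -H "Authorization: Bearer {APPKEY}" \
    https://acp-api-async.amivoice.com/v1/recognitions/$SESSION_ID | jq -r .status)
  echo "status: $STATUS"
  # Error handling is omitted here; see the API reference for other possible states.
  if [ "$STATUS" = "completed" ]; then
    break
  fi
  sleep 10  # wait a little before polling again
done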

Immediately after sending the request, the status will be in the queued state.

{"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"queued"}

When the job is taken from the queue, the status will change to the started state.

{"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"started"}

When the actual speech recognition processing begins, the status will change to the processing state. The response also reports the size and MD5 checksum of the audio received by the API, which you can use to verify that the audio you sent arrived intact. The time spent in the processing state depends on the length of the audio.

{"audio_md5":"40f59fe5fc7745c33b33af44be43f6ad","audio_size":306980,"service_id":"{YOUR_SERVICE_ID}","session_id":"017c25ec12c00a304474a999","status":"processing"}
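
As a quick check, you can compare these values with your local file, for example using md5sum and wc -c (on macOS, use md5 instead of md5sum):

md5sum test.wav    # should match audio_md5
wc -c < test.wav   # should match audio_size (bytes)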

Results

When speech recognition is complete, the status will change to the completed state. At this point, you can obtain the speech recognition results from the results field. To format the results for better readability, we pipe the response through the jq command.

curl -H "Authorization: Bearer {APPKEY}" \
https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999 | jq

Below is a complete example of the response. You can obtain not only the transcribed text but also word-level results, audio timing, and confidence scores. For details, please see Speech Recognition Results.

Response
{
  "audio_md5": "40f59fe5fc7745c33b33af44be43f6ad",
  "audio_size": 306980,
  "results": {
    "code": "",
    "message": "",
    "segments": [
      {
        "code": "",
        "message": "",
        "results": [
          {
            "confidence": 1.0,
            "endtime": 8778,
            "rulename": "",
            "starttime": 250,
            "tags": [],
            "text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。",
            "tokens": [
              {
                "confidence": 1.0,
                "endtime": 1578,
                "spoken": "あどばんすとめでぃあ",
                "starttime": 570,
                "written": "アドバンスト・メディア"
              },
              {
                "confidence": 1.0,
                "endtime": 1850,
                "spoken": "は",
                "starttime": 1578,
                "written": "は"
              },
              {
                "confidence": 0.77,
                "endtime": 2010,
                "spoken": "_",
                "starttime": 1850,
                "written": "、"
              },
              {
                "confidence": 1.0,
                "endtime": 2314,
                "spoken": "ひと",
                "starttime": 2010,
                "written": "人"
              },
              {
                "confidence": 1.0,
                "endtime": 2426,
                "spoken": "と",
                "starttime": 2314,
                "written": "と"
              },
              {
                "confidence": 1.0,
                "endtime": 2826,
                "spoken": "きかい",
                "starttime": 2426,
                "written": "機械"
              },
              {
                "confidence": 0.76,
                "endtime": 2922,
                "spoken": "とう",
                "starttime": 2826,
                "written": "等"
              },
              {
                "confidence": 1.0,
                "endtime": 3082,
                "spoken": "の",
                "starttime": 2922,
                "written": "の"
              },
              {
                "confidence": 1.0,
                "endtime": 3434,
                "spoken": "しぜん",
                "starttime": 3082,
                "written": "自然"
              },
              {
                "confidence": 1.0,
                "endtime": 3530,
                "spoken": "な",
                "starttime": 3434,
                "written": "な"
              },
              {
                "confidence": 1.0,
                "endtime": 4362,
                "spoken": "こみゅにけーしょん",
                "starttime": 3530,
                "written": "コミュニケーション"
              },
              {
                "confidence": 1.0,
                "endtime": 4442,
                "spoken": "を",
                "starttime": 4362,
                "written": "を"
              },
              {
                "confidence": 1.0,
                "endtime": 4906,
                "spoken": "じつげん",
                "starttime": 4442,
                "written": "実現"
              },
              {
                "confidence": 1.0,
                "endtime": 5242,
                "spoken": "し",
                "starttime": 4906,
                "written": "し"
              },
              {
                "confidence": 0.83,
                "endtime": 5642,
                "spoken": "_",
                "starttime": 5242,
                "written": "、"
              },
              {
                "confidence": 1.0,
                "endtime": 5978,
                "spoken": "ゆたか",
                "starttime": 5642,
                "written": "豊か"
              },
              {
                "confidence": 1.0,
                "endtime": 6090,
                "spoken": "な",
                "starttime": 5978,
                "written": "な"
              },
              {
                "confidence": 1.0,
                "endtime": 6490,
                "spoken": "みらい",
                "starttime": 6090,
                "written": "未来"
              },
              {
                "confidence": 1.0,
                "endtime": 6554,
                "spoken": "を",
                "starttime": 6490,
                "written": "を"
              },
              {
                "confidence": 0.92,
                "endtime": 7034,
                "spoken": "そうぞう",
                "starttime": 6554,
                "written": "創造"
              },
              {
                "confidence": 1.0,
                "endtime": 7210,
                "spoken": "して",
                "starttime": 7034,
                "written": "して"
              },
              {
                "confidence": 1.0,
                "endtime": 7402,
                "spoken": "いく",
                "starttime": 7210,
                "written": "いく"
              },
              {
                "confidence": 0.8,
                "endtime": 7674,
                "spoken": "こと",
                "starttime": 7402,
                "written": "こと"
              },
              {
                "confidence": 1.0,
                "endtime": 7706,
                "spoken": "を",
                "starttime": 7674,
                "written": "を"
              },
              {
                "confidence": 0.78,
                "endtime": 7962,
                "spoken": "めざ",
                "starttime": 7706,
                "written": "めざ"
              },
              {
                "confidence": 0.78,
                "endtime": 8490,
                "spoken": "します",
                "starttime": 7962,
                "written": "します"
              },
              {
                "confidence": 0.83,
                "endtime": 8778,
                "spoken": "_",
                "starttime": 8490,
                "written": "。"
              }
            ]
          }
        ],
        "text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。"
      }
    ],
    "text": "アドバンスト・メディアは、人と機械等の自然なコミュニケーションを実現し、豊かな未来を創造していくことをめざします。",
    "utteranceid": "20210927/06/017c25ed38cc0a30425239d0_20210927_062436[nolog]"
  },
  "service_id": "{YOUR_SERVICE_ID}",
  "session_id": "017c25ec12c00a304474a999",
  "status": "completed"
}
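
Once the job is completed, jq also makes it easy to pull out individual fields. As a sketch, the first command below prints only the recognized text, and the second lists each word's written form with its start and end times (field names as in the response above):

curl -s -H "Authorization: Bearer {APPKEY}" \
  https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999 \
  | jq -r '.results.text'

curl -s -H "Authorization: Bearer {APPKEY}" \
  https://acp-api-async.amivoice.com/v1/recognitions/017c25ec12c00a304474a999 \
  | jq -r '.results.segments[].results[].tokens[] | "\(.starttime)\t\(.endtime)\t\(.written)"'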

Next Steps