Transcribing Short Audio Files

You can convert short audio files (under 16MB) to text by sending them to the endpoint of the AmiVoice API's HTTP interface. Instead of writing a program, this tutorial shows how to call the API with the curl and jq commands. For long audio files, please see the next tutorial, "Transcribing Long Audio Files".

Preparation

To follow this tutorial, you'll need the following:

curl
jq
An APPKEY, obtained by registering for AmiVoice API
An audio file you want to transcribe

note

We use the jq command to format the results for better readability. jq is optional: if you don't have it installed, you can still transcribe audio in this tutorial.

curl

Please check if the curl command is installed on your system.

curl -V

If the version is not displayed, please download the package for your OS from https://curl.se/ or use a package manager to install curl.

jq

Please check if the jq command is installed on your system.

jq -V

If the version is not displayed, please download the package for your OS from https://stedolan.github.io/jq/ or use a package manager to install jq.
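Both checks can also be scripted. The following is a minimal sketch, assuming a POSIX shell; check_tools is a hypothetical helper name chosen for this example and is not part of curl, jq, or AmiVoice API.

```shell
#!/bin/sh
# check_tools: hypothetical helper that reports which of the given
# commands can be found on PATH. Returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# curl is required; jq is only used to format the results.
check_tools curl jq || echo "Install the missing tools before continuing."
```

This only confirms that the commands exist; it does not check their versions.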

Obtaining an APPKEY

tip

The AmiVoice Tech Blog explains the steps to obtain an APPKEY in detail. See also Let's Try Using AmiVoice API.

Audio File

Prepare an audio file you want to transcribe. Here, we'll use the audio file (test.wav) included with the sample program in the client library.

note

When preparing an audio file, please pay attention to the supported audio file formats. For supported formats, please see About Audio Formats.
If you want to send an audio file larger than 16MB, please see the next tutorial, Transcribing Long Audio Files.
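Before sending a file, you may want to confirm that it fits under the size limit. The sketch below assumes that "16MB" means 16 × 1024 × 1024 bytes, which this tutorial does not state explicitly; fits_short_limit and MAX_BYTES are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Assumption: "16MB" is read as 16 * 1024 * 1024 bytes; the exact
# byte limit is not stated in this tutorial.
MAX_BYTES=$((16 * 1024 * 1024))

# fits_short_limit: hypothetical helper. Succeeds when FILE exists and
# is smaller than the limit (second argument, defaulting to MAX_BYTES).
fits_short_limit() {
    file=$1
    limit=${2:-$MAX_BYTES}
    [ -f "$file" ] || return 2
    # Arithmetic expansion strips the padding some wc versions print.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -lt "$limit" ]
}

# Example: decide whether this tutorial applies to test.wav.
if fits_short_limit test.wav; then
    echo "test.wav is under the limit"
else
    echo "test.wav is missing or too large for this tutorial"
fi
```

The check looks only at file size; it says nothing about whether the audio format is supported.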

Execution

Open a terminal, then copy and run the following command. Replace test.wav with the path to your prepared audio file, and replace {APP_KEY} with your own APPKEY.

curl https://acp-api.amivoice.com/v1/recognize \
     -F d=-a-general \
     -F u={APP_KEY} \
     -F a=@test.wav
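If you run the command often, you may prefer not to paste the APPKEY each time. The following sketch wraps the same request in a shell function. The function name recognize and the environment variable AMIVOICE_APPKEY are assumptions made for this example; neither is defined by AmiVoice API.

```shell
#!/bin/sh
# recognize: hypothetical wrapper around the curl command shown above.
# Assumes the APPKEY is stored in the AMIVOICE_APPKEY environment
# variable (a name chosen for this sketch, not defined by the API).
recognize() {
    audio=$1
    if [ -z "${AMIVOICE_APPKEY:-}" ]; then
        echo "AMIVOICE_APPKEY is not set" >&2
        return 1
    fi
    # Same endpoint and form fields as the tutorial command:
    #   d = engine, u = APPKEY, a = audio file.
    # -sS hides the progress meter but still prints errors.
    curl -sS https://acp-api.amivoice.com/v1/recognize \
         -F d=-a-general \
         -F "u=${AMIVOICE_APPKEY}" \
         -F "a=@${audio}"
}
```

After exporting AMIVOICE_APPKEY in your shell, running recognize test.wav > result.json would save the raw response to a file for later inspection.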

Results

If the execution is successful, you'll get results in JSON format like this:

{"results":[{"tokens":[{"written":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2","confidence":1.00,"starttime":522,"endtime":1578,"spoken":"\u3042\u3069\u3070\u3093\u3059\u3068\u3081\u3067\u3043\u3042"},{"written":"\u306f","confidence":1.00,"starttime":1578,"endtime":1866,"spoken":"\u306f"},{"written":"\u3001","confidence":0.72,"starttime":1866,"endtime":2026,"spoken":"_"},{"written":"\u4eba","confidence":1.00,"starttime":2026,"endtime":2314,"spoken":"\u3072\u3068"},{"written":"\u3068","confidence":1.00,"starttime":2314,"endtime":2426,"spoken":"\u3068"},{"written":"\u6a5f\u68b0","confidence":1.00,"starttime":2426,"endtime":2826,"spoken":"\u304d\u304b\u3044"},{"written":"\u3068","confidence":1.00,"starttime":2826,"endtime":2938,"spoken":"\u3068"},{"written":"\u306e","confidence":1.00,"starttime":2938,"endtime":3082,"spoken":"\u306e"},{"written":"\u81ea\u7136","confidence":1.00,"starttime":3082,"endtime":3434,"spoken":"\u3057\u305c\u3093"},{"written":"\u306a","confidence":1.00,"starttime":3434,"endtime":3530,"spoken":"\u306a"},{"written":"\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3","confidence":1.00,"starttime":3530,"endtime":4378,"spoken":"\u3053\u307f\u3085\u306b\u3051\u30fc\u3057\u3087\u3093"},{"written":"\u3092","confidence":1.00,"starttime":4378,"endtime":4442,"spoken":"\u3092"},{"written":"\u5b9f\u73fe","confidence":1.00,"starttime":4442,"endtime":4922,"spoken":"\u3058\u3064\u3052\u3093"},{"written":"\u3057","confidence":1.00,"starttime":4922,"endtime":5434,"spoken":"\u3057"},{"written":"\u3001","confidence":0.45,"starttime":5434,"endtime":5562,"spoken":"_"},{"written":"\u8c4a\u304b","confidence":1.00,"starttime":5562,"endtime":5994,"spoken":"\u3086\u305f\u304b"},{"written":"\u306a","confidence":1.00,"starttime":5994,"endtime":6090,"spoken":"\u306a"},{"written":"\u672a\u6765","confidence":1.00,"starttime":6090,"endtime":6490,"spoken":"\u307f\u3089\u3044"},{"written":"\u3092","confidence":1.00,"starttime":6490,"endtime":6554,"spoken":"\u3092"},{"written":"\u5275\u9020","confidence":0.93,"starttime":6554,"endtime":7050,"spoken":"\u305d\u3046\u305e\u3046"},{"written":"\u3057\u3066","confidence":0.99,"starttime":7050,"endtime":7210,"spoken":"\u3057\u3066"},{"written":"\u3044\u304f","confidence":1.00,"starttime":7210,"endtime":7418,"spoken":"\u3044\u304f"},{"written":"\u3053\u3068","confidence":1.00,"starttime":7418,"endtime":7690,"spoken":"\u3053\u3068"},{"written":"\u3092","confidence":1.00,"starttime":7690,"endtime":7722,"spoken":"\u3092"},{"written":"\u76ee\u6307\u3057","confidence":0.76,"starttime":7722,"endtime":8090,"spoken":"\u3081\u3056\u3057"},{"written":"\u307e\u3059","confidence":0.76,"starttime":8090,"endtime":8506,"spoken":"\u307e\u3059"},{"written":"\u3002","confidence":0.82,"starttime":8506,"endtime":8794,"spoken":"_"}],"confidence":0.998,"starttime":250,"endtime":8794,"tags":[],"rulename":"","text":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2\u306f\u3001\u4eba\u3068\u6a5f\u68b0\u3068\u306e\u81ea\u7136\u306a\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u3092\u5b9f\u73fe\u3057\u3001\u8c4a\u304b\u306a\u672a\u6765\u3092\u5275\u9020\u3057\u3066\u3044\u304f\u3053\u3068\u3092\u76ee\u6307\u3057\u307e\u3059\u3002"}],"utteranceid":"20220602/14/018122d637320a301bc194c9_20220602_141433","text":"\u30a2\u30c9\u30d0\u30f3\u30b9\u30c8\u30fb\u30e1\u30c7\u30a3\u30a2\u306f\u3001\u4eba\u3068\u6a5f\u68b0\u3068\u306e\u81ea\u7136\u306a\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u3092\u5b9f\u73fe\u3057\u3001\u8c4a\u304b\u306a\u672a\u6765\u3092\u5275\u9020\u3057\u3066\u3044\u304f\u3053\u3068\u3092\u76ee\u6307\u3057\u307e\u3059\u3002","code":"","message":""}

The Japanese text in the recognition results is Unicode-escaped (written as \uXXXX sequences inside the JSON strings). Any standard JSON parser in your development language will decode it back into readable characters. Here, we'll use the jq command to convert it.

curl -F a=@test.wav "https://acp-api.amivoice.com/v1/recognize?d=-a-general&u={APP_KEY}" | jq

This time, the recognition results should be displayed with indentation and readable Japanese. Look for the text field in the results; it holds the transcription of the audio.

"text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"

Below is a complete example of the response. You can obtain not only the transcribed result but also information such as word-level results, audio timing, and confidence scores. For details, please see Speech Recognition Results.

Response

{
  "results": [
    {
      "tokens": [
        {
          "written": "アドバンスト・メディア",
          "confidence": 1,
          "starttime": 522,
          "endtime": 1578,
          "spoken": "あどばんすとめでぃあ"
        },
        {
          "written": "は",
          "confidence": 1,
          "starttime": 1578,
          "endtime": 1866,
          "spoken": "は"
        },
        {
          "written": "、",
          "confidence": 0.72,
          "starttime": 1866,
          "endtime": 2026,
          "spoken": "_"
        },
        {
          "written": "人",
          "confidence": 1,
          "starttime": 2026,
          "endtime": 2314,
          "spoken": "ひと"
        },
        {
          "written": "と",
          "confidence": 1,
          "starttime": 2314,
          "endtime": 2426,
          "spoken": "と"
        },
        {
          "written": "機械",
          "confidence": 1,
          "starttime": 2426,
          "endtime": 2826,
          "spoken": "きかい"
        },
        {
          "written": "と",
          "confidence": 1,
          "starttime": 2826,
          "endtime": 2938,
          "spoken": "と"
        },
        {
          "written": "の",
          "confidence": 1,
          "starttime": 2938,
          "endtime": 3082,
          "spoken": "の"
        },
        {
          "written": "自然",
          "confidence": 1,
          "starttime": 3082,
          "endtime": 3434,
          "spoken": "しぜん"
        },
        {
          "written": "な",
          "confidence": 1,
          "starttime": 3434,
          "endtime": 3530,
          "spoken": "な"
        },
        {
          "written": "コミュニケーション",
          "confidence": 1,
          "starttime": 3530,
          "endtime": 4378,
          "spoken": "こみゅにけーしょん"
        },
        {
          "written": "を",
          "confidence": 1,
          "starttime": 4378,
          "endtime": 4442,
          "spoken": "を"
        },
        {
          "written": "実現",
          "confidence": 1,
          "starttime": 4442,
          "endtime": 4922,
          "spoken": "じつげん"
        },
        {
          "written": "し",
          "confidence": 1,
          "starttime": 4922,
          "endtime": 5434,
          "spoken": "し"
        },
        {
          "written": "、",
          "confidence": 0.45,
          "starttime": 5434,
          "endtime": 5562,
          "spoken": "_"
        },
        {
          "written": "豊か",
          "confidence": 1,
          "starttime": 5562,
          "endtime": 5994,
          "spoken": "ゆたか"
        },
        {
          "written": "な",
          "confidence": 1,
          "starttime": 5994,
          "endtime": 6090,
          "spoken": "な"
        },
        {
          "written": "未来",
          "confidence": 1,
          "starttime": 6090,
          "endtime": 6490,
          "spoken": "みらい"
        },
        {
          "written": "を",
          "confidence": 1,
          "starttime": 6490,
          "endtime": 6554,
          "spoken": "を"
        },
        {
          "written": "創造",
          "confidence": 0.93,
          "starttime": 6554,
          "endtime": 7050,
          "spoken": "そうぞう"
        },
        {
          "written": "して",
          "confidence": 0.99,
          "starttime": 7050,
          "endtime": 7210,
          "spoken": "して"
        },
        {
          "written": "いく",
          "confidence": 1,
          "starttime": 7210,
          "endtime": 7418,
          "spoken": "いく"
        },
        {
          "written": "こと",
          "confidence": 1,
          "starttime": 7418,
          "endtime": 7690,
          "spoken": "こと"
        },
        {
          "written": "を",
          "confidence": 1,
          "starttime": 7690,
          "endtime": 7722,
          "spoken": "を"
        },
        {
          "written": "目指し",
          "confidence": 0.76,
          "starttime": 7722,
          "endtime": 8090,
          "spoken": "めざし"
        },
        {
          "written": "ます",
          "confidence": 0.76,
          "starttime": 8090,
          "endtime": 8506,
          "spoken": "ます"
        },
        {
          "written": "。",
          "confidence": 0.82,
          "starttime": 8506,
          "endtime": 8794,
          "spoken": "_"
        }
      ],
      "confidence": 0.998,
      "starttime": 250,
      "endtime": 8794,
      "tags": [],
      "rulename": "",
      "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
    }
  ],
  "utteranceid": "20220602/14/018122d65d370a30116494c8_20220602_141442",
  "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。",
  "code": "",
  "message": ""
}
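Once jq is part of the pipeline, it can also pick out individual fields instead of printing the whole response. The sketch below runs against a hand-written sample that reuses three tokens from the response above (with the top-level text shortened to match), so it works without calling the API. The 0.8 confidence threshold is an arbitrary value chosen for illustration.

```shell
#!/bin/sh
# Hand-written sample: three tokens copied from the response above,
# with the top-level text shortened to match.
response='{"results":[{"tokens":[
  {"written":"人","confidence":1,"starttime":2026,"endtime":2314,"spoken":"ひと"},
  {"written":"と","confidence":1,"starttime":2314,"endtime":2426,"spoken":"と"},
  {"written":"、","confidence":0.45,"starttime":5434,"endtime":5562,"spoken":"_"}
]}],"text":"人と、","code":"","message":""}'

# Transcript only: -r prints the string without surrounding quotes.
printf '%s\n' "$response" | jq -r '.text'

# One line per token: written form, start, end, confidence (tab-separated).
printf '%s\n' "$response" |
    jq -r '.results[].tokens[] | [.written, .starttime, .endtime, .confidence] | @tsv'

# Only the tokens whose confidence is below the 0.8 threshold.
printf '%s\n' "$response" |
    jq -r '.results[].tokens[] | select(.confidence < 0.8) | .written'
```

With a real request, the same filters can be appended to the curl pipeline in place of the bare jq command.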

Changing the Speech Recognition Engine

The d=-a-general request parameter specifies the speech recognition engine. -a-general selects the 会話_汎用 (Conversation_General) engine. For example, if you change it to -a-medical, you can transcribe using the 会話_医療 (Conversation_Medical) engine, which is strong in medical terminology. For a list of available engines, please see Speech Recognition Engines.

curl https://acp-api.amivoice.com/v1/recognize \
     -F d=-a-medical \
     -F u={APP_KEY} \
     -F a=@test.wav
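To compare how different engines transcribe the same file, the engine name can be turned into an argument. This is a rough sketch under stated assumptions: recognize_with and compare_engines are hypothetical helper names, AMIVOICE_APPKEY is an environment variable name chosen for this example, and jq must be installed. Whether a given engine is available to your account is not checked.

```shell
#!/bin/sh
# recognize_with: hypothetical helper. The first argument is the engine
# name sent as d, the second is the audio file. The APPKEY is assumed
# to be in the AMIVOICE_APPKEY environment variable.
recognize_with() {
    engine=$1
    audio=$2
    curl -sS https://acp-api.amivoice.com/v1/recognize \
         -F "d=${engine}" \
         -F "u=${AMIVOICE_APPKEY}" \
         -F "a=@${audio}"
}

# compare_engines: for each engine given after the audio file, print
# the engine name, a tab, and the transcript from the text field.
compare_engines() {
    audio=$1
    shift
    for engine in "$@"; do
        printf '%s\t' "$engine"
        recognize_with "$engine" "$audio" | jq -r '.text'
    done
}
```

For example, compare_engines test.wav -a-general -a-medical would send the file once per engine and print the two transcripts on separate lines.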

Next Steps

The methods for transcribing audio using AmiVoice API, including the synchronous HTTP interface used here, are explained in the User Guide.
Within the User Guide, please see Request Parameters for details on the parameters that can be set on a request, Speech Recognition Results for response details, and Synchronous HTTP Interface for information about AmiVoice API's synchronous HTTP interface.
For the API reference, please also see Synchronous HTTP Interface.
We provide a client library (Hrp) that encapsulates the communication processing and procedures for using the HTTP interface, allowing you to easily create speech recognition applications by simply implementing the necessary interfaces. First, try running the Sample Program HrpTester. For the interface specifications of the Hrp client library, please see Hrp (HTTP Interface Client) in the client library documentation.
