
How to Use the Real-time Speech Recognition Library Wrp

The Wrp library allows you to develop real-time applications using the WebSocket interface of AmiVoice API with an interface similar to AmiVoice SDK. You can send streaming audio and receive results sequentially. This library is available in languages such as Java, C#, C++, Python, and PHP.

Overview of Client Program

The flow of a program using Wrp is as follows:

Figure. Commands and Events

Methods

The client program performs the following processes in order. The corresponding Wrp methods are listed in parentheses.

  1. Connect (connect)
  2. Request speech recognition (feedDataResume)
  3. Send audio data (feedData)
  4. End audio data transmission (feedDataPause)
  5. Disconnect (disconnect)
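Put together, these five steps form the skeleton of a client program. The following is a sketch assembled from the excerpts later in this document; variables such as serverURL, codec, grammarFileNames, authorization, and audioData are assumed to be prepared by the caller, and error handling is omitted:

```java
// Sketch only: assumes the AmiVoice client library is on the classpath and
// that serverURL, codec, grammarFileNames, authorization, and audioData exist.
com.amivoice.wrp.Wrp wrp = com.amivoice.wrp.Wrp.construct();
wrp.setServerURL(serverURL);
wrp.setCodec(codec);
wrp.setGrammarFileNames(grammarFileNames);
wrp.setAuthorization(authorization);
if (wrp.connect()) {                                  // 1. Connect
	if (wrp.feedDataResume()) {                       // 2. Request speech recognition
		wrp.feedData(audioData, 0, audioData.length); // 3. Send audio data
		wrp.feedDataPause();                          // 4. End transmission, wait for results
	}
	wrp.disconnect();                                 // 5. Disconnect
}
```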

Events

Notification events from the server, for both speech detection and speech recognition, are received through methods of the listener class. There are five events; implement the Wrp listener method shown for each:

  • Start of speech detected: utteranceStarted(startTime)
  • End of speech detected: utteranceEnded(endTime)
  • Speech recognition processing started: resultCreated()
  • Intermediate recognition result notification: resultUpdated(result)
  • Final recognition result notification: resultFinalized(result)

Be sure to implement resultFinalized(result), the final recognition result notification event. For the other events, implement handling as needed based on the notifications from the server.

Implementation Guide

We will explain how to use Wrp step by step while showing samples for each language.

The code examples shown below are all excerpts from WrpSimpleTester, published in the GitHub repository advanced-media-inc/amivoice-api-client-library. For the complete code, see the source files in that repository.

For explanations on execution methods and file structures, please see the client library sample program WrpSimpleTester.

1. Initialization

Create an instance of the Wrp class.

// Initialize WebSocket speech recognition server
com.amivoice.wrp.Wrp wrp = com.amivoice.wrp.Wrp.construct();

2. Implementing the Listener Class

Implement event handlers by inheriting the com.amivoice.wrp.WrpListener class.

The speech recognition results are passed in the result argument of resultFinalized. For details on the format, see the speech recognition result format in the WebSocket interface documentation. Note that the recognition result text is UTF-8 encoded and Unicode-escaped; see also About Result Text.

In the code below, each of utteranceStarted, utteranceEnded, resultCreated, resultUpdated, and resultFinalized logs to standard output. Register the listener instance that implements these methods on the wrp instance with wrp.setListener(listener). The Unicode escapes in the result text are decoded with the text_ method; the complete code for text_ is published on GitHub.

public class WrpTester implements com.amivoice.wrp.WrpListener {
	public static void main(String[] args) {
		// ... (excerpt)
		// Create WebSocket speech recognition server event listener
		com.amivoice.wrp.WrpListener listener = new WrpTester(verbose);
		wrp.setListener(listener);
		// ...
	}

	@Override
	public void utteranceStarted(int startTime) {
		System.out.println("S " + startTime);
	}

	@Override
	public void utteranceEnded(int endTime) {
		System.out.println("E " + endTime);
	}

	@Override
	public void resultCreated() {
		System.out.println("C");
	}

	@Override
	public void resultUpdated(String result) {
		System.out.println("U " + result);
		String text = text_(result);
		if (text != null) {
			System.out.println(" -> " + text);
		}
	}

	@Override
	public void resultFinalized(String result) {
		System.out.println("F " + result);
		String text = text_(result);
		if (text != null) {
			System.out.println(" -> " + text);
		}
	}

	// ...
}
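The text_ method referenced above decodes the Unicode escapes in the result text. As an illustration of what such decoding involves, here is a hypothetical standalone helper (UnicodeDecodeDemo and decodeUnicodeEscapes are illustrative names, not part of the Wrp library; see the GitHub repository for the actual text_ implementation):

```java
public class UnicodeDecodeDemo {
	// Decode backslash-uXXXX escape sequences into the characters they
	// represent; assumes the escapes are well-formed four-digit hex.
	static String decodeUnicodeEscapes(String s) {
		StringBuilder out = new StringBuilder();
		int i = 0;
		while (i < s.length()) {
			char c = s.charAt(i);
			if (c == '\\' && i + 6 <= s.length() && s.charAt(i + 1) == 'u') {
				out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
				i += 6;
			} else {
				out.append(c);
				i++;
			}
		}
		return out.toString();
	}

	public static void main(String[] args) {
		// Decodes a Unicode-escaped Japanese greeting
		System.out.println(decodeUnicodeEscapes("\\u3053\\u3093\\u306b\\u3061\\u306f")); // prints こんにちは
	}
}
```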

3. Connection (connect)

Connect to the speech recognition server.

The parameter that must be set before calling this method is as follows:

  • serverURL ... WebSocket interface endpoint

Please specify the following URL:

wss://acp-api.amivoice.com/v1/     (logging)
wss://acp-api.amivoice.com/v1/nolog/ (no logging)

You can adjust the behavior by setting the following parameters:

  • proxyServerName ... Specify when the client program connects through a proxy server
  • connectTimeout ... Connection timeout with the server. Unit is milliseconds.
  • receiveTimeout ... Timeout for receiving data from the server. Unit is milliseconds.

For server-side timeouts, please see Limitations.

The following code connects to the speech recognition server. The detailed error message from wrp.getLastMessage() is printed only when verbose is false.

wrp.setServerURL(serverURL);
wrp.setProxyServerName(proxyServerName);
wrp.setConnectTimeout(connectTimeout);
wrp.setReceiveTimeout(receiveTimeout);

// Connect to WebSocket speech recognition server
if (!wrp.connect()) {
	if (!verbose) {
		System.out.println(wrp.getLastMessage());
	}
	System.out.println("Failed to connect to the WebSocket speech recognition server " + serverURL + ".");
	return;
}

4. Speech Recognition Request (feedDataResume)

Send a speech recognition request. This method blocks until the connection to the speech recognition server specified by the request parameters is established and preparation such as word registration is complete. On failure it returns an error; for details, see the error messages of the s command response packet.

The following parameters must be set before calling this method (they correspond to the setCodec, setGrammarFileNames, and setAuthorization calls in the code below):

  • codec ... Audio format of the audio data to be sent
  • grammarFileNames ... Speech recognition engine to connect to
  • authorization ... APPKEY used for authentication

The behavior can be adjusted by setting the following parameters:

  • profileId
  • profileWords
  • keepFillerToken
  • segmenterProperties
  • resultUpdatedInterval

For details, please see Parameter Details.

warning

To avoid deadlock, do not call the feedDataResume method from listener class methods such as resultFinalized that receive event notifications.

The following code sends a speech recognition request to the server with the above parameters and, if an error occurs, prints a message to standard output. The detailed error message from wrp.getLastMessage() is printed only when verbose is false.

wrp.setCodec(codec);
wrp.setGrammarFileNames(grammarFileNames);
wrp.setAuthorization(authorization);

// Start sending audio data to WebSocket speech recognition server
if (!wrp.feedDataResume()) {
	if (!verbose) {
		System.out.println(wrp.getLastMessage());
	}
	System.out.println("Failed to start sending audio data to the WebSocket speech recognition server.");
	break;
}

5. Sending Audio Data (feedData)

Next, send the audio data. This method does not block. If an error occurs on the server side, it will return an error on the next method call. For details on the error content, please see the error messages of the p command response packet. Once you start sending audio data, the listener class methods will be called according to the server-side processing.

info
  • Please send audio data that matches the codec specified in feedDataResume. Even if the format is different, it will not result in an error, but the response may take a very long time or the recognition results may not be obtained correctly.
  • If speech cannot be detected from the sent audio data, the listener class methods will not be called. Please check the following possible reasons:
    • No audio is included at all, or the volume is very low. Check if the recording system is not muted and if the volume settings are appropriate.
    • The audio format and audio data do not match. Check the Audio Formats.
note
  • The maximum size of audio data that can be sent in one feedData method call is 16MB. If the data size is larger than that, please split it.
  • Audio data can be split at any point. There is no need to be aware of wav chunk boundaries or mp3, flac, opus frame boundaries.
  • You cannot change the format of the data being sent midway. If you want to change the audio format, end with the e command and start a new speech recognition request from s. The same applies to audio files with headers; end with the e command for each file and start a new speech recognition request from s.
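The 16MB limit in the note above can be handled by splitting the byte array before each feedData call. The following is a self-contained sketch (ChunkDemo and splitIntoChunks are illustrative names, not part of the Wrp library); a small chunk size is used here for demonstration:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
	// Split data into chunks no larger than maxChunkBytes, so each chunk
	// stays under the per-call size limit of feedData.
	static List<byte[]> splitIntoChunks(byte[] data, int maxChunkBytes) {
		List<byte[]> chunks = new ArrayList<>();
		for (int offset = 0; offset < data.length; offset += maxChunkBytes) {
			int length = Math.min(maxChunkBytes, data.length - offset);
			byte[] chunk = new byte[length];
			System.arraycopy(data, offset, chunk, 0, length);
			chunks.add(chunk);
		}
		return chunks;
	}

	public static void main(String[] args) {
		// 10 bytes split into chunks of at most 4 bytes: 4 + 4 + 2
		List<byte[]> chunks = splitIntoChunks(new byte[10], 4);
		System.out.println(chunks.size()); // prints 3
	}
}
```

In practice, each element of the returned list would be passed to one feedData call with the 16MB (not 4-byte) limit.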

The following code reads audio data from the file specified by audioFileName and sends it to the WebSocket speech recognition server. If sleepTime is -1, the client instead sleeps until the number of recognition results still pending drops to 1 or less (waiting up to 50 seconds). The detailed error message from wrp.getLastMessage() is printed only when verbose is false.

try (FileInputStream audioStream = new FileInputStream(audioFileName)) {
	// Read audio data from audio data file
	byte[] audioData = new byte[4096];
	int audioDataReadBytes = audioStream.read(audioData, 0, audioData.length);
	while (audioDataReadBytes > 0) {
		// Check if sleep time has been calculated
		if (sleepTime >= 0) {
			// If sleep time has been calculated...
			// Sleep for a short time
			wrp.sleep(sleepTime);
		} else {
			// If sleep time has not been calculated...
			// Sleep for a short time
			wrp.sleep(1);

			// Sleep until the number of waiting recognition results becomes 1 or less
			int maxSleepTime = 50000;
			while (wrp.getWaitingResults() > 1 && maxSleepTime > 0) {
				wrp.sleep(100);
				maxSleepTime -= 100;
			}
		}

		// Send audio data to WebSocket speech recognition server
		if (!wrp.feedData(audioData, 0, audioDataReadBytes)) {
			if (!verbose) {
				System.out.println(wrp.getLastMessage());
			}
			System.out.println("Failed to send audio data to the WebSocket speech recognition server.");
			break;
		}

		// Read audio data from audio data file
		audioDataReadBytes = audioStream.read(audioData, 0, audioData.length);
	}
} catch (IOException e) {
	System.out.println("Failed to read the audio data file " + audioFileName + ".");
}

6. End of Audio Data Transmission (feedDataPause)

Call this when audio data transmission is complete. This method blocks until speech recognition processing has finished. The request may fail; for details, see the error messages of the e command response packet.

warning

To avoid deadlock, do not call the feedDataPause method from listener class methods such as resultFinalized that receive event notifications.

info

If you send all the audio data at once instead of streaming it, the speech recognition processing takes time, so it will be a while before the results return. Expect roughly 0.5 to 1.5 times the duration of the audio you sent.

The following code informs the server that all audio data has been sent and blocks until the results are returned. The detailed error message from wrp.getLastMessage() is printed only when verbose is false.

// Complete sending audio data to the WebSocket speech recognition server
if (!wrp.feedDataPause()) {
	if (!verbose) {
		System.out.println(wrp.getLastMessage());
	}
	System.out.println("Failed to complete sending audio data to the WebSocket speech recognition server.");
	break;
}

7. Disconnection (disconnect)

Finally, disconnect from the speech recognition server.

// Disconnect from the WebSocket speech recognition server
wrp.disconnect();

Client Program State Transition

The client program undergoes the following state transitions.
