
Speaker Diarization

Overview

What is Speaker Diarization?

Speaker Diarization is a function that estimates "who spoke which part" of audio containing multiple speakers. For example, it is useful for attributing each utterance to a speaker when a meeting with multiple participants is recorded with a single microphone.

The following figure shows an example of a meeting between Mr. Tanaka and Ms. Yamada, recorded with a single microphone. In this audio, the utterances of both people are recorded on a single track.

Figure. Illustration of Speaker Diarization

By using the speaker diarization function of the AmiVoice API, you can attribute each utterance to a speaker as shown in the figure below: for example, "this segment is an utterance by 'speaker0'" and "this segment is an utterance by another speaker, 'speaker1'".

Figure. Illustration of Speaker Diarization

This function does not identify individuals. The application therefore needs to handle the correspondence between labels and people itself, for example "speaker0" to Ms. Yamada and "speaker1" to Mr. Tanaka.

About the API

To use speaker diarization, specify option parameters when making a speech recognition request. The diarization results are obtained per word: in the speech recognition response, a label field is added to each word-level result and set to a value that distinguishes speakers, such as "speaker0" or "speaker1".

Example of speaker diarization results:

    "tokens": [
{
"written": "アドバンスト・メディア",
"confidence": 1,
"starttime": 522,
"endtime": 1578,
"spoken": "あどばんすとめでぃあ",
"label": "speaker0"
},
{
"written": "は",
"confidence": 1,
"starttime": 1578,
"endtime": 1834,
"spoken": "は",
"label": "speaker0"
},
{
"written": "、",
"confidence": 0.95,
"starttime": 1834,
"endtime": 2010,
"spoken": "_",
"label": "speaker0"
},
/* Omitted below */
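
As an illustration (this is a sketch, not official AmiVoice sample code), the per-word labels can be grouped into per-speaker segments by concatenating consecutive tokens that share the same label:

# Group word-level tokens into per-speaker segments using the "label" field.
# The `tokens` list mirrors the response excerpt above.
tokens = [
    {"written": "アドバンスト・メディア", "starttime": 522, "endtime": 1578, "label": "speaker0"},
    {"written": "は", "starttime": 1578, "endtime": 1834, "label": "speaker0"},
    {"written": "、", "starttime": 1834, "endtime": 2010, "label": "speaker0"},
]

segments = []  # (label, starttime, endtime, text)
for token in tokens:
    if segments and segments[-1][0] == token["label"]:
        # Same speaker as the previous token: extend the current segment.
        label, start, _, text = segments[-1]
        segments[-1] = (label, start, token["endtime"], text + token["written"])
    else:
        # Speaker changed (or first token): start a new segment.
        segments.append((token["label"], token["starttime"], token["endtime"], token["written"]))

for label, start, end, text in segments:
    print(f"{label} [{start}-{end} ms]: {text}")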

Usage

To use speaker diarization, set the request parameters shown in the table below when making a speech recognition request.

Table. Request parameters for speaker diarization

| Interface | Parameter to enable | Parameters for adjustment |
| --- | --- | --- |
| Sync HTTP / WebSocket | Set useDiarizer=1 in segmenterProperties | diarizerTransitionBias, diarizerAlpha in segmenterProperties |
| Async HTTP | Set speakerDiarization=True | diarizationMinSpeaker, diarizationMaxSpeaker |
note

Please note that the option parameters specified at the time of request differ depending on the interface. The results obtained are in the same format regardless of the interface.

The request parameters differ because the diarization method differs between the synchronous HTTP and WebSocket interfaces on the one hand and the asynchronous HTTP interface on the other. In the synchronous HTTP and WebSocket interfaces, speaker diarization is performed while detecting utterance segments in the audio stream, so it is configured through segmenterProperties, the parameter for utterance segment detection.

On the other hand, the asynchronous HTTP interface performs speaker diarization with the entire audio file available, so it is configured with parameters specific to asynchronous HTTP.

In this section, we first explain how to enable speaker diarization for each interface, and then describe the parameters for improving accuracy.

Request

We explain how to make a request with speaker diarization enabled for each interface.

Synchronous HTTP Interface

To enable speaker diarization, set useDiarizer=1 in segmenterProperties. In synchronous HTTP, segmenterProperties is set in the d parameter of the request parameters.

Let's explain with an example. Without speaker diarization, to perform speech recognition with the general-purpose engine on the audio included with the AmiVoice API sample program, you would execute the following curl command:

curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general" \
-F a=@test.wav

To enable speaker diarization, do the following:

curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general segmenterProperties=useDiarizer=1" \
-F a=@test.wav

The order of parameters is not important here. You can also set other parameters together. For details on how to make requests using the synchronous HTTP interface, please see Sending Speech Recognition Requests.
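
For reference, here is a minimal Python sketch of the same request using the third-party requests library (an illustration, not official sample code); APP_KEY and test.wav are placeholders:

# Synchronous HTTP request with speaker diarization enabled.
import requests

APP_KEY = "YOUR_APP_KEY"  # placeholder: your AmiVoice API app key

with open("test.wav", "rb") as f:
    response = requests.post(
        "https://acp-api.amivoice.com/v1/recognize",
        data={
            "u": APP_KEY,
            # useDiarizer=1 in segmenterProperties enables speaker diarization.
            "d": "grammarFileNames=-a-general segmenterProperties=useDiarizer=1",
        },
        files={"a": f},  # the audio file, sent as the a parameter
    )
print(response.json())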

WebSocket Interface

To enable speaker diarization, set useDiarizer=1 in segmenterProperties. In the WebSocket interface, segmenterProperties is set in the s command sent first after connecting via WebSocket.

Let's explain with an example. When not using speaker diarization, to perform speech recognition with the general-purpose engine, you would make a request as follows:

s 16K -a-general authorization={APPKEY}

To enable speaker diarization, add parameters as follows:

s 16K -a-general authorization={APPKEY} segmenterProperties=useDiarizer=1

In the s command, the audio format and engine name must come first; the parameters that follow them can appear in any order. You can also set other parameters together. For information on how to make requests using the WebSocket interface, please see Starting Recognition Request.
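
For reference, a minimal Python sketch of sending this s command is shown below. It uses the third-party websockets package and assumes the wss://acp-api.amivoice.com/v1/ endpoint (inferred from the HTTP interface URLs); check the WebSocket interface documentation for the authoritative connection details.

# Open a WebSocket session and enable speaker diarization via the s command.
import asyncio
import websockets

APP_KEY = "YOUR_APP_KEY"  # placeholder

async def main():
    # Endpoint assumed from the HTTP interface URLs; verify against the docs.
    async with websockets.connect("wss://acp-api.amivoice.com/v1/") as ws:
        # Audio format and engine name come first; later parameters may
        # appear in any order.
        await ws.send(
            f"s 16K -a-general authorization={APP_KEY} "
            "segmenterProperties=useDiarizer=1"
        )
        print(await ws.recv())  # the server's response to the s command

asyncio.run(main())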

Asynchronous HTTP Interface

To enable speaker diarization, add speakerDiarization=True to the d parameter of the request parameters.

Let's explain with an example. Without speaker diarization, to perform speech recognition with the general-purpose engine on the audio included with the AmiVoice API sample program, you would execute the following curl command:

curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general" \
-F a=@test.wav

To enable speaker diarization, do the following:

curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general speakerDiarization=True" \
-F a=@test.wav

The order of parameters is not important here. You can also set other parameters together. For information on how to make requests using the asynchronous HTTP interface, please see 1. Create a Speech Recognition Job.
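
For reference, a minimal Python sketch of the same job-creation request using the third-party requests library (an illustration, not official sample code):

# Create an asynchronous recognition job with speaker diarization enabled.
import requests

APP_KEY = "YOUR_APP_KEY"  # placeholder

with open("test.wav", "rb") as f:
    response = requests.post(
        "https://acp-api-async.amivoice.com/v1/recognitions",
        data={
            "u": APP_KEY,
            "d": "grammarFileNames=-a-general speakerDiarization=True",
        },
        files={"a": f},
    )
# The response contains a session id used to retrieve the result later.
print(response.json())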

Response

We explain the response when speaker diarization is enabled. The diarization result is obtained as the label field of each token, the word-level result. The label is a string that distinguishes speakers by a number following "speaker": speaker0, speaker1, speaker2, ..., speakerN.

note

The speaker label numbers may have gaps. For example, the labels for 3 speakers might be output as speaker0, speaker1, speaker3. Do not assume that the numbers will appear in order starting from 0.

Example of speaker diarization results:

    "tokens": [
{
"written": "アドバンスト・メディア",
"confidence": 1,
"starttime": 522,
"endtime": 1578,
"spoken": "あどばんすとめでぃあ",
"label": "speaker0"
},
{
"written": "は",
"confidence": 1,
"starttime": 1578,
"endtime": 1834,
"spoken": "は",
"label": "speaker0"
},
{
"written": "、",
"confidence": 0.95,
"starttime": 1834,
"endtime": 2010,
"spoken": "_",
"label": "speaker0"
},
/* Omitted below */

For information on the format of speech recognition results, please see Speech Recognition Results.
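
Because the label numbers may have gaps, an application that needs sequential numbering can remap the labels itself. A minimal Python sketch (illustrative only):

# Remap speaker labels to sequential indices in order of first appearance,
# e.g. speaker0, speaker1, speaker3 becomes speaker0, speaker1, speaker2.
def normalize_labels(tokens):
    mapping = {}
    for token in tokens:
        label = token.get("label")
        if label is not None and label not in mapping:
            mapping[label] = f"speaker{len(mapping)}"
    return [
        {**token, "label": mapping.get(token.get("label"), token.get("label"))}
        for token in tokens
    ]

tokens = [{"written": "は", "label": "speaker0"}, {"written": "、", "label": "speaker3"}]
print(normalize_labels(tokens))  # speaker3 is renamed to speaker1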

tip

Enabling speaker diarization may cause the response to take longer. If you don't need to use speaker labels, keep it disabled.

Adjustment Parameters for Improving Accuracy

We explain the parameters for adjusting the results of speaker diarization. As summarized in the "Table. Request Parameters for Speaker Diarization" mentioned earlier, the parameters that can be adjusted differ depending on the interface.

Adjusting the Ease of Speaker Detection

When using the synchronous HTTP and WebSocket interfaces, you can adjust the ease of speaker detection using two parameters set in segmenterProperties.

| Parameter | Possible Values | Approximate Range | Default Value | Description |
| --- | --- | --- | --- | --- |
| diarizerAlpha | 0 or more | 1e-100 to 1e50 | 1 | Ease of new speaker appearance |
| diarizerTransitionBias | 0 to less than 1 | 1e-150 to 1e-10 | 1e-40 (1e-20 for 8kHz) | Ease of speaker switching |
note

1e represents powers of 10. For example, 1e-100 represents 10^-100. Set it like diarizerAlpha=1e-100. Please also see the request samples below.

diarizerAlpha

This is a parameter that controls the ease of new speaker appearance. The larger the value, the easier it is for new speakers to appear, and the smaller the value, the harder it is for new speakers to appear.

diarizerAlpha=0 is special and is treated as if 1e0, i.e., 1, was specified. If nothing is set, it is treated as if diarizerAlpha=0 was specified.

tip
  • If the number of speakers in the result is too many compared to the actual number, try reducing diarizerAlpha from the default (1e0) to 1e-10, 1e-20, etc., and check if it improves.
  • If the number of speakers in the result is too few compared to the actual number, try increasing diarizerAlpha from the default (1e0) to 1e10, 1e20, etc., and check if it improves.
diarizerTransitionBias

This is a parameter that controls the ease of speaker switching. The larger the value, the easier it is for speakers to switch, and the smaller the value, the harder it is for speakers to switch.

diarizerTransitionBias=0 is special and is treated as if 1e-40 was specified. However, when 8kHz audio is sent to an engine that supports it, for example the general-purpose engine (-a-general), it is treated as if 1e-20 was specified. If nothing is set, it is treated as if diarizerTransitionBias=0 was specified.

tip
  • If multiple speakers are detected frequently even though the same person is actually speaking continuously, try reducing diarizerTransitionBias from the default to 1e-50, 1e-60, etc., and check if it improves.
  • If one speaker continues even though multiple people are speaking, try increasing diarizerTransitionBias from the default to 1e-10, etc., and check if it improves.
Setting Example

Here's an example of enabling speaker diarization and setting diarizerAlpha to 1e-20 and diarizerTransitionBias to 1e-10. Since multiple parameters are being set for segmenterProperties, each parameter is separated by a space.

Example of setting for synchronous HTTP interface using curl command
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general segmenterProperties=useDiarizer=1%20diarizerAlpha=1e-20%20diarizerTransitionBias=1e-10" \
-F a=@test.wav

URL encode the spaces in the parameters set for segmenterProperties as %20.

Example of setting for WebSocket interface
s 16K -a-general authorization={APPKEY} segmenterProperties="useDiarizer=1 diarizerAlpha=1e-20 diarizerTransitionBias=1e-10"

Enclose the entire set of parameters for segmenterProperties in double quotes like "...".
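
If you build these request strings programmatically, Python's standard urllib.parse.quote can produce the percent-encoded d value for the synchronous HTTP interface; a minimal sketch (keeping = characters unescaped so only the inner spaces become %20):

# Build the d parameter with a percent-encoded segmenterProperties value.
from urllib.parse import quote

segmenter_properties = "useDiarizer=1 diarizerAlpha=1e-20 diarizerTransitionBias=1e-10"
d = f"grammarFileNames=-a-general segmenterProperties={quote(segmenter_properties, safe='=')}"
print(d)
# grammarFileNames=-a-general segmenterProperties=useDiarizer=1%20diarizerAlpha=1e-20%20diarizerTransitionBias=1e-10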

Specifying the Number of Speakers

When using the asynchronous HTTP interface, you can improve the accuracy of speaker diarization by narrowing down the range of the number of speakers in the audio.

| Parameter | Possible Values | Default Value | Description |
| --- | --- | --- | --- |
| diarizationMinSpeaker | 1 to 20 | 1 | Minimum expected number of speakers |
| diarizationMaxSpeaker | 1 to 20 | 10 | Maximum expected number of speakers |
diarizationMinSpeaker

This is the minimum number of speakers expected in the audio.

diarizationMaxSpeaker

This is the maximum number of speakers expected in the audio.

tip

If you know in advance the number of speakers in the audio, setting it accurately can significantly improve the estimation accuracy. For example, if you know that 5 people will participate in the meeting you're about to start, add diarizationMinSpeaker=5 and diarizationMaxSpeaker=5 to the d parameter when making the request.

Setting Example

Here's an example of enabling speaker diarization and setting both diarizationMinSpeaker and diarizationMaxSpeaker to 5.

Example of setting for asynchronous HTTP interface using curl command
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general speakerDiarization=True diarizationMinSpeaker=5 diarizationMaxSpeaker=5" \
-F a=@test.wav
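
After the job is created, the application retrieves the result once the job completes. The sketch below polls for completion; the GET endpoint, Bearer authorization header, and status values are assumptions based on the asynchronous HTTP interface, so check the job-status documentation for the authoritative details.

# Poll the asynchronous job until it finishes, then read the result.
import time
import requests

APP_KEY = "YOUR_APP_KEY"   # placeholder
session_id = "SESSION_ID"  # placeholder: returned when the job was created

while True:
    r = requests.get(
        f"https://acp-api-async.amivoice.com/v1/recognitions/{session_id}",
        headers={"Authorization": f"Bearer {APP_KEY}"},  # assumed auth scheme
    )
    job = r.json()
    if job.get("status") in ("completed", "error"):  # assumed status values
        break
    time.sleep(10)  # poll at a modest interval

# On completion, each word-level token carries a "label" field per speaker.
print(job)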

Tips for Improving Accuracy

Here are some tips for improving the accuracy of speaker diarization.

Setting the Number of Speakers

For the asynchronous HTTP interface, you can specify the number of speakers. If possible, setting the number of speakers in the audio in advance can improve accuracy.

Improving Audio Quality

Speaker diarization tends to decrease in accuracy as the audio quality worsens. Improving the recording environment to avoid noise and echo can potentially improve accuracy.

Avoiding Simultaneous Speech by Multiple Speakers

It becomes difficult to estimate speakers when multiple speakers are talking simultaneously. While it may be challenging depending on the application's use case, if you can devise ways to prevent users from talking at the same time, accuracy will improve.

Points to Note

When Multiple Speakers Speak Simultaneously

For segments where multiple speakers are speaking simultaneously, no single speaker label is strictly correct, but the system will still return one of the speaker labels "speakerN".

Noise

For audio files, speech detection is performed first, and then speech recognition and speaker diarization are carried out on the detected speech segments. If noise is mistakenly detected as a speech segment, one of the speaker labels "speakerN" will still be returned for it.

Speaker Labels Are Independent for Each Request

Speaker labels are independent for each request. For example, if you split a meeting recording into two halves and send two requests, you will receive two responses, but a given label in one response may not refer to the same speaker as in the other. Applications must associate speaker labels with speakers separately for each response obtained from different requests, or send the audio as a single request.

Limitations

Maximum Number of Distinguishable Speakers

The system can distinguish up to 20 speakers.

For Asynchronous HTTP

When speaker diarization is enabled, the maximum length of audio that can be sent is 3 hours. If you send audio data longer than this, you will receive an error at the time of the request.

When not using speaker diarization, the audio that can be sent to the asynchronous HTTP interface is limited by size, and you can send up to approximately 2.14GB of audio data.

Sample Program

We have published a Windows application that uses speaker diarization with the AmiVoice API's asynchronous HTTP interface.

tip

For an explanation of how to create the above sample application, please see How to Use AmiVoice's Speaker Diarization in C# [HttpClient] on the AmiVoice Tech Blog.