Speaker Diarization
Overview
What is Speaker Diarization
Speaker Diarization is a function that estimates "who is speaking which part" in audio containing multiple speakers. For example, it is useful for distinguishing speakers for each utterance when recording a meeting with multiple participants using a single microphone.
The following figure shows an example of recording a meeting where Mr. Tanaka and Ms. Yamada are talking using a single microphone. In this audio, the utterances of both people are recorded on a single track.
By using the speaker diarization function of AmiVoice API, you can distinguish each utterance by speaker as shown in the figure below, such as "this segment is an utterance by 'speaker0'", "this segment is an utterance by another speaker 'speaker1'".
This function does not identify individuals. Therefore, the application needs to handle the correspondence between "speaker0" and Ms. Yamada, "speaker1" and Mr. Tanaka.
About the API
To use speaker diarization, specify option parameters when making a speech recognition request. The results of speaker diarization are obtained per word: in the speech recognition response, a `label` is added to each word-level result, set to a label that distinguishes speakers, such as "speaker0" or "speaker1".
Example of speaker diarization results:
"tokens": [
  {
    "written": "アドバンスト・メディア",
    "confidence": 1,
    "starttime": 522,
    "endtime": 1578,
    "spoken": "あどばんすとめでぃあ",
    "label": "speaker0"
  },
  {
    "written": "は",
    "confidence": 1,
    "starttime": 1578,
    "endtime": 1834,
    "spoken": "は",
    "label": "speaker0"
  },
  {
    "written": "、",
    "confidence": 0.95,
    "starttime": 1834,
    "endtime": 2010,
    "spoken": "_",
    "label": "speaker0"
  },
  /* Omitted below */
Usage
To use speaker diarization, set the request parameters shown in the table below when making a speech recognition request.
Table. Request parameters for speaker diarization
Interface | Parameter to enable | Parameters for adjustment |
---|---|---|
Sync HTTP / WebSocket | Set `useDiarizer=1` in `segmenterProperties` | `diarizerTransitionBias`, `diarizerAlpha` in `segmenterProperties` |
Async HTTP | Set `speakerDiarization=True` | `diarizationMinSpeaker`, `diarizationMaxSpeaker` |
Please note that the option parameters specified at the time of request differ depending on the interface. The results obtained are in the same format regardless of the interface.
The reason the request parameters differ is that the speaker diarization method differs between the synchronous HTTP / WebSocket interfaces and the asynchronous HTTP interface. In the synchronous HTTP and WebSocket interfaces, speaker diarization is performed while detecting utterance segments in the audio stream. Therefore, the settings are made in `segmenterProperties`, the parameter for utterance segment detection.
On the other hand, for the asynchronous HTTP interface, speaker diarization is performed with the entire audio file available. Settings are made using parameters specific to asynchronous HTTP.
In this section, we first explain how to enable speaker diarization for each interface, and then describe the parameters for improving accuracy.
Request
We explain how to make a request with speaker diarization enabled for each interface.
Synchronous HTTP Interface
To enable speaker diarization, set `useDiarizer=1` in `segmenterProperties`. In the synchronous HTTP interface, `segmenterProperties` is set in the `d` parameter of the request parameters.
Let's explain with an example. Without speaker diarization, to perform speech recognition with the general-purpose engine on the audio file included with the AmiVoice API sample program, you would run the following curl command:
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general" \
-F a=@test.wav
To enable speaker diarization, do the following:
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general segmenterProperties=useDiarizer=1" \
-F a=@test.wav
The order of parameters is not important here. You can also set other parameters together. For details on how to make requests using the synchronous HTTP interface, please see Sending Speech Recognition Requests.
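The same request can be made from any HTTP client. Below is a minimal Python sketch that builds the same multipart form fields as the curl example; the helper name `build_sync_fields` is our own, while the endpoint and the field names (`u`, `d`, `a`) come from the curl example above. Actually sending the request (e.g. with the third-party `requests` library) is shown only as a comment.

```python
def build_sync_fields(app_key, engine="-a-general", diarize=False):
    """Build the u/d form fields for a synchronous HTTP request.

    Mirrors the curl example above: `u` is the APP key and `d` carries
    the engine name plus any options; the audio file goes in a
    separate `a` field.
    """
    d = f"grammarFileNames={engine}"
    if diarize:
        # Enable speaker diarization via segmenterProperties.
        d += " segmenterProperties=useDiarizer=1"
    return {"u": app_key, "d": d}

# With the `requests` library this could be posted as:
#   requests.post("https://acp-api.amivoice.com/v1/recognize",
#                 data=build_sync_fields(APP_KEY, diarize=True),
#                 files={"a": open("test.wav", "rb")})
print(build_sync_fields("{APP_KEY}", diarize=True)["d"])
```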
WebSocket Interface
To enable speaker diarization, set `useDiarizer=1` in `segmenterProperties`. In the WebSocket interface, `segmenterProperties` is set in the `s` command sent first after connecting via WebSocket.
Let's explain with an example. When not using speaker diarization, to perform speech recognition with the general-purpose engine, you would make a request as follows:
s 16K -a-general authorization={APPKEY}
To enable speaker diarization, add parameters as follows:
s 16K -a-general authorization={APPKEY} segmenterProperties=useDiarizer=1
While the `s` command must specify the audio format and engine name first, the order of the subsequent parameters can be changed. You can also set other parameters together. For information on how to make requests using the WebSocket interface, please see Starting Recognition Request.
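As a sketch, the `s` command string could be assembled in Python like this (the helper `build_s_command` is our own; the command layout follows the examples above, with the audio format and engine name fixed at the front):

```python
def build_s_command(app_key, audio_format="16K", engine="-a-general",
                    diarize=False):
    """Assemble the `s` command sent first over the WebSocket.

    The audio format and engine name must come first; the parameters
    after them may appear in any order.
    """
    parts = ["s", audio_format, engine, f"authorization={app_key}"]
    if diarize:
        # Enable speaker diarization via segmenterProperties.
        parts.append("segmenterProperties=useDiarizer=1")
    return " ".join(parts)

print(build_s_command("{APPKEY}", diarize=True))
```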
Asynchronous HTTP Interface
To enable speaker diarization, add `speakerDiarization=True` to the `d` parameter of the request parameters.
Let's explain with an example. Without speaker diarization, to perform speech recognition with the general-purpose engine on the audio file included with the AmiVoice API sample program, you would run the following curl command:
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general" \
-F a=@test.wav
To enable speaker diarization, do the following:
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general speakerDiarization=True" \
-F a=@test.wav
The order of parameters is not important here. You can also set other parameters together. For information on how to make requests using the asynchronous HTTP interface, please see 1. Create a Speech Recognition Job.
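As with the synchronous case, the `d` parameter is just a space-separated string, so it can be assembled programmatically. A small Python sketch (the helper `build_async_d` is our own; the parameter names come from the table above, including the optional speaker-count parameters described later in this document):

```python
def build_async_d(engine="-a-general", diarize=False,
                  min_speaker=None, max_speaker=None):
    """Build the `d` parameter for an asynchronous HTTP request.

    min_speaker / max_speaker map to the optional
    diarizationMinSpeaker / diarizationMaxSpeaker parameters.
    """
    parts = [f"grammarFileNames={engine}"]
    if diarize:
        parts.append("speakerDiarization=True")
        if min_speaker is not None:
            parts.append(f"diarizationMinSpeaker={min_speaker}")
        if max_speaker is not None:
            parts.append(f"diarizationMaxSpeaker={max_speaker}")
    return " ".join(parts)

print(build_async_d(diarize=True))
```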
Response
We explain the response when speaker diarization is enabled.
The results of speaker diarization are obtained as the `label` in each `token`, the word-level result. The `label` is a string that distinguishes speakers, consisting of "speaker" followed by a number: speaker0, speaker1, speaker2, ... speakerN.
The speaker label numbers may have gaps. For example, the labels for 3 speakers might be output as speaker0, speaker1, speaker3. Do not assume that the numbers will appear in order starting from 0.
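Because of these gaps, a robust way to obtain consecutive speaker indices is to renumber the labels in order of first appearance rather than parsing the digits. A minimal Python sketch (the function name is our own):

```python
def normalize_labels(tokens):
    """Map each speaker label to a consecutive index (0, 1, 2, ...)
    in order of first appearance, tolerating gaps in the numbering."""
    mapping = {}
    for tok in tokens:
        mapping.setdefault(tok["label"], len(mapping))
    return mapping

tokens = [{"label": l} for l in ["speaker0", "speaker1", "speaker3"]]
print(normalize_labels(tokens))  # {'speaker0': 0, 'speaker1': 1, 'speaker3': 2}
```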
Example of speaker diarization results:
"tokens": [
  {
    "written": "アドバンスト・メディア",
    "confidence": 1,
    "starttime": 522,
    "endtime": 1578,
    "spoken": "あどばんすとめでぃあ",
    "label": "speaker0"
  },
  {
    "written": "は",
    "confidence": 1,
    "starttime": 1578,
    "endtime": 1834,
    "spoken": "は",
    "label": "speaker0"
  },
  {
    "written": "、",
    "confidence": 0.95,
    "starttime": 1834,
    "endtime": 2010,
    "spoken": "_",
    "label": "speaker0"
  },
  /* Omitted below */
For information on the format of speech recognition results, please see Speech Recognition Results.
Enabling speaker diarization may cause the response to take longer. If you don't need to use speaker labels, keep it disabled.
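Since the labels are attached per word, an application typically groups consecutive tokens that share a `label` into utterances. A Python sketch of that grouping, using a shortened version of the token format above (the function name and the sample `speaker1` token are our own):

```python
from itertools import groupby

def group_by_speaker(tokens):
    """Group consecutive word-level tokens with the same label
    into per-speaker utterances."""
    turns = []
    for label, toks in groupby(tokens, key=lambda t: t["label"]):
        toks = list(toks)
        turns.append({
            "label": label,
            "starttime": toks[0]["starttime"],
            "endtime": toks[-1]["endtime"],
            "text": "".join(t["written"] for t in toks),
        })
    return turns

tokens = [
    {"written": "アドバンスト・メディア", "starttime": 522, "endtime": 1578, "label": "speaker0"},
    {"written": "は", "starttime": 1578, "endtime": 1834, "label": "speaker0"},
    {"written": "はい", "starttime": 2100, "endtime": 2400, "label": "speaker1"},
]
print(group_by_speaker(tokens))
```

Note that `itertools.groupby` only groups adjacent tokens, which is exactly what is wanted here: a speaker who talks again later starts a new turn.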
Adjustment Parameters for Improving Accuracy
We explain the parameters for adjusting the results of speaker diarization. As summarized in the "Table. Request Parameters for Speaker Diarization" mentioned earlier, the parameters that can be adjusted differ depending on the interface.
Adjusting the Ease of Speaker Detection
When using the synchronous HTTP and WebSocket interfaces, you can adjust the ease of speaker detection using two parameters of the `segmenterProperties` property.
Parameter | Possible Values | Approximate Range | Default Value | Description |
---|---|---|---|---|
diarizerAlpha | 0 or more | 1e-100 to 1e50 | 1 | Ease of new speaker appearance |
diarizerTransitionBias | 0 to less than 1 | 1e-150 to 1e-10 | 1e-40 (1e-20 for 8kHz) | Ease of speaker switching |
The notation 1e represents a power of 10; for example, 1e-100 represents 10⁻¹⁰⁰. Set it like `diarizerAlpha=1e-100`. Please also see the request samples below.
diarizerAlpha
This is a parameter that controls the ease of new speaker appearance. The larger the value, the easier it is for new speakers to appear, and the smaller the value, the harder it is for new speakers to appear.
`diarizerAlpha=0` is special and is treated as if 1e0 (i.e., 1) were specified. If nothing is set, it is treated as if `diarizerAlpha=0` had been specified.

- If the number of speakers in the result is too high compared to the actual number, try reducing `diarizerAlpha` from the default (1e0) to 1e-10, 1e-20, etc., and check whether the result improves.
- If the number of speakers in the result is too low compared to the actual number, try increasing `diarizerAlpha` from the default (1e0) to 1e10, 1e20, etc., and check whether the result improves.
diarizerTransitionBias
This is a parameter that controls the ease of speaker switching. The larger the value, the easier it is for speakers to switch, and the smaller the value, the harder it is for speakers to switch.
`diarizerTransitionBias=0` is special and is treated as if 1e-40 were specified; however, for engines that support 8kHz audio (for example, when using the general-purpose engine `-a-general` and sending 8kHz audio), it is treated as if 1e-20 were specified. If nothing is set, it is treated as if `diarizerTransitionBias=0` had been specified.

- If the speaker changes frequently in the result even though the same person is actually speaking continuously, try reducing `diarizerTransitionBias` from the default to 1e-50, 1e-60, etc., and check whether the result improves.
- If a single speaker continues in the result even though multiple people are speaking, try increasing `diarizerTransitionBias` from the default to 1e-10, etc., and check whether the result improves.
Setting Example
Here's an example of enabling speaker diarization and setting `diarizerAlpha` to 1e-20 and `diarizerTransitionBias` to 1e-10. Since multiple parameters are being set in `segmenterProperties`, the parameters are separated by spaces.
Example of setting for synchronous HTTP interface using curl command
curl https://acp-api.amivoice.com/v1/recognize \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general segmenterProperties=useDiarizer=1%20diarizerAlpha=1e-20%20diarizerTransitionBias=1e-10" \
-F a=@test.wav
URL-encode the spaces between the parameters set in `segmenterProperties` as %20.
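If you build the `d` parameter programmatically, the standard library can handle this encoding. A Python sketch reproducing the value used in the curl example above (the helper name is our own; `safe="="` keeps the key=value equals signs unencoded):

```python
from urllib.parse import quote

def encode_segmenter_properties(*props):
    """Join segmenterProperties values with spaces and URL-encode
    the spaces as %20, leaving '=' characters intact."""
    return "segmenterProperties=" + quote(" ".join(props), safe="=")

print(encode_segmenter_properties(
    "useDiarizer=1", "diarizerAlpha=1e-20", "diarizerTransitionBias=1e-10"))
# segmenterProperties=useDiarizer=1%20diarizerAlpha=1e-20%20diarizerTransitionBias=1e-10
```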
Example of setting for WebSocket interface
s 16K -a-general authorization={APPKEY} segmenterProperties="useDiarizer=1 diarizerAlpha=1e-20 diarizerTransitionBias=1e-10"
Enclose the entire set of parameters for `segmenterProperties` in double quotes, like "...".
Specifying the Number of Speakers
When using the asynchronous HTTP interface, you can improve the accuracy of speaker diarization by narrowing down the range of the number of speakers in the audio.
Parameter | Possible Values | Default Value | Description |
---|---|---|---|
diarizationMinSpeaker | 1 to 20 | 1 | Minimum expected number of speakers |
diarizationMaxSpeaker | 1 to 20 | 10 | Maximum expected number of speakers |
diarizationMinSpeaker
This is the minimum number of people expected to be in the audio.
diarizationMaxSpeaker
This is the maximum number of people expected to be in the audio.
If you know the number of speakers in the audio in advance, setting it accurately can significantly improve the estimation accuracy. For example, if you know that 5 people will participate in the meeting you're about to record, add `diarizationMinSpeaker=5` and `diarizationMaxSpeaker=5` to the `d` parameter when making the request.
Setting Example
Here's an example of enabling speaker diarization and setting both `diarizationMinSpeaker` and `diarizationMaxSpeaker` to 5.
Example of setting for asynchronous HTTP interface using curl command
curl https://acp-api-async.amivoice.com/v1/recognitions \
-F u={APP_KEY} \
-F d="grammarFileNames=-a-general speakerDiarization=True diarizationMinSpeaker=5 diarizationMaxSpeaker=5" \
-F a=@test.wav
Tips for Improving Accuracy
Here are some tips for improving the accuracy of speaker diarization.
Setting the Number of Speakers
For the asynchronous HTTP interface, you can specify the number of speakers. If possible, setting the number of speakers in the audio in advance can improve accuracy.
Improving Audio Quality
Speaker diarization tends to decrease in accuracy as the audio quality worsens. Improving the recording environment to avoid noise and echo can potentially improve accuracy.
Avoiding Simultaneous Speech by Multiple Speakers
It becomes difficult to estimate speakers when multiple speakers are talking simultaneously. While it may be challenging depending on the application's use case, if you can devise ways to prevent users from talking at the same time, accuracy will improve.
Points to Note
When Multiple Speakers Speak Simultaneously
For segments where multiple speakers are speaking simultaneously, the correct behavior would be not to assign a single speaker; however, the system returns one of the speaker labels "speakerN".
Noise
For audio files, first, speech detection is performed, and then speech recognition and speaker diarization are carried out on the detected speech segments. If noise is mistakenly detected as a speech segment, one of the speaker labels "speakerN" will be returned.
Speaker Labels Are Independent for Each Request
Speaker labels are independent for each request. For example, if you divide the audio recorded in a meeting into two halves and make two requests, you will receive two responses. However, the speaker labels included in each response may not refer to the same speakers. Applications should associate speaker labels with speakers for each response obtained from different requests. Alternatively, send the audio as a single request.
Limitations
Maximum Number of Distinguishable Speakers
The system can distinguish up to 20 speakers.
For Asynchronous HTTP
When speaker diarization is enabled, the maximum length of audio that can be sent is 3 hours. If you send audio data longer than this, you will receive an error at the time of the request.
When not using speaker diarization, the audio that can be sent to the asynchronous HTTP interface is limited by size, and you can send up to approximately 2.14GB of audio data.
Sample Program
We have published a Windows application that uses speaker diarization with the AmiVoice API's asynchronous HTTP interface.
For an explanation of how to create the above sample application, please see How to Use AmiVoice's Speaker Diarization in C# [HttpClient] on the AmiVoice Tech Blog.