Rough voice evaluation dataset

This repository contains hand annotations for 11 selected files from the VCTK corpus.

Data structure

As VCTK is freely available, this repository provides only the annotation files. Please download the original VCTK corpus from the dataset website.

The annotations are in the subfolder VCTK-Corpus-0.92/rough_voice, where they follow the same file structure as the VCTK database: <speaker_id>/<speaker_id>_<file_id:3d>.txt.
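As a rough illustration of that layout, the sketch below builds the path to one annotation file. It assumes the file id is zero-padded to three digits, as in VCTK utterance names, and that the annotation root sits relative to the current directory. The helper name `annotation_path` is hypothetical and not part of src/files.py, and the speaker `p225` is only an example that may not be among the annotated files.

```python
from pathlib import Path

# Assumed location of the annotation root, relative to the repository checkout.
ANNOTATION_ROOT = Path("VCTK-Corpus-0.92") / "rough_voice"


def annotation_path(speaker_id: str, file_id: int, root: Path = ANNOTATION_ROOT) -> Path:
    """Build <root>/<speaker_id>/<speaker_id>_<file_id>.txt.

    Assumption: file_id is zero-padded to three digits (e.g. 1 -> "001").
    """
    return root / speaker_id / f"{speaker_id}_{file_id:03d}.txt"


# Hypothetical speaker and file id, for illustration only.
print(annotation_path("p225", 1).as_posix())
# VCTK-Corpus-0.92/rough_voice/p225/p225_001.txt
```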

The annotation files are space-separated text files with two columns: time and value.

TIME	Time (s)
VALUE	Rough voice annotation: either 0 (no rough voice) or 1 (rough voice)
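A minimal sketch of reading this format without the repository's helpers might look like the following. It relies only on the two-column layout described above. The function names are hypothetical, the sample values are made up, and whether the real files are sampled at a fixed rate or list only change points is not assumed here.

```python
from typing import List, Tuple


def parse_annotation(text: str) -> List[Tuple[float, int]]:
    """Parse space-separated "time value" rows into (time_s, value) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            time_s, value = float(fields[0]), int(float(fields[1]))
        except ValueError:
            continue  # skip a possible non-numeric header line
        rows.append((time_s, value))
    return rows


def rough_segments(rows: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Group consecutive rows with value 1 into (first_time, last_time) pairs."""
    segments = []
    start = last = None
    for time_s, value in rows:
        if value == 1:
            if start is None:
                start = time_s
            last = time_s
        elif start is not None:
            segments.append((start, last))
            start = None
    if start is not None:
        segments.append((start, last))
    return segments


# Made-up sample content, for illustration only.
example = "0.00 0\n0.01 1\n0.02 1\n0.03 0\n0.04 1\n"
print(rough_segments(parse_annotation(example)))
# [(0.01, 0.02), (0.04, 0.04)]
```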

Convenience functions in src/files.py let you read and parse the annotation files and get the corresponding audio files.

Examples

To verify that the annotations can be read and work properly, run the script src/plot.py, which plots the spectrograms of the audio files and highlights the rough voice sections in colour. It should produce something like this:

Annotations

The annotations were created by a mix of acoustic and visual inspection of the audio files and their spectrograms.

Rough voice type

Our goal was to annotate 3 types of rough voice:

Subharmonics
Jitter
Shimmer

Under good conditions (low noise, long segments), subharmonics can be clearly distinguished from structural noise (jitter and shimmer), but many cases were not that clear. Therefore, the annotation does not distinguish between these types of rough voice.

Annotation criteria

We used the spectrogram to detect possible candidates for rough voice. The candidates were then analysed by listening and by inspecting the time domain signal. We only annotated subharmonics where the rough voice was clearly audible or where irregular (missing) pulses could be found in the time domain.

What doesn't count

Several phenomena can look very similar to rough voice in the spectrogram due to the noisiness, but do not qualify as rough voice. In particular, we exclude:

Onsets: Onsets in general can have a wide frequency spectrum that can be mistaken for subharmonics.
Trills: Trills can introduce additional periodicity, but their origin is not in the glottis, so they do not count as rough voice.
Voiced fricatives / transitions between vowels and consonants: Additive noise stemming from fricatives or affricates can look like jitter or shimmer when superimposed on phonation. As the noise is additive and not structural, it does not indicate irregular phonation.

Temporal precision

The onsets and offsets of the rough voice segments were not easy to determine. Some of the rough voice segments have only one pulse missing. Other cases were difficult to spot in the time domain and were only visible with large window sizes. Both cases make temporal precision difficult. We therefore added a small margin of a few periods, giving a larger region in which rough voice may be detected.
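To give a feel for the size of such a margin and how a scorer might tolerate imprecise boundaries, here is a small sketch. The period count (3) and fundamental frequency (100 Hz) are illustrative assumptions, since the text above says only "a few periods". The overlap rule is one possible scoring choice, not the procedure used to create or evaluate these annotations, and both function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_s, end_s)


def margin_seconds(n_periods: int, f0_hz: float) -> float:
    """Duration of n_periods glottal periods at fundamental frequency f0_hz."""
    return n_periods / f0_hz


def overlaps(detected: Interval, annotated: Interval, tolerance_s: float = 0.0) -> bool:
    """True if the detected interval touches the annotated one after the
    annotated interval is widened by tolerance_s on both sides."""
    d_start, d_end = detected
    a_start, a_end = annotated
    return d_start <= a_end + tolerance_s and d_end >= a_start - tolerance_s


# Illustrative numbers only: three periods at 100 Hz last 30 ms.
print(margin_seconds(3, 100.0))
# 0.03
```

Because the annotated segments already include a margin, counting any overlap as a hit (with `tolerance_s` left at 0) may be sufficient. The extra tolerance is there only for stricter or looser scoring experiments.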
