This repository contains hand annotations for 11 selected files from the VCTK corpus.
As VCTK is freely available, this repository provides only the annotation files. Please download the original VCTK corpus from the dataset website.
The annotations can be found in the subfolder
VCTK-Corpus-0.92/rough_voice
where they follow the same file structure as the VCTK database,
that is, <speaker_id>/<speaker_id>_<file_id:3d>.txt.
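As a rough illustration of this layout, the path of an annotation file could be built as sketched below. The function name `annotation_path` is hypothetical and is not taken from src/files.py. The sketch also assumes that `<file_id:3d>` means a zero-padded three-digit number, as in the VCTK file names.

```python
from pathlib import Path


def annotation_path(root, speaker_id, file_id):
    """Build <root>/<speaker_id>/<speaker_id>_<file_id>.txt.

    Assumption: the file id is zero-padded to three digits (e.g. 1 -> "001").
    """
    return Path(root) / speaker_id / f"{speaker_id}_{file_id:03d}.txt"


# Example with a VCTK-style speaker id; which files are actually annotated
# depends on the repository contents.
path = annotation_path("VCTK-Corpus-0.92/rough_voice", "p225", 1)
print(path.as_posix())
```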
The annotation files are space separated text files with two columns, time and value.
| Column | Description |
|---|---|
| TIME | Time (s) |
| VALUE | Rough voice annotation: either 0 (no rough voice) or 1 (rough voice) |
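For readers who want to parse the files without the repository code, a minimal sketch follows. The names `parse_annotation` and `rough_segments` are hypothetical and do not reflect the API of src/files.py. The segment extraction rests on an assumption: a rough voice segment runs from the first row with value 1 to the next row with value 0. Please check this against the actual files before relying on it.

```python
def parse_annotation(text):
    """Parse space-separated "time value" lines into (float, int) tuples.

    Lines that do not consist of exactly two numeric fields (blank lines,
    a possible header) are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((float(parts[0]), int(float(parts[1]))))
        except ValueError:
            continue  # non-numeric line, e.g. a header
    return rows


def rough_segments(rows):
    """Return (onset, offset) pairs in seconds.

    Assumption: a segment starts at the first row with value 1 and ends at
    the next row with value 0. A segment still open at the end of the file
    is closed at the last time stamp.
    """
    segments = []
    start = None
    for time, value in rows:
        if value == 1 and start is None:
            start = time
        elif value == 0 and start is not None:
            segments.append((start, time))
            start = None
    if start is not None:
        segments.append((start, rows[-1][0]))
    return segments


# Made-up example data, not taken from the repository.
example = "0.0 0\n0.5 1\n0.8 1\n1.2 0\n"
print(rough_segments(parse_annotation(example)))
```

To read a real file, pass the result of `Path(...).read_text()` to `parse_annotation`.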
Convenience functions are provided in src/files.py,
allowing you to read and parse the annotation files
and get the corresponding audio files.
To verify that the annotations can be read and work properly,
you can run the script src/plot.py
which plots the spectrograms of the audio files
and highlights the rough voice sections in colour.
It should produce something like this:
The annotations were created by a mix of acoustic and visual inspection of the audio files and their spectrograms.
Our goal was to annotate three types of rough voice:
- Subharmonics
- Jitter
- Shimmer
Under good conditions (low noise, long segments), subharmonics can be clearly distinguished from structural noise (jitter and shimmer), but many cases were not that clear. Therefore, the annotation does not distinguish between these types of rough voice.
We used the spectrogram to detect possible candidates for rough voice. The candidates were then analysed by listening and by inspecting the time domain signal. We only annotated subharmonics where the rough voice was clearly audible or where irregular (missing) pulses could be found in the time domain.
Several phenomena can look very similar to rough voice in the spectrogram because they are noisy, but do not qualify as rough voice. In particular, we exclude:
- Onsets: Onsets in general can have a wide frequency spectrum that can be mistaken for subharmonics.
- Trills: Trills can introduce additional periodicity, but its origin is not in the glottis, so they do not count as rough voice.
- Voiced fricatives / transitions between vowels and consonants: Additive noise stemming from fricatives or affricates can look like jitter or shimmer when superimposed on phonation. As this noise is additive and not structural, it does not indicate irregular phonation.
The on- and offsets of the rough voice segments were not easy to determine. Some of the rough voice segments have only one pulse missing. Other cases were difficult to spot in the time domain and were only visible with large window sizes. Both cases make temporal precision difficult. We therefore added a small margin of a few periods, giving a larger region in which rough voice may be detected.
