The Automated Method of Collecting and Labeling Data for Speech Emotion Recognition Based on Face Emotion Recognition

PP: 1067-1077

doi:10.18576/amis/190508

Author(s)

Aisultan Shoiynbek,
Darkhan Kuanyshbay,
Paulo Menezes,
Gustavo Assunção,
Bakhtiyor Meraliyev,
Assylbek Mukhametzhanov,
Temirlan Shoiynbek,
Sergey Sklyar

Abstract

Speech Emotion Recognition (SER) is vital for natural and effective human–machine interaction, yet its advancement is constrained by the scarcity of richly annotated emotional speech corpora, the laborious nature of manual labeling, and the difficulty of eliciting genuine expressions. We propose an automated data-collection and labeling pipeline that synchronizes video-based facial emotion recognition (FER) with audio capture to annotate speech recordings according to speakers' natural facial expressions. Applying this method, we processed 1,243 YouTube videos (1,058 hours of raw footage) and extracted 218,359 candidate utterances, which, after FER-guided filtering, yielded a high-quality corpus of 45,459 recordings (33 h 15 min of audio) covering seven basic emotions in Kazakh (15,076 utterances) and Russian (30,383 utterances). A deep neural network trained on the combined dataset achieved 86.84% overall test accuracy on seven-way emotion classification, with per-language accuracies of 89.00% (Kazakh) and 85.20% (Russian); a support vector machine reached 82.47% under the same conditions. By reducing manual annotation effort by more than 80% while keeping labels consistent, our approach offers a scalable, language-agnostic way to build authentic emotional speech datasets and paves the way for more robust, real-world SER systems.
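
To make the FER-guided labeling step concrete, the Python sketch below is an illustrative reconstruction rather than the authors' released code: it assigns an emotion label to a single utterance by majority vote over frame-level FER predictions and discards the clip when no emotion clearly dominates. The predict_face_emotion function, the frame-sampling rate, and the agreement and confidence thresholds are hypothetical placeholders, not values reported in the paper.

    # Illustrative sketch of FER-guided utterance labeling (not the authors' implementation).
    from collections import Counter
    import cv2  # OpenCV, used here only to read frames from the source video

    def predict_face_emotion(frame):
        """Placeholder for any frame-level FER model returning (label, confidence)."""
        raise NotImplementedError

    def label_utterance(video_path, start_s, end_s, frames_per_second=5,
                        min_agreement=0.7, min_confidence=0.6):
        """Label the utterance spanning [start_s, end_s] with the dominant facial
        emotion, or return None when the facial evidence is too weak to keep it."""
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(int(video_fps // frames_per_second), 1)

        votes = []
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t = frame_idx / video_fps
            if start_s <= t <= end_s and frame_idx % step == 0:
                label, confidence = predict_face_emotion(frame)
                if confidence >= min_confidence:
                    votes.append(label)
            frame_idx += 1
        cap.release()

        if not votes:
            return None
        label, count = Counter(votes).most_common(1)[0]
        # Keep the utterance only if one emotion dominates the sampled frames.
        return label if count / len(votes) >= min_agreement else None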