Bridging Information Science and Deep Learning: Transformer Models for Isolated Saudi Sign Language Recognition
|
PP: 1345-1357
|
doi:10.18576/amis/190609
|
|
Author(s)
|
Soukeina Elhassen, Lama Al Khuzayem
|
|
Abstract
|
Sign Language (SL) is the primary communication method for deaf and hard-of-hearing individuals. This underscores the need for advanced technologies that bridge the communication gap between SL users and the hearing community. Saudi Sign Language (SSL), the main SL used in Saudi Arabia, lacks large-scale isolated datasets, posing challenges for developing recognition models that perform well on small to medium-sized data. Most state-of-the-art approaches for Arabic Sign Language (ArSL) in general, and SSL in particular, rely on Convolutional Neural Network (CNN) architectures, which often struggle to capture long-range temporal dependencies. In contrast, this paper establishes a benchmark for isolated SSL recognition using transformer-based video models. Specifically, we evaluate three state-of-the-art architectures (Swin Transformer, VideoMAE, and TimeSformer) on the King Saud University Arabic Sign Language (KSU-ArSL) dataset. All models are pre-trained on the Kinetics-400 dataset and fine-tuned on 16-frame RGB clips. The Swin Transformer achieved the highest accuracy at 97.50%, followed by VideoMAE at 95.25% and TimeSformer at 93.44%. Despite challenges posed by visually similar signs, these results demonstrate the superior effectiveness of transformer networks over CNNs in sign language recognition. Future work will focus on signer-independent evaluation and continuous SSL recognition to build more generalizable systems and improve accessibility for the Saudi deaf community.
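
The recipe summarized above (Kinetics-400 pre-training followed by fine-tuning on 16-frame RGB clips) can be illustrated with a minimal sketch. This is not the authors' code: the Hugging Face checkpoint name, the number of sign classes, and the dummy frames are placeholder assumptions, used only to show the general fine-tuning pattern for one of the evaluated models (VideoMAE).

```python
# Minimal sketch: fine-tuning a Kinetics-400-pretrained VideoMAE on 16-frame RGB clips.
# Checkpoint, class count, and data are illustrative assumptions, not the paper's setup.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

NUM_CLASSES = 300  # assumption: set to the number of isolated SSL signs in the dataset
CHECKPOINT = "MCG-NJU/videomae-base-finetuned-kinetics"  # Kinetics-400 pre-trained weights

processor = VideoMAEImageProcessor.from_pretrained(CHECKPOINT)
model = VideoMAEForVideoClassification.from_pretrained(
    CHECKPOINT,
    num_labels=NUM_CLASSES,
    ignore_mismatched_sizes=True,  # replace the Kinetics-400 head with a fresh classification head
)

# One 16-frame RGB clip; random uint8 arrays stand in for real video frames.
clip = [np.random.randint(0, 256, (3, 224, 224), dtype=np.uint8) for _ in range(16)]
inputs = processor(clip, return_tensors="pt")  # pixel_values shaped (1, 16, 3, 224, 224)
labels = torch.tensor([0])                     # placeholder sign-class index

outputs = model(**inputs, labels=labels)       # returns cross-entropy loss and per-class logits
outputs.loss.backward()                        # one fine-tuning step (optimizer and loop omitted)
```

The same pattern applies to the other evaluated backbones by swapping in the corresponding Kinetics-400 checkpoint and model class, with a new classification head sized to the number of isolated signs.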
|
|
|
|