Detecting Offensive Language in Pashto: A Breakthrough in NLP for Low-Resource Languages

In the digital age, social media platforms are flooded with offensive content, posing challenges for maintaining a healthy online environment. While significant progress has been made in detecting toxic language in major languages like English, low-resource languages such as Pashto have remained largely unexplored. Our recent study titled “Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT” addresses this gap, introducing innovative solutions for identifying offensive content in Pashto.

The Challenge and the Solution

Pashto, spoken by millions, lacks the resources for advanced Natural Language Processing (NLP) tasks. To tackle this, researchers developed the Pashto Offensive Language Dataset (POLD), a manually annotated collection of tweets labeled as “offensive” or “not offensive.” This dataset serves as a benchmark for training and evaluating models.

The study explored two approaches:

  1. Deep Learning Models: Classic neural networks like CNNs, LSTMs, and GRUs were tested with static word embeddings (Word2Vec, fastText, GloVe).
  2. Transfer Learning: A pre-trained multilingual model (XLM-R) and a custom monolingual Pashto BERT (Ps-BERT) were fine-tuned for the task.
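The value of sub-word information (the reason fastText appears among the static embeddings above) can be illustrated with a toy character n-gram comparison. This is a minimal pure-Python sketch of the idea, not the study's actual embedding code; the example words are invented:

```python
def char_ngrams(word, n=3):
    # fastText-style: pad with boundary markers so prefix and suffix
    # n-grams are distinct from word-internal ones
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Overlap between two n-gram sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# An altered spelling still shares many character n-grams with the
# original word, so a sub-word model can place it nearby in embedding
# space even though the exact token never appeared in training data.
orig, altered, unrelated = "stupid", "stup1d", "flower"
print(jaccard(char_ngrams(orig), char_ngrams(altered)))    # noticeable overlap
print(jaccard(char_ngrams(orig), char_ngrams(unrelated)))  # no overlap
```

Word-level embeddings like Word2Vec and GloVe treat "stup1d" as an unknown token; sub-word models recover a usable representation from the shared n-grams, which is why fastText handles misspelled or deliberately altered offensive words more gracefully.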

Key Findings

  • Ps-BERT outperformed all models, achieving an impressive F1-score of 94.34% and accuracy of 94.77%.
  • fastText embeddings paired with an LSTM delivered strong results (F1-score: 93.08%), highlighting the value of sub-word information for detecting altered or misspelled offensive words.
  • The study revealed that monolingual models, even with limited data, can surpass multilingual counterparts for language-specific tasks.
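For readers unfamiliar with the reported metrics: accuracy is the fraction of correct predictions, while the F1-score balances precision and recall. A small sketch with made-up confusion-matrix counts (not taken from the POLD experiments):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy and F1-score from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted-offensive, how many were right
    recall = tp / (tp + fn)      # of truly offensive, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Illustrative counts only.
acc, f1 = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(f"accuracy={acc:.2%}, F1={f1:.2%}")  # accuracy=90.00%, F1=90.00%
```

Because the F1-score penalizes both missed offensive tweets (low recall) and over-flagging benign ones (low precision), it is the more informative number when the two classes are imbalanced.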

Implications and Future Work

This research is a milestone for Pashto NLP, providing essential resources, POLD and Ps-BERT, both publicly available for further development. The success of Ps-BERT demonstrates the potential of tailored models for low-resource languages. Future work could expand the dataset and address challenges like false positives in poetic or ambiguous texts.

By advancing offensive language detection in Pashto, this study paves the way for safer online spaces and inspires similar efforts for other underrepresented languages.


Paper link: doi.org/10.7717/peerj-cs.1617.

