Online Social Networks (OSNs) have revolutionized communication but also brought challenges like hate speech, cyberbullying, and offensive content. While Natural Language Processing (NLP) helps detect such abuse, most research focuses on high-resource languages like English, leaving low-resource languages like Pashto, spoken by over 50 million people, largely unexplored.
Bridging the Gap in Pashto NLP
My PhD research pioneers NLP for Pashto by developing four key models from scratch, addressing critical gaps:
- Space Correction & Tokenization
  - Pashto lacks standardized whitespace rules, which leads to poor tokenization.
  - We built a BERT-based model that predicts correct whitespace positions, improving downstream text processing (a sequence-labelling sketch follows this list).
- Word Segmentation
  - Many Pashto words are compounds that span multiple space-separated parts.
  - We developed a WordPiece-based BERT model that accurately segments full words, which is crucial for tasks like POS tagging.
- Part-of-Speech (POS) Tagging
  - Created the first Pashto POS tagset (36 categories) and an annotated corpus.
  - Designed a hybrid POS tagger that combines BERT embeddings with lexical features for better accuracy.
- Offensive Language Detection
  - Built a BERT-based classifier enhanced with POS tags to detect toxic content (a sketch of this feature-combination pattern also follows the list).
  - This model serves as a benchmark for future Pashto NLP tasks.
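
Both the space-correction and word-segmentation models can be framed as sequence labelling: a BERT encoder tags each subword with whether a boundary (a space, or a word break) should follow it. The snippet below is a minimal sketch of that framing using the Hugging Face transformers API; the checkpoint name, label convention, and post-processing are illustrative assumptions, not the exact NLPashto implementation.

```python
# Minimal sketch: space correction framed as token classification.
# The checkpoint name below is a placeholder, and "label 1 means a space
# follows this subword" is an assumed label convention, not the exact
# NLPashto setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "path/to/pashto-space-correction"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

def correct_spaces(text: str) -> str:
    """Re-insert whitespace by tagging each subword with a boundary label."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                # (1, seq_len, num_labels)
    labels = logits.argmax(dim=-1)[0].tolist()      # one label per subword

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    pieces = []
    for token, label in zip(tokens, labels):
        if token in tokenizer.all_special_tokens:
            continue
        pieces.append(token.replace("##", ""))      # strip WordPiece marker
        if label == 1:                              # assumed: 1 == space follows
            pieces.append(" ")
    return "".join(pieces).strip()
```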
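
The hybrid POS tagger and the offensive-language classifier both enrich BERT with an extra discrete feature stream (lexical features for tagging, POS tags for toxicity). A common way to do this is to embed the auxiliary features and concatenate them with the BERT representation before the output layer; the sketch below shows that pattern for a sentence-level toxicity classifier, with dimensions and names chosen for illustration rather than taken from the thesis.

```python
# Sketch: a BERT-based classifier whose [CLS] representation is concatenated
# with a pooled embedding of the sentence's POS tags before the classification
# head. Dimensions, label counts, and names are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class PosAugmentedClassifier(nn.Module):
    def __init__(self, bert_name: str, num_pos_tags: int = 36,
                 pos_dim: int = 32, num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.pos_embedding = nn.Embedding(num_pos_tags, pos_dim)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden + pos_dim, num_labels)

    def forward(self, input_ids, attention_mask, pos_ids):
        # Contextual sentence representation from the [CLS] position.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # (batch, hidden)
        # Mean-pool the POS-tag embeddings of the tokens in the sentence.
        pos = self.pos_embedding(pos_ids).mean(dim=1)   # (batch, pos_dim)
        return self.classifier(torch.cat([cls, pos], dim=-1))
```

For the POS tagger itself, the same concatenation idea would apply at the token level, with per-token lexical-feature embeddings joined to per-token BERT states before a token-level classification layer.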
Beyond Models: Building Resources for Pashto NLP
Since Pashto lacked existing NLP tools and resources, we:
- Pretrained PsBERT, the first monolingual Pashto BERT model (a loading sketch follows this list).
- Generated static word embeddings (Word2Vec, fastText, GloVe).
- Compiled a 30-million-word corpus and annotated datasets.
- Packaged all resources into NLPashto, an open-source Python toolkit.
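
Assuming the pretrained encoder and static vectors are published in standard formats, they load with the usual Hugging Face and gensim tooling; the identifiers and file path below are placeholders, so check the NLPashto documentation for the actual names.

```python
# Sketch: loading the contextual encoder and a static embedding file with
# standard tooling. The identifiers below are placeholders; the published
# names are in the NLPashto documentation.
from transformers import AutoModel, AutoTokenizer
from gensim.models import KeyedVectors

# PsBERT, the monolingual Pashto encoder -- placeholder model id.
tokenizer = AutoTokenizer.from_pretrained("your-org/psbert")
psbert = AutoModel.from_pretrained("your-org/psbert")

# Static vectors (Word2Vec/fastText style) -- placeholder path and format.
vectors = KeyedVectors.load_word2vec_format("pashto_vectors.txt", binary=False)
print(vectors.most_similar("کتاب"))  # nearest neighbours of the word for "book"
```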
Why This Matters
This research is a foundational step for Pashto NLP, enabling future work in machine translation, sentiment analysis, and more. By addressing online abuse detection, we also contribute to safer digital spaces for Pashto speakers.

