Online Social Networks (OSNs) have revolutionized communication but also brought challenges like hate speech, cyberbullying, and offensive content. While Natural Language Processing (NLP) helps detect such abuse, most research focuses on high-resource languages like English, leaving low-resource languages like Pashto, spoken by over 50 million people, largely unexplored.
Bridging the Gap in Pashto NLP
My PhD research pioneers NLP for Pashto by developing four key models from scratch, addressing critical gaps:
- Space Correction & Tokenization
  - Pashto lacks standardized whitespace rules, which leads to poor tokenization.
  - We built a BERT-based model that predicts correct whitespace positions, improving downstream text processing (a sequence-labelling sketch follows this list).
- Word Segmentation
  - Many Pashto words are compounds that span multiple space-separated parts.
  - We developed a WordPiece-based BERT model that accurately segments full words, which is crucial for tasks like POS tagging.
- Part-of-Speech (POS) Tagging
  - Created the first Pashto POS tagset (36 categories) and an annotated corpus.
  - Designed a hybrid POS tagger that combines BERT embeddings with lexical features for better accuracy.
- Offensive Language Detection
  - Built a BERT-based classifier enhanced with POS tags to detect toxic content (a sketch of this feature-combination pattern also follows the list).
  - This model serves as a benchmark for future Pashto NLP tasks.
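
Both the space-correction and word-segmentation models can be framed as sequence labelling: a BERT encoder tags each subword with whether a boundary (a space, or a word break) should follow it. The snippet below is a minimal sketch of that framing using the Hugging Face transformers API; the checkpoint name, label convention, and post-processing are illustrative assumptions, not the exact NLPashto implementation.

```python
# Minimal sketch: space correction framed as token classification.
# The checkpoint name below is a placeholder, and "label 1 means a space
# follows this subword" is an assumed label convention, not the exact
# NLPashto setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "path/to/pashto-space-correction"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

def correct_spaces(text: str) -> str:
    """Re-insert whitespace by tagging each subword with a boundary label."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                # (1, seq_len, num_labels)
    labels = logits.argmax(dim=-1)[0].tolist()      # one label per subword

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    pieces = []
    for token, label in zip(tokens, labels):
        if token in tokenizer.all_special_tokens:
            continue
        pieces.append(token.replace("##", ""))      # strip WordPiece marker
        if label == 1:                              # assumed: 1 == space follows
            pieces.append(" ")
    return "".join(pieces).strip()
```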
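
The hybrid POS tagger and the offensive-language classifier both enrich BERT with an extra discrete feature stream (lexical features for tagging, POS tags for toxicity). A common way to do this is to embed the auxiliary features and concatenate them with the BERT representation before the output layer; the sketch below shows that pattern for a sentence-level toxicity classifier, with dimensions and names chosen for illustration rather than taken from the thesis.

```python
# Sketch: a BERT-based classifier whose [CLS] representation is concatenated
# with a pooled embedding of the sentence's POS tags before the classification
# head. Dimensions, label counts, and names are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class PosAugmentedClassifier(nn.Module):
    def __init__(self, bert_name: str, num_pos_tags: int = 36,
                 pos_dim: int = 32, num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.pos_embedding = nn.Embedding(num_pos_tags, pos_dim)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden + pos_dim, num_labels)

    def forward(self, input_ids, attention_mask, pos_ids):
        # Contextual sentence representation from the [CLS] position.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]               # (batch, hidden)
        # Mean-pool the POS-tag embeddings of the tokens in the sentence.
        pos = self.pos_embedding(pos_ids).mean(dim=1)   # (batch, pos_dim)
        return self.classifier(torch.cat([cls, pos], dim=-1))
```

For the POS tagger itself, the same concatenation idea would apply at the token level, with per-token lexical-feature embeddings joined to per-token BERT states before a token-level classification layer.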
Beyond Models: Building Resources for Pashto NLP
Since Pashto lacked existing NLP tools and resources, we:
- Pretrained PsBERT, the first monolingual Pashto BERT model (a loading sketch follows this list).
- Generated static word embeddings (Word2Vec, fastText, GloVe).
- Compiled a 30-million-word corpus and annotated datasets.
- Packaged all resources into NLPashto, an open-source Python toolkit.
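
Assuming the pretrained encoder and static vectors are published in standard formats, they load with the usual Hugging Face and gensim tooling; the identifiers and file path below are placeholders, so check the NLPashto documentation for the actual names.

```python
# Sketch: loading the contextual encoder and a static embedding file with
# standard tooling. The identifiers below are placeholders; the published
# names are in the NLPashto documentation.
from transformers import AutoModel, AutoTokenizer
from gensim.models import KeyedVectors

# PsBERT, the monolingual Pashto encoder -- placeholder model id.
tokenizer = AutoTokenizer.from_pretrained("your-org/psbert")
psbert = AutoModel.from_pretrained("your-org/psbert")

# Static vectors (Word2Vec/fastText style) -- placeholder path and format.
vectors = KeyedVectors.load_word2vec_format("pashto_vectors.txt", binary=False)
print(vectors.most_similar("کتاب"))  # nearest neighbours of the word for "book"
```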
Why This Matters
This research is a foundational step for Pashto NLP, enabling future work in machine translation, sentiment analysis, and more. By addressing online abuse detection, we also contribute to safer digital spaces for Pashto speakers.

