Research on Several Key Technologies of NLP for Low-Resource Pashto Language

Online Social Networks (OSNs) have revolutionized communication but also brought challenges like hate speech, cyberbullying, and offensive content. While Natural Language Processing (NLP) helps detect such abuse, most research focuses on high-resource languages like English, leaving low-resource languages like Pashto, spoken by over 50 million people, largely unexplored.

Bridging the Gap in Pashto NLP

My PhD research pioneers NLP for Pashto by developing four key models from scratch, addressing critical gaps:

  1. Space Correction & Tokenization
    • Pashto lacks standardized whitespace rules, leading to poor tokenization.
    • We built a BERT-based model to predict correct whitespace positions, improving text processing.
  2. Word Segmentation
    • Many Pashto words are compounds whose parts can look like separate words (multi-part expressions).
    • We developed a WordPiece-based BERT model to accurately segment full words, crucial for tasks like POS tagging.
  3. Part-of-Speech (POS) Tagging
    • Created the first Pashto POS tagset (36 categories) and an annotated corpus.
    • Designed a hybrid POS tagger combining BERT embeddings with lexical features for better accuracy.
  4. Offensive Language Detection
    • Built a BERT-based classifier enhanced with POS tags to detect toxic content.
    • This model serves as a benchmark for future Pashto NLP tasks.
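
To make the space-correction idea in item 1 more concrete, here is a minimal sketch of one way "predicting whitespace positions" can be framed as a labeling problem: each non-space character gets a label saying whether a space should follow it. This framing, the function names `to_labels` and `apply_labels`, and the Latin placeholder string are assumptions made for illustration. They are not taken from the thesis, and the BERT model that would actually predict the labels is not shown.

```python
# Illustrative sketch only: whitespace correction framed as per-character labeling.
# The function names and the label scheme are hypothetical, not the author's code.

def to_labels(text):
    """Convert correctly spaced text into (chars, labels).

    labels[i] is 1 if a space should follow chars[i], otherwise 0.
    A model would be trained to predict these labels from the characters.
    """
    chars, labels = [], []
    for ch in text.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the previous character as "space follows"
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def apply_labels(chars, labels):
    """Rebuild a spaced string from characters and (predicted) labels."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab == 1:
            out.append(" ")
    return "".join(out).strip()


# Placeholder Latin text stands in for Pashto; the logic is script-independent.
chars, labels = to_labels("ab cd e")
restored = apply_labels(chars, labels)
print(chars, labels, restored)
```

In this setup, badly spaced input is first stripped of all spaces, the model predicts one label per character, and `apply_labels` reinserts the spaces. The same round trip on gold text yields training pairs without extra annotation.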

Beyond Models: Building Resources for Pashto NLP

Since Pashto lacks existing tools, we:

  • Pretrained PsBERT, the first monolingual Pashto BERT model.
  • Generated static word embeddings (Word2Vec, fastText, GloVe).
  • Compiled a 30-million-word corpus and annotated datasets.
  • Packaged all resources into NLPashto, an open-source Python toolkit.

Why This Matters

This research is a foundational step for Pashto NLP, enabling future work in machine translation, sentiment analysis, and more. By addressing online abuse detection, we also contribute to safer digital spaces for Pashto speakers.

ijazul.haq@outlook.com