Pashto, a low-resource language spoken by millions, presents unique challenges in NLP, particularly in word segmentation. Unlike English, where whitespace reliably marks word boundaries, Pashto uses whitespace inconsistently, leading to frequent space-omission and space-insertion errors. These errors complicate downstream tasks such as machine translation, named entity recognition, and information extraction. Our recent paper, “Correction of Whitespace and Word Segmentation in Noisy Pashto Text Using CRF,” addresses these challenges by introducing a state-of-the-art proofing tool and word segmenter for Pashto.
The Challenge of Pashto Word Segmentation
Pashto’s script and morphology add layers of complexity. Written in the Perso-Arabic script, Pashto has both joiner letters, which connect to the following letter, and non-joiner letters, which do not. Because a non-joiner looks the same whether or not a space follows it, omitting that space barely disrupts readability for humans yet merges two words into a single token for NLP systems. Conversely, inserting unnecessary spaces splits words into meaningless fragments. These inconsistencies, combined with the lack of standardized spelling rules, make automated word segmentation exceptionally difficult.
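To make the joining behavior concrete, the sketch below flags positions where a space may have gone missing after a non-joining letter. It is a simplified illustration: the letter set is a partial, assumed list of Pashto non-joiners, and real orthography needs far more context than this heuristic uses.

```python
# Simplified sketch: flag spots where a space may have been omitted
# after a non-joining letter. NON_JOINERS is an illustrative, partial
# set of Perso-Arabic non-joining letters, not an exhaustive list.
NON_JOINERS = set("اآدډذرړزژږوؤ")

def candidate_omission_points(text: str):
    """Yield indices where a non-joining letter is immediately followed
    by another letter, i.e. places a word boundary could be hiding
    without any visible change to the script's shaping."""
    for i in range(len(text) - 1):
        ch, nxt = text[i], text[i + 1]
        if ch in NON_JOINERS and nxt.isalpha():
            yield i + 1
```

A heuristic like this only proposes candidates; deciding whether a boundary actually belongs at each one is exactly the sequence-labeling problem the CRF models address.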
Our Solution: CRF-Based Models
To tackle these issues, we developed two machine learning models based on Conditional Random Fields (CRF), a probabilistic model well suited to sequence labeling. The first model acts as a proofing tool, correcting space-omission and space-insertion errors by analyzing character-level context. The second, a dedicated word segmenter, identifies compound words and proper nouns by examining token-level features such as prefixes, suffixes, and surrounding context.
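To illustrate the character-level setup, here is a minimal sketch using the sklearn-crfsuite library. The feature template, label scheme (K = keep, I = insert a space after, D = delete this space), and toy data are assumptions for illustration, not the paper’s exact design; the English stand-in string merely mimics a space-omission error.

```python
# Minimal character-level CRF sketch with sklearn-crfsuite
# (pip install sklearn-crfsuite). Features and labels are illustrative.
import sklearn_crfsuite

def char_features(text, i):
    """Features for the i-th character: the character itself, whether it
    is a space, and a window of two neighbors on each side."""
    feats = {"char": text[i], "is_space": text[i] == " "}
    for off in (-2, -1, 1, 2):
        j = i + off
        feats[f"char{off:+d}"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats

# Toy training pair standing in for annotated Pashto text:
# "thecat sat" should become "the cat sat", so the 'e' at index 2 gets
# label "I" (insert a space after it); everything else is "K" (keep).
noisy = "thecat sat"
labels = ["K", "K", "I", "K", "K", "K", "K", "K", "K", "K"]

X = [[char_features(noisy, i) for i in range(len(noisy))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))  # recovers the labels it was trained on
```

The word segmenter follows the same pattern at the token level, swapping character windows for prefix, suffix, and neighboring-token features.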
Building a Benchmark Dataset
A key contribution of our work is the creation of a large, annotated Pashto corpus. This dataset, comprising nearly 3.5 million words, was meticulously labeled for correct whitespace usage and word boundaries using a lexicon-based approach followed by manual verification. The corpus serves as a valuable resource for future research in Pashto NLP, addressing the scarcity of high-quality linguistic data for low-resource languages.
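As an illustration of what a lexicon-based first pass might look like, the sketch below proposes word boundaries by greedy longest-match against a lexicon. This is an assumed simplification, not necessarily the procedure used for the corpus; its mistakes (the longest match is not always the right one) are precisely why an automatic pass needs the manual verification step.

```python
# Hypothetical lexicon-based first pass: greedy longest-match
# segmentation whose output annotators then verify by hand.
def greedy_segment(text, lexicon, max_len=20):
    """Return proposed words via longest-match, falling back to single
    characters (flagged for review) when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon:
                words.append(text[i:i + length])
                i += length
                break
        else:
            words.append(text[i])  # unknown; leave for manual review
            i += 1
    return words

# Toy usage with an English stand-in lexicon:
print(greedy_segment("thecatsat", {"the", "cat", "sat", "thecat"}))
# -> ['thecat', 'sat']: longest match wins here, and a human reviewer
#    would correct it to ['the', 'cat', 'sat'].
```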
Implications and Future Work
Our models significantly improve the accuracy of Pashto text processing, enabling more reliable downstream NLP applications. The proofing tool ensures clean input text, while the segmenter accurately identifies word boundaries, even in noisy or informal contexts. Looking ahead, we plan to explore advanced algorithms, such as transformer-based models, and expand our toolkit to support other NLP tasks like machine translation and sentiment analysis.
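In practice, the two models compose into a simple pipeline: proofing first, segmentation second. The function names below are hypothetical placeholders, not the toolkit’s actual API.

```python
# Hypothetical composition of the two CRF models; `correct_whitespace`
# and `segment_words` are placeholder names, not the real toolkit API.
def process(raw_text, correct_whitespace, segment_words):
    cleaned = correct_whitespace(raw_text)  # fix space errors first
    return segment_words(cleaned)           # then mark word boundaries
```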
This research marks a critical step forward in Pashto NLP, bridging the gap for a language long underserved by computational tools. By making our models and dataset publicly available, we hope to inspire further innovation and collaboration in this emerging field.
For more details, check out our paper and the accompanying Pashto NLP toolkit on GitHub and PyPI. Let’s work together to unlock the potential of Pashto in the digital age!
Full Paper: https://doi.org/10.1016/j.specom.2023.102970

