NLPashto: NLP Toolkit for Low-resource Pashto Language

Natural Language Processing (NLP) has revolutionized communication and technology, but many low-resource languages, like Pashto, remain underserved. Pashto, spoken by over 50 million people, lacks essential NLP tools and resources. Addressing this gap, researchers from Shanghai Jiao Tong University have introduced NLPashto, an open-source toolkit designed specifically for Pashto text processing.

What is NLPashto?

NLPashto is a comprehensive toolkit that provides state-of-the-art models for fundamental NLP tasks, including:

Spelling Correction: Fixes space-insertion and space-omission errors common in Pashto text.
Word Segmentation: Splits text into meaningful units, handling Pashto’s complex morphology.
Part-of-Speech (POS) Tagging: Assigns grammatical categories to words, improving language understanding.
Offensive Language Detection: Identifies harmful content on social media, a first for Pashto.
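
To make the spacing problem concrete, the sketch below re-spaces a string by greedy longest-match against a small word list. It is a toy illustration only, not NLPashto's method: the function name `resegment` and the four-word lexicon are made up for this example, and the toolkit relies on trained models rather than dictionary lookup.

```python
def resegment(text, lexicon):
    """Re-space text by greedy longest match against a word list (toy example)."""
    chars = "".join(text.split())  # drop all existing, possibly wrong, spaces
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(chars):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(chars) - i), 0, -1):
            piece = chars[i:i + size]
            if piece in lexicon:
                break
        else:
            piece = chars[i]  # nothing matched: emit the single character as-is
        words.append(piece)
        i += len(piece)
    return " ".join(words)


correct = "زه کور ته ځم"                     # a short Pashto sentence used as toy data
lexicon = set(correct.split())               # hypothetical four-word lexicon
omitted = correct.replace(" ", "")           # space-omission: words run together
inserted = correct[:5] + " " + correct[5:]   # space-insertion: stray space inside the second word

print(resegment(omitted, lexicon))
print(resegment(inserted, lexicon))
```

Both calls recover the original spacing because every word is in the lexicon. Real text contains unseen and ambiguous words, which is where a lookup like this breaks down and a learned model becomes necessary.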

The toolkit also includes pre-trained word embeddings (Word2Vec, fastText, GloVe) and the first monolingual Pashto BERT model, trained on a custom corpus of 15 million words.

Why is NLPashto Important?

Pashto’s unique challenges—non-standardized spelling, rich morphology, and limited digital resources—have hindered NLP progress. NLPashto tackles these issues with:

Benchmark Datasets: Curated for training and evaluating models.
High Accuracy: The spelling correction model achieves 99.35% accuracy, while the offensive language detector reaches 94.77%.
Open Access: Available on GitHub and PyPI, promoting collaboration and reuse.
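
As a reminder of what such percentages measure, here is a minimal sketch of how accuracy, and the F1-score often reported alongside it, are computed for a binary classifier such as an offensive language detector. The label names and the five-example data are invented for illustration; they are not taken from the NLPashto benchmarks, and the exact evaluation protocol is not described in this post.

```python
def accuracy(gold, pred):
    """Share of predictions that match the gold labels."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and model predictions for five posts.
gold = ["offensive", "normal", "normal", "offensive", "normal"]
pred = ["offensive", "normal", "offensive", "offensive", "normal"]

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"F1       = {f1_score(gold, pred, positive='offensive'):.2f}")
```

Accuracy counts every correct prediction equally, while F1 focuses on one class. The two can diverge sharply when that class is rare, as offensive posts usually are.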

Future Directions

The team plans to expand NLPashto with modules for Named Entity Recognition (NER), dependency parsing, and more. This toolkit marks a significant milestone for Pashto NLP, empowering researchers and developers to build applications like chatbots, translators, and content moderators.

By democratizing access to NLP tools, NLPashto paves the way for innovation in Pashto language technology, ensuring this rich linguistic heritage thrives in the digital age.

Explore NLPashto today and join the effort to advance Pashto NLP!

Links:

PyPI: pypi.org/project/nlpashto

GitHub: github.com/zirak-ai/nlpashto

This paper presents a technique for detecting spammy names associated with fake profiles before any additional user information or history is available. The proposed approach involves analyzing the name field for patterns commonly found in fake accounts. To achieve this, we have developed a supervised machine learning model to discriminate between valid and spammy names. The model is trained on a labeled dataset of 100K instances (words and phrases), manually labeled for two categories, “valid names” and “spammy strings”. For classification, we examined several Machine Learning algorithms, including Naïve Bayes (NB), k-Nearest Neighbor (KNN), SVM, Linear Regression, and Decision Trees. Experimental results show that the Naïve Bayes algorithm performs the best and yields an F1-score of 94.1% with an accuracy of 95.3%.

Full Paper: https://doi.org/10.1109/ICSIP57908.2023.10270845


ijazul.haq@outlook.com
