Advancing Pashto OCR: Introducing PsOCR and Benchmarking Large Multimodal Models

Optical Character Recognition (OCR) is a cornerstone of digitization, enabling machines to convert scanned documents and images into editable, searchable text. While OCR technology has matured for widely spoken languages,...
Research on Several Key Technologies of NLP for Low Resource Pashto Language

Online Social Networks (OSNs) have revolutionized communication but also brought challenges like hate speech, cyberbullying, and offensive content. While Natural Language Processing (NLP) helps detect such abuse, most research focuses...
Detecting Offensive Language in Pashto: A Breakthrough in NLP for Low-Resource Languages

In the digital age, social media platforms are flooded with offensive content, posing challenges for maintaining a healthy online environment. While significant progress has been made in detecting toxic language...
Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF

Pashto, a low-resource language spoken by millions, presents unique challenges in NLP, particularly in word segmentation. Unlike English, where whitespace reliably marks word boundaries, Pashto uses whitespace inconsistently, leading to...
NLPashto: NLP Toolkit for Low-resource Pashto Language

Natural Language Processing (NLP) has revolutionized communication and technology, but many low-resource languages, like Pashto, remain underserved. Pashto, spoken by over 50 million people, lacks essential NLP tools and resources....