Advancing Pashto OCR: Introducing PsOCR and Benchmarking Large Multimodal Models

Optical Character Recognition (OCR) is a cornerstone of digitization, enabling machines to convert scanned documents and images into editable, searchable text. While OCR technology has matured for widely spoken languages, low-resource languages like Pashto, which is written in a cursive Perso-Arabic script, lag behind. Our recent paper, “PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language,” addresses this gap by introducing the first comprehensive Pashto OCR dataset and evaluating state-of-the-art models to pave the way for future research.

The Challenge of Pashto OCR

Pashto, spoken by over 50 million people, presents distinct hurdles for OCR:

Cursive Script: Letters change shape based on position (initial, medial, final, or isolated), and ligatures further complicate recognition.
Inconsistent Diacritics: Optional diacritical marks lead to varied spellings of the same word.
Data Scarcity: Limited annotated datasets hinder model development.

These factors make Pashto OCR uniquely challenging compared to Latin or even other Perso-Arabic scripts.
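To make the first two hurdles concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It uses the letter BEH (ب), which Pashto shares with Arabic, to show the four positional forms that a single letter can take, and it shows how optional diacritics can be stripped so that variant spellings compare equal. The helper names (`describe_forms`, `fold_forms`, `strip_diacritics`) are ours for illustration and are not part of PsOCR.

```python
import unicodedata

# Base letter BEH and its four Unicode presentation forms
# (isolated, final, initial, medial). A renderer picks one of these
# shapes depending on where the letter sits in a word.
BEH = "\u0628"
FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe_forms(forms):
    """Return the Unicode name of each presentation-form code point."""
    return [unicodedata.name(ch) for ch in forms]


def fold_forms(text):
    """Map presentation forms back to their base letters using NFKC."""
    return unicodedata.normalize("NFKC", text)


def strip_diacritics(text):
    """Drop combining marks (category Mn), such as optional vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for name in describe_forms(FORMS):
        print(name)
    # All four shapes fold back to the same base letter.
    print(all(fold_forms(form) == BEH for form in FORMS))
    # BEH followed by FATHA (an optional mark) reduces to plain BEH.
    print(strip_diacritics("\u0628\u064E") == BEH)
```

An OCR model sees only the rendered shapes, so it has to learn that all four glyphs map to one character, and that a word with and without its marks is the same word.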

Introducing PsOCR: A Synthetic Dataset for Pashto

To tackle data scarcity, we developed PsOCR, a synthetic dataset of one million annotated images with:

Diverse Fonts: 1,000 unique Pashto-compatible font families.
Varied Layouts: Multiple text alignments, line heights, and color schemes (light/dark themes).
Granular Annotations: Bounding boxes at page, line, and token levels for robust model training.

PsOCR is designed to approximate real-world variability so that models trained on it generalize better to practical scenarios. A curated 10K-image benchmark subset enables standardized evaluation of OCR systems.
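As a rough illustration of what page-, line-, and token-level annotations can look like, the sketch below models them as nested Python dataclasses. The field names and the `(left, top, right, bottom)` pixel convention are assumptions made for this example; the actual PsOCR annotation schema may differ, so check the dataset files before relying on it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def union_boxes(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in the list."""
    if not boxes:
        raise ValueError("cannot take the union of zero boxes")
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


@dataclass
class Token:
    text: str
    box: Box


@dataclass
class Line:
    # Tokens are stored in reading order (right to left on the page).
    tokens: List[Token] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(token.text for token in self.tokens)

    @property
    def box(self) -> Box:
        return union_boxes([token.box for token in self.tokens])


@dataclass
class Page:
    lines: List[Line] = field(default_factory=list)

    @property
    def box(self) -> Box:
        return union_boxes([line.box for line in self.lines])


if __name__ == "__main__":
    line = Line([Token("پښتو", (120, 10, 180, 40)), Token("ژبه", (60, 12, 110, 38))])
    print(line.text, line.box)
```

Deriving the line and page boxes from the token boxes keeps the three levels consistent with each other, which is one reason a synthetic pipeline can offer this granularity cheaply.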

Benchmarking Large Multimodal Models

We evaluated 11 LMMs (7 open-source, 4 proprietary) in a zero-shot setting to assess their innate Pashto OCR capabilities. Key findings:

Gemini outperformed all other models, achieving 89.92% character accuracy and 69.5% word accuracy.
Among open-source models, Qwen-7B excelled, demonstrating strong potential for fine-tuning.
Font diversity and line spacing significantly impacted performance, highlighting the need for robust training data.

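Notably, even the best model, Gemini, dropped from 89.92% character accuracy to 69.5% word accuracy. A word counts as correct only when every character in it is right, so a single misread letter is enough to lose the whole word, underscoring the complexity of Pashto’s script.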
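The gap between the two metrics is easy to reproduce on a toy example. The sketch below assumes that accuracy is defined as one minus the normalized edit distance, computed over characters for character accuracy and over whitespace-separated tokens for word accuracy. This is a common convention, but it is our assumption here and not necessarily the exact formula used in the paper; the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref_units, hyp_units):
    """Return 1 - distance / reference length, clipped at zero."""
    if not ref_units:
        return 1.0 if not hyp_units else 0.0
    distance = edit_distance(ref_units, hyp_units)
    return max(0.0, 1.0 - distance / len(ref_units))


def char_accuracy(ref, hyp):
    return accuracy(list(ref), list(hyp))


def word_accuracy(ref, hyp):
    return accuracy(ref.split(), hyp.split())


if __name__ == "__main__":
    reference = "پښتو ژبه"
    prediction = "پښتو ژبي"  # only the final letter is misread
    print(char_accuracy(reference, prediction))  # 0.875
    print(word_accuracy(reference, prediction))  # 0.5
```

One wrong letter out of eight characters costs 12.5 points of character accuracy but half of the word accuracy in this two-word line, which mirrors the pattern seen in the benchmark results.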

Future Directions

PsOCR is a foundational step toward bridging the OCR gap for Pashto. Next, we plan to:

Expand the dataset to include handwritten samples and natural scene text.
Develop benchmarks for visual question answering (VQA) and document understanding.
Explore fine-tuning open-source models like Qwen for domain-specific applications.

Access the Dataset

The PsOCR benchmark (10K images) is publicly available on Hugging Face and Kaggle, while the full training set (1M images) can be requested via email.

Join the Effort

By open-sourcing PsOCR, we invite researchers and developers to build upon this work. Together, we can unlock Pashto’s potential in the digital world, from preserving historical texts to enabling real-time translation.

Explore the paper and dataset: GitHub Repository. Let’s advance OCR for low-resource languages!


ijazul.haq@outlook.com
