AI-Driven Lexicography: Building Intelligent Urdu Dictionaries Using NLP

Authors

  • Ijaz Hussain Ijaz, The University of Lahore, Sargodha Campus
  • Ms Sarwat Suhail

Abstract

Developing fully featured lexical databases for relatively low-resource languages such as Urdu presents researchers with persistent difficulties, particularly due to the language’s intricate morphology, limited high-quality training data, and the rapid evolution of how speakers use the language online. This paper introduces an AI-driven framework for creating adaptive Urdu dictionaries using state-of-the-art NLP processes, including lemmatization, part-of-speech (POS) tagging, word sense disambiguation (WSD), and synthetic data generation. To feed the system, the authors assembled a multi-domain corpus that combines formal literary material, user-generated social media posts, and code-mixed Roman Urdu examples; this text was then processed by a transformer pipeline fine-tuned to capture Urdu’s linguistic characteristics. The resulting dictionary covers 92.3 per cent of academic and newspaper vocabulary and 87.6 per cent of more casual or spoken expressions, and the associated NLP tools report F1 scores of 93.5 for lemmatization, 94.8 for POS tagging, and 91.3 for WSD. Further testing on practical applications, namely machine translation, where scores reached 32.4, and sentiment classification, which reached an F1 of 88.6, showed clear performance gains over previously established benchmarks. Urdu language experts and everyday users praised both the system’s accuracy and its overall usability, although they also pointed to a need for broader representation of regional dialects. Compared with earlier research, the present framework marks a significant step forward in Urdu lexicography by combining contextual word embeddings with live, continually updating data streams, thereby offering a model that is scalable and applicable to other under-resourced languages.
These results highlight the power of AI-enhanced dictionary making to bolster linguistic variety while expanding the horizons of NLP tools designed for Urdu in an increasingly digital world.

Published

2025-09-30