POS (Parts of Speech) Tagging System for Sindhi Language

  • Ghazala Gul Junejo, Mir Sajjad Hussain Talpur, Taha Nuzhat, and Shakir Hussain Talpur
Keywords: Natural Language Processing (NLP), Machine learning, core (NLP) library, HMM, Stanford POS taggers.

Abstract

Part-of-speech (POS) tagging is a fundamental need for any natural language text processing system. However, building such a classifier is challenging because of the inherent ambiguity of natural languages, where the same word may be used as a different part of speech in different contexts. Several efforts have been made to build such taggers for many international languages, including English, French, German and Arabic. To build a Sindhi text processing system, a POS tagger for the Sindhi language is much needed. Like Arabic, Sindhi is more challenging to tag because of its word morphology. In this thesis, we describe various techniques that are available for POS tagging and discuss why we may or may not opt for them. We then present a brief survey of the efforts made so far on POS tagging of the Sindhi language. In this research we aim to create our own POS tagger for Sindhi by training the well-known Stanford POS tagger on a corpus containing more than 5000 Sindhi words. The performance of the trained POS tagger will be evaluated using a separate test corpus containing 2000 Sindhi words. Manual tagging of words for training purposes in such large corpora (even with the help of semi-automatic tools) is a significant effort in itself and will be retained for later studies.
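
The train-then-test workflow described above, and the ambiguity problem it has to cope with, can be illustrated with a minimal sketch. This is not the Stanford POS tagger and not the authors' method: it is a simple most-frequent-tag baseline, and the function names (train_baseline, tag, accuracy), the tag labels, and the tiny English toy corpus are all hypothetical placeholders standing in for a manually tagged Sindhi corpus.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag for each word seen in training."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
            all_tags[tag_label] += 1
    # Unknown words fall back to the most common tag overall.
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word independently of its context."""
    return [lexicon.get(w, default_tag) for w in words]


def accuracy(tagged_sentences, lexicon, default_tag):
    """Token-level accuracy against a manually tagged test corpus."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        predicted = tag(words, lexicon, default_tag)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Toy placeholder data (assumption): "book" is a verb once and a noun twice.
train = [
    [("they", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("old", "ADJ")],
    [("a", "DET"), ("book", "NOUN")],
]
test = [[("they", "PRON"), ("book", "VERB"), ("the", "DET"), ("book", "NOUN")]]

lexicon, default_tag = train_baseline(train)
score = accuracy(test, lexicon, default_tag)
print(f"token accuracy: {score:.2f}")
```

Because this baseline ignores context, it always assigns "book" the same tag and gets the verb reading wrong. Context-sensitive taggers such as HMM-based ones or the Stanford tagger are meant to resolve exactly this kind of ambiguity.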

Published
2021-03-29
How to Cite
Ghazala Gul Junejo, Mir Sajjad Hussain Talpur, Taha Nuzhat, and Shakir Hussain Talpur. (2021). POS (Parts of Speech) Tagging System for Sindhi Language. International Journal of Computer Science and Emerging Technologies, 4(2), 14-22. Retrieved from http://ijcet.salu.edu.pk/index.php/IJCET/article/view/59