POS (Parts of Speech) Tagging System for Sindhi Language
Part-of-speech (POS) tagging is a fundamental need for any natural language text processing system. However, building such a classifier is quite challenging due to the inherent ambiguity of natural languages, where the same word may be used as a different part of speech in different contexts. Several efforts have been made to build such taggers for many international languages, including English, French, German and Arabic. To build a Sindhi text processing system, a POS tagger for the Sindhi language is much needed. As with Arabic, POS tagging for Sindhi is made more challenging by its word morphology. In this thesis, we describe the various techniques available for POS tagging and discuss why we may or may not opt for each of them. We then present a brief survey of the efforts made so far on POS tagging of the Sindhi language. In this research we aim to create our own POS tagger for Sindhi by training the well-known Stanford POS tagger on a corpus containing more than 5000 Sindhi words. The performance of the trained POS tagger will be evaluated on a separate test corpus containing 2000 Sindhi words. Manual tagging of words for training purposes (even with the help of semi-automatic tools) in corpora of this size is a significant effort in itself and will be retained for later studies.
Copyright (c) 2021 International Journal of Computer Science and Emerging Technologies
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.