Tokenization and its challenges in Sindhi language
Natural language processing (NLP) is a branch of Artificial Intelligence (AI) that applies computational techniques to the analysis and synthesis of natural language, i.e. the ability to understand spoken and written language. Sindhi has polymorphic characteristics and is an old and linguistically complex language; because of its semantic features, tokenization is a difficult task for Sindhi. Tokenization, also called word segmentation, is the process of splitting text into words or other tokens (numbers, alphabetic strings). This research discusses the issues of tokenization in Sindhi. As in many related languages such as Urdu and Arabic, Sindhi text suffers from space-insertion errors (a spurious space splits one word) and space-omission errors (a missing space fuses two words). It is therefore important to evaluate different corpora with different algorithms; in this research we utilize and develop the J. Mahar (JM) model. When this tokenizer is tested on a corpus of 175,000 words of Sindhi text, the JM tokenizer achieves 96% accuracy.
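To illustrate why space errors matter, the sketch below shows a naive whitespace-and-punctuation tokenizer, a common baseline (this is not the JM model described above, and the function name and sample Sindhi phrase are illustrative assumptions). A space-omission error fuses two words into one token, and a space-insertion error splits one word into two, so any tokenizer that trusts spaces alone inherits both error classes.

```python
import re

def whitespace_tokenize(text):
    # Naive baseline: split on whitespace and common punctuation
    # (including Arabic-script comma, full stop, and question mark).
    # It inherits the two error classes noted above: a missing space
    # fuses two words into one token; a spurious space splits one
    # word into two tokens.
    return [t for t in re.split(r"[\s،۔؟!.,]+", text) if t]

# "سنڌي ٻولي" ("Sindhi language") tokenizes into two tokens.
print(whitespace_tokenize("سنڌي ٻولي"))
```

A more robust tokenizer, such as the JM model evaluated here, must go beyond literal spaces, e.g. by consulting a lexicon or statistical model to detect and repair fused or split words.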