Unsupervised Part-of-Speech Acquisition Dataset
This page distributes a part-of-speech dataset for Bengali.
The dataset was introduced in the following paper:
Unsupervised Part-of-Speech Acquisition for Resource Scarce Languages.
Sajib Dasgupta and Vincent Ng. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, 2007.
Here are the files:
Bengali Dataset (Transliterated): 5000 instances in total.
Bengali Dataset (Original): 5000 instances in total.
Tagset: The Bengali tagset used in the EMNLP paper.
Mapping: The transliteration table we used to map Bengali characters to English letters.
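For readers who want to move between the original and transliterated files, here is a minimal sketch of applying a character mapping in Python. The layout of the Mapping file is not described on this page, so the tab-separated "source, target" format below is an assumption, the function names load_mapping and transliterate are hypothetical, and the three sample pairs are toy values chosen for illustration, not entries taken from the actual Mapping file.

```python
def load_mapping(lines):
    """Parse 'source<TAB>target' lines into a dict (assumed format, see above)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        source, sep, target = line.partition("\t")
        # Skip blank lines and lines without a tab or without a source string.
        if not sep or not source:
            continue
        mapping[source] = target
    return mapping


def transliterate(text, mapping):
    """Replace mapped substrings, trying longer sources first.

    Characters with no entry in the mapping are copied through unchanged.
    """
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping entry starts at this position.
            out.append(text[i])
            i += 1
    return "".join(out)


# Toy pairs for illustration only; the real table is in the Mapping file.
toy_lines = ["ক\tk\n", "খ\tkh\n", "া\ta\n"]
toy_mapping = load_mapping(toy_lines)
print(transliterate("কাখ", toy_mapping))
```

Trying longer source strings first matters only if the mapping contains multi-character sources; with a purely one-character table the loop reduces to a plain lookup per character.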