Unsupervised Part-of-Speech Acquisition Dataset

This page distributes the part-of-speech dataset for Bengali. The dataset was introduced in the following paper:

Unsupervised Part-of-Speech Acquisition for Resource Scarce Languages. Sajib Dasgupta and Vincent Ng. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, 2007.

Here are the files:

Bengali Dataset (Transliterated) : Total instances: 5000.

Bengali Dataset (Original) : Total instances: 5000.

Tagset : The Bengali tagset used in the EMNLP paper.

Mapping : The transliteration scheme we used to map Bengali characters to English characters.