Unsupervised Part-of-Speech Lexicon Induction Output

This page distributes the output of unsupervised part-of-speech (POS) lexicon induction for Bengali. The output was generated by the system introduced in the following paper:

Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages.
Sajib Dasgupta and Vincent Ng.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, 2007.

Please see my thesis for more details:
Toward Language Independent Morphological Segmentation and Part-of-speech Induction
Advisor: Vincent Ng, University of Texas at Dallas.

Here are the files:

Frequency>=5 : Size 41.5K word types. POS output for the words with frequency >= 5.

All : Size 95.5K word types. POS output for all of the words in Prothom Alo except those whose feature vectors are too sparse (no more than one feature is present). See Section 3.8 of my thesis for more details.

Cluster Information : List of the open-class Bengali POS tags induced by our clustering system.

Mapping : The transliteration scheme we used to map Bengali characters to English characters.
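
The two selection criteria mentioned in the file descriptions (a frequency threshold of 5, and dropping words with no more than one feature present) can be illustrated with a minimal Python sketch. This is not the code used to produce the files. The function names, the list-of-numbers representation of a feature vector, and the toy tokens are all assumptions made for illustration; the actual features are described in the thesis.

```python
from collections import Counter

MIN_FREQUENCY = 5  # threshold stated for the "Frequency>=5" file
MIN_FEATURES = 2   # "too sparse" = no more than one feature is present


def frequent_words(tokens, min_frequency=MIN_FREQUENCY):
    """Return the word types that occur at least min_frequency times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_frequency}


def non_sparse_words(feature_vectors, min_features=MIN_FEATURES):
    """Return the word types with at least min_features nonzero features.

    feature_vectors is a hypothetical dict mapping a word to a list of
    numeric feature values; a feature counts as present if it is nonzero.
    """
    return {
        word
        for word, vector in feature_vectors.items()
        if sum(1 for value in vector if value != 0) >= min_features
    }


# Toy data: made-up transliterated tokens and made-up feature values.
tokens = ["kore"] * 6 + ["bari"] * 5 + ["nodi"] * 2
vectors = {"kore": [1, 0, 3], "bari": [0, 2, 0], "nodi": [1, 1, 0]}

print(sorted(frequent_words(tokens)))     # word types with count >= 5
print(sorted(non_sparse_words(vectors)))  # word types with >= 2 features
```

In this toy example "nodi" fails the frequency threshold and "bari" is dropped as too sparse because only one of its features is present.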