High-Performance, Language-Independent Morphological Segmentation
Sajib Dasgupta and Vincent Ng.
NAACL HLT 2007: Proceedings of the Main Conference, pp. 155-163, 2007.
Click here for the PostScript or PDF version.
The talk slides are available here.
Abstract
This paper introduces an unsupervised morphological segmentation algorithm
that shows robust performance for four languages with different levels of
morphological complexity. In particular, our algorithm outperforms
Goldsmith's Linguistica and Creutz and Lagus's Morfessor for English and
Bengali, and achieves performance that is comparable to the best results for
all three PASCAL evaluation datasets. Improvements arise from (1) the use
of relative corpus frequency and suffix-level similarity for detecting
incorrect morpheme attachments and (2) the induction of orthographic rules
and allomorphs for segmenting words where roots exhibit spelling changes
during morpheme attachments.
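
To make the first of these ideas concrete, here is a minimal sketch of how relative corpus frequency could flag an implausible attachment. The abstract names only the signal, so everything else here is an assumption made for illustration: the decision rule, the function name plausible_attachment, the max_ratio threshold, and the toy counts. It should not be read as the procedure used in the paper or in UnDivide++.

```python
from collections import Counter

def plausible_attachment(word, root, freq, max_ratio=10.0):
    """Heuristic check for segmenting `word` as `root` + affix.

    Assumption (not taken from the paper): an inflected or derived form
    should not be vastly more frequent than its proposed root, so the
    attachment is rejected when freq[word] / freq[root] exceeds max_ratio.
    """
    root_count = freq.get(root, 0)
    if root_count == 0:
        # The proposed root never occurs on its own: no evidence for it.
        return False
    return freq.get(word, 0) / root_count <= max_ratio

# Toy word counts, invented purely for this example.
toy_freq = Counter({"walk": 50, "walked": 20, "candid": 2, "candidate": 60})

walked_ok = plausible_attachment("walked", "walk", toy_freq)           # ratio 0.4
candidate_ok = plausible_attachment("candidate", "candid", toy_freq)   # ratio 30.0
```

With these toy counts, "walked" is accepted as "walk" + "ed", while "candidate" is rejected as "candid" + "ate" because the full word is far more frequent than the proposed root.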
Dataset
The Bengali dataset used in this paper is available from this page.
Software
Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!
BibTeX entry
@InProceedings{Dasgupta+Ng:07a,
author = {Sajib Dasgupta and Vincent Ng},
title = {High-Performance, Language-Independent Morphological Segmentation},
booktitle = {NAACL HLT 2007: Proceedings of the Main Conference},
pages = {155--163},
year = 2007
}