
Improved Source-Channel Models for Chinese Word Segmentation

Overview

This paper presents a Chinese word segmentation system that uses improved source-channel models of Chinese sentence generation. Chinese words are defined as one of the following four types: lexicon words, morphologically derived words, factoids, and named entities. The system provides a unified approach to the four fundamental features of word-level Chinese language processing: word segmentation, morphological analysis, factoid detection, and named entity recognition. The performance of the system is evaluated on a manually annotated test set and is also compared with several state-of-the-art systems, taking into account the fact that the definition of Chinese words often varies from system to system.
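To make the source-channel idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes the usual noisy-channel reading, in which the best segmentation W of a sentence S maximizes P(W) · P(S | W). It further assumes a unigram source model and a trivial channel in which P(S | W) = 1 whenever the words concatenate to S. The lexicon, its probabilities, the unknown-character penalty, and the function name `segment` are all hypothetical, and the four word types described above are not modeled.

```python
import math

# Hypothetical toy lexicon with made-up unigram probabilities (not from the paper).
LEXICON = {
    "北京": 0.02,      # Beijing
    "大学": 0.02,      # university
    "北京大学": 0.01,  # Peking University
    "生": 0.005,       # (bound morpheme) student / life
    "大学生": 0.01,    # university student
    "学生": 0.02,      # student
}
UNKNOWN_LOGP = math.log(1e-8)  # assumed penalty for one out-of-lexicon character
MAX_WORD_LEN = 4               # assumed longest word considered


def segment(sentence):
    """Return the word list maximizing the sum of log P(w) under the toy unigram model."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word ending at i
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            if word in LEXICON:
                logp = math.log(LEXICON[word])
            elif end - start == 1:
                logp = UNKNOWN_LOGP  # fall back to a single unknown character
            else:
                continue             # multi-character strings must be in the lexicon
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]
```

With these made-up probabilities, `segment("北京大学生")` returns `["北京", "大学生"]`, because 0.02 × 0.01 exceeds the score of the competing split `["北京大学", "生"]` (0.01 × 0.005). A different lexicon or probability estimate can flip that choice, which is one reason the definition of a word matters when systems are compared.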

Further White Paper Details
Publisher: Microsoft
File Format: PDF
Date Published: May 2003
Format: White Papers