Segmenting and POS tagging Classical Tibetan using a memory-based tagger

Hill, Nathan W. and Meelen, Marieke (2017) 'Segmenting and POS tagging Classical Tibetan using a memory-based tagger.' Himalayan Linguistics, 16 (2). pp. 64-86.

Text - Published Version
Available under License Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0).

Abstract

This paper presents a new approach to two challenging NLP tasks in Classical Tibetan: word segmentation and Part-of-Speech (POS) tagging. We demonstrate how both of these problems can be approached in the same way, by generating a memory-based tagger that assigns 1) segmentation tags and 2) POS tags to a test corpus consisting of unsegmented lines of Tibetan characters. We propose a three-stage workflow and evaluate the results of both the segmenting and the POS tagging tasks. We argue that the Memory-Based Tagger (MBT) and the proposed workflow not only provide an adequate solution to these NLP challenges but are also highly efficient tools for building larger annotated corpora of Tibetan.
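
As a rough illustration of how word segmentation can be cast as a tagging problem, the sketch below labels each syllable with a boundary tag and then rebuilds words from those tags. It is not the paper's implementation and does not call MBT: the tag names `B` (syllable begins a word) and `I` (syllable continues a word), the function names, and the transliterated example syllables are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical tag set (an assumption, not taken from the paper):
#   "B" = syllable begins a new word, "I" = syllable continues the current word.

def words_to_tags(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Turn segmented words (lists of syllables) into (syllable, tag) pairs."""
    tagged = []
    for word in words:
        for i, syllable in enumerate(word):
            tagged.append((syllable, "B" if i == 0 else "I"))
    return tagged

def tags_to_words(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild words from (syllable, tag) pairs such as a tagger might emit."""
    words: List[List[str]] = []
    for syllable, tag in tagged:
        if tag == "B" or not words:
            words.append([syllable])      # start a new word
        else:
            words[-1].append(syllable)    # extend the current word
    return words

# Illustrative input: transliterated syllables grouped into three words.
gold = [["sangs", "rgyas"], ["kyi"], ["bstan", "pa"]]
tagged = words_to_tags(gold)
restored = tags_to_words(tagged)
```

Under this framing, a tagger trained on pairs like those in `tagged` only has to predict one label per syllable, and the same machinery can then be reused to predict POS labels per word.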

Item Type: Journal Article
SOAS Departments & Centres: Departments and Subunits > Department of East Asian Languages & Cultures
Departments and Subunits > School of Languages, Cultures & Linguistics > Department of Linguistics
ISSN: 1544-7502
Depositing User: Nathan Hill
Date Deposited: 03 Feb 2018 14:11
URI: http://eprints.soas.ac.uk/id/eprint/25373
