Meelen, Marieke, Roux, Élie and Hill, Nathan W. (2021) 'Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods.' ACM Transactions on Asian and Low-Resource Language Information Processing, 20 (1). pp. 1-11.

Text - Published Version
Available under License Creative Commons Attribution 4.0 (CC-BY 4.0).

Alternative Location: https://doi.org/10.1145/3409488

Abstract

This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.

Item Type:	Journal Article
SOAS Departments & Centres:	Departments and Subunits > Department of East Asian Languages & Cultures
ISSN:	2375-4699
Copyright Statement:	© 2021 Copyright held by the owner/author(s)
DOI (Digital Object Identifier):	https://doi.org/10.1145/3409488
Date Deposited:	15 Mar 2021 13:51
URI:	https://eprints.soas.ac.uk/id/eprint/34903

SOAS Research Online

A Free Database of the Latest Research by SOAS Academics and PhD Students
