Meelen, Marieke, Roux, Élie and Hill, Nathan W. (2021) 'Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods.' ACM Transactions on Asian and Low-Resource Language Information Processing, 20 (1). pp. 1-11.
Text - Published Version
Available under License Creative Commons Attribution 4.0 (CC-BY 4.0).
Abstract
This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.
Item Type: | Journal Article |
---|---|
SOAS Departments & Centres: | Departments and Subunits > Department of East Asian Languages & Cultures |
ISSN: | 2375-4699 |
Copyright Statement: | © 2021 Copyright held by the owner/author(s) |
DOI (Digital Object Identifier): | https://doi.org/10.1145/3409488 |
Date Deposited: | 15 Mar 2021 13:51 |
URI: | https://eprints.soas.ac.uk/id/eprint/34903 |