Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations

Authors

  • Chanatip Saetia, Chulalongkorn University, https://orcid.org/0000-0002-9403-4388
  • Ekapol Chuangsuwanich, Chulalongkorn University
  • Tawunrat Chalothorn, Kasikorn Labs, KBTG
  • Peerapon Vateekul, Chulalongkorn University

DOI:

https://doi.org/10.4186/ej.2021.25.6.15

Keywords:

Natural language processing, machine learning, artificial neural networks, sequence tagging model, Thai sentence segmentation, Thai language

Abstract

A sentence is typically treated as the minimal syntactic unit from which valuable information is extracted from long text. Written Thai, however, has no explicit sentence markers. Prior work has applied machine learning to this problem, but a deep learning approach has never been employed. We propose a deep learning model for sentence segmentation that makes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data are scarce and annotation is difficult and time-consuming, we investigate two techniques that make use of unlabeled data: Cross-View Training (CVT) as a semi-supervised learning technique, and a pre-trained language model (ELMo) to improve word representation. In the experiments, our model reduced the relative error by 7.4% and 18.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. Ablation studies revealed that the main contributing factor was the adoption of n-gram features. Further analysis of these features with an interpretation technique indicated that the model uses them in the same way that humans do.
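
The idea of a local representation built from word groups around each position can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_ngrams`, the `<pad>` token, and the choice to collect every n-gram that contains the current word are all assumptions made for illustration. In a real model, each n-gram would then be mapped to an embedding vector.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def local_ngrams(tokens: List[str], index: int, n: int) -> List[Tuple[str, ...]]:
    """Return every n-gram of length n that contains the token at `index`.

    The sequence is padded on both sides so that words at the edges of the
    text still receive the same number of n-grams as words in the middle.
    """
    padded = [PAD] * (n - 1) + tokens + [PAD] * (n - 1)
    center = index + (n - 1)  # position of the target token after padding
    return [
        tuple(padded[start:start + n])
        for start in range(center - n + 1, center + 1)
    ]


if __name__ == "__main__":
    words = ["w1", "w2", "w3", "w4"]  # placeholder tokens, already word-segmented
    for i, word in enumerate(words):
        print(word, local_ngrams(words, i, 2))
```

Each position yields exactly n n-grams, so a sequence tagger could concatenate or pool their embeddings with the word embedding before predicting whether a sentence boundary follows that word.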

Author Biographies

Chanatip Saetia

Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand

Ekapol Chuangsuwanich

Chula Intelligent and Complex Systems, Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand

Tawunrat Chalothorn

Kasikorn Labs Co., Ltd., Kasikorn Business Technology Group, Nonthaburi 11120, Thailand

Peerapon Vateekul

Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand

Published In

Vol 25 No 6, Jun 30, 2021

How to Cite

[1] C. Saetia, E. Chuangsuwanich, T. Chalothorn, and P. Vateekul, “Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations”, Eng. J., vol. 25, no. 6, pp. 15-33, Jun. 2021.