Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection

Authors

  • Nongnuch Ketui, School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University
  • Thanaruk Theeramunkong, School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University

Keywords:

Thai text summarization, multi-document summarization, iterative weighting

Abstract

Summarizing multiple Thai documents poses several challenges because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and presents a three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted using fifty sets of Thai news articles with their manually constructed reference summaries. Under three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) the highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
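
The graph-formation and selection stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes units are already segmented into token lists, substitutes a PageRank-style update for the paper's iterative weighting, uses raw term counts for cosine similarity, and omits post-selection weight recalculation. All function names, the damping factor, and the redundancy threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def iterative_weights(sim, damping=0.85, iterations=50):
    """PageRank-style weighting over a similarity graph (an assumed stand-in)."""
    n = len(sim)
    weights = [1.0 / n] * n
    # Total similarity each unit shares with all other units.
    out_sums = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iterations):
        new_weights = []
        for i in range(n):
            incoming = sum(
                weights[j] * sim[j][i] / out_sums[j]
                for j in range(n)
                if j != i and out_sums[j] > 0
            )
            new_weights.append((1 - damping) / n + damping * incoming)
        weights = new_weights
    return weights


def select_units(units, max_units=2, redundancy_threshold=0.7):
    """Pick highest-weight units, skipping any too similar to one already chosen."""
    if not units:
        return []
    vectors = [Counter(u) for u in units]
    n = len(units)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    weights = iterative_weights(sim)
    chosen = []
    for i in sorted(range(n), key=lambda k: weights[k], reverse=True):
        if all(sim[i][j] < redundancy_threshold for j in chosen):
            chosen.append(i)
        if len(chosen) == max_units:
            break
    return sorted(chosen)  # indices of selected units, in document order


# Hypothetical pre-segmented units; units 0 and 1 are duplicates.
units = [
    ["flood", "hits", "bangkok"],
    ["flood", "hits", "bangkok"],
    ["rain", "causes", "flood"],
    ["football", "match", "tonight"],
]
selected = select_units(units)
```

In this toy input the two duplicate units cannot both be selected, which is the effect the redundancy-removal step is meant to have.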

Author Biographies

Nongnuch Ketui, School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University

Nongnuch Ketui received bachelor’s (2000) and master’s (2003) degrees in Computer Science from Chiang Mai University, Thailand. She is currently a Ph.D. student in the School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Thailand. Her research interests are natural language processing and data mining.

Thanaruk Theeramunkong, School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University

Thanaruk Theeramunkong received a bachelor’s degree in Electrical and Electronics Engineering, and master’s and doctoral degrees in Computer Science, from Tokyo Institute of Technology in 1990, 1992, and 1995, respectively. He currently serves as an associate professor at Sirindhorn International Institute of Technology, Thammasat University, Thailand. He is a member of ACM and the ECTI Association. His current research interests include data mining, machine learning, and natural language processing.

Published

2016-05-31

How to Cite

Ketui, N., & Theeramunkong, T. (2016). Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection. COMPUTING AND INFORMATICS, 35(1), 1–29. Retrieved from https://www.cai.sk/ojs/index.php/cai/article/view/2209
