SemPCA-Summarizer: Exploiting Semantic Principal Component Analysis for Automatic Summary Generation

Authors

  • Óscar Alcón, Department of Software and Computing Systems, University of Alicante
  • Elena Lloret, Department of Software and Computing Systems, University of Alicante

Keywords

Natural language processing, human language technologies, intelligent information processing, automatic text summarization, principal component analysis

Abstract

Text summarization is the task of condensing a document while keeping its relevant information. Integrated into wider information systems, it can help users access key information without having to read everything, thus improving efficiency. In this research work, we have developed and evaluated a single-document extractive summarization approach, named SemPCA-Summarizer, which reduces the dimensionality of a document using the Principal Component Analysis (PCA) technique enriched with semantic information. A concept-sentence matrix is built from the input document, and PCA is then used to identify and rank the relevant concepts, which are in turn used to select the most important sentences through different heuristics, leading to various types of summaries. The results show that the generated summaries are very competitive, both quantitatively and qualitatively, indicating that the proposed approach is appropriate for concisely providing key information and thereby helps users cope with the huge amount of available information more quickly and efficiently.
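
The pipeline outlined in the abstract (concept-sentence matrix, PCA-based concept ranking, heuristic sentence selection) can be illustrated with a minimal sketch. This is not the authors' implementation: lowercased word tokens stand in for semantic concepts, the semantic enrichment step is omitted, and the function names, the weighting scheme, and the single selection heuristic below are all assumptions made for illustration.

```python
import re
import numpy as np

def split_sentences(text):
    # Naive splitter on sentence-final punctuation (an assumption,
    # not the paper's preprocessing).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def concept_sentence_matrix(sentences):
    # Rows are "concepts" (here: plain lowercased words, a stand-in for
    # semantic concepts), columns are sentences, cells are counts.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    concepts = sorted({t for toks in tokenized for t in toks})
    index = {c: i for i, c in enumerate(concepts)}
    matrix = np.zeros((len(concepts), len(sentences)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            matrix[index[t], j] += 1.0
    return concepts, matrix

def rank_concepts(matrix, n_components=1):
    # PCA via SVD of the mean-centered data, treating sentences as
    # observations and concepts as variables.
    observations = matrix.T
    centered = observations - observations.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    # Hypothetical weighting: absolute loadings on the leading components,
    # scaled by the corresponding singular values.
    return (np.abs(vt[:k]) * singular_values[:k, None]).sum(axis=0)

def summarize(text, n_sentences=2, n_concepts=5):
    sentences = split_sentences(text)
    if not sentences:
        return []
    _, matrix = concept_sentence_matrix(sentences)
    weights = rank_concepts(matrix)
    top = np.argsort(weights)[::-1][:n_concepts]
    # One possible heuristic: score a sentence by the summed weight of
    # the top-ranked concepts it contains.
    scores = np.array([
        weights[top][matrix[top, j] > 0].sum() for j in range(len(sentences))
    ])
    chosen = sorted(np.argsort(scores)[::-1][:n_sentences])
    # Selected sentences are returned in their original document order.
    return [sentences[i] for i in chosen]
```

Calling `summarize(document_text, n_sentences=3)` returns three sentences from the input in their original order. Swapping the scoring rule inside `summarize` is where different selection heuristics, and hence different summary types, would plug in.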

Author Biographies

Óscar Alcón, Department of Software and Computing Systems, University of Alicante

Óscar Alcón currently works at CYPE Ingenieros S.A., where he is responsible for product development. His tasks include the specification and supervision of software development for the design of common telecommunications infrastructures in buildings for Spain and Portugal. This work covers market research, setting deadlines, contact with customers and collaborators, defining lines of work, product presentations, commercial activities, and relations with the marketing and sales departments. He also conducts research into Natural Language Processing and Text Summarization as a collaborator with the GPLSI research group at the University of Alicante, specifically for the project "DIIM2.0: Desarrollo de técnicas inteligente e interactivas de minería y generación de información sobre la web 2.0".

Elena Lloret, Department of Software and Computing Systems, University of Alicante

Elena Lloret is a full-time PhD assistant lecturer at the University of Alicante in Spain, where she obtained her PhD, focused on Text Summarisation, in 2011. Her main fields of interest are Natural Language Processing and, more specifically, Text Summarisation and Natural Language Generation. She is the author of over 60 scientific publications in international peer-reviewed conferences and refereed journals. She has served on the program committees of international conferences such as ACL, EACL, RANLP, and COLING. She is a member of the Spanish Society for Natural Language Processing (SEPLN) and has participated in a number of national and EU-funded projects, the most recent being Canonical Representation and transformations of texts applied to the Human Language Technologies (TIN2015-65100-R) and SAM - Dynamic Social & Media Content Syndication for 2nd Screen (grant no. 611312). She has also collaborated with international groups at the University of Wolverhampton (UK), the University of Sheffield (UK), the University of Edinburgh (UK), and the Lorraine Research Laboratory in Computer Science and its Applications in France. Since 2009, she has been involved in teaching activities at the University of Alicante, specifically for the degrees in Computer Science Engineering and Multimedia Engineering and for the master's programmes in Information Technologies and in English and Spanish for Specific Purposes, amounting to 200 teaching hours per year.

Published

2018-11-21

How to Cite

Alcón, Óscar, & Lloret, E. (2018). SemPCA-Summarizer: Exploiting Semantic Principal Component Analysis for Automatic Summary Generation. COMPUTING AND INFORMATICS, 37(5), 1126–1148. Retrieved from https://www.cai.sk/ojs/index.php/cai/article/view/2018_5_1126