Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison

Maciej Wielgosz; Paweł Szczepka; Pawel Russek; Ernest Jamro; Kazimierz Wiatr; Marcin Pietroń; Dominik Żurek

Authors

Maciej Wielgosz AGH University of Science and Technology, Krakow
Paweł Szczepka AGH University of Science and Technology, Krakow
Pawel Russek AGH University of Science and Technology, Krakow
Ernest Jamro AGH University of Science and Technology, Krakow
Kazimierz Wiatr AGH University of Science and Technology, Krakow
Marcin Pietroń ACC Cyfronet AGH, Krakow
Dominik Żurek ACC Cyfronet AGH, Krakow

Keywords:

Text similarity analysis, n-gram-based model, GPGPU implementation, multi-CPU implementation

Abstract

This paper presents a study of an n-gram-based document comparison method. The method is intended to build a large-scale plagiarism detection system. The work focuses not only on an efficiency of the text similarity extraction but also on the execution performance of the implemented algorithms. We took notice of detection performance, storage requirements and execution time of the proposed approach. The obtained results show the trade-offs between detection quality and computational requirements. The GPGPU and multi-CPU platforms were considered to implement the algorithms and to achieve good execution speed. The method consists of two main algorithms: a document's feature extraction and fast text comparison. The winnowing algorithm is used to generate a compressed representation of the analyzed documents. The authors designed and implemented a dedicated test framework for the algorithm. That allowed for the tuning, evaluation, and optimization of the parameters. Well-known metrics (e.g. precision, recall) were used to evaluate detection performance. The authors conducted the tests to determine the performance of the winnowing algorithm for obfuscated and unobfuscated texts for a different window and n-gram size. Also, a simplified version of the text comparison algorithm was proposed and evaluated to reduce the computational complexity of the text comparison process. The paper also presents GPGPU and multi-CPU implementations of the algorithms for different data structures. The implementation speed was tested for different algorithms' parameters and the size of data. The scalability of the algorithm on multi-CPU platforms was verified. The authors of the paper provide the repository of software tools and programs used to perform the conducted experiments.he appropriate fast document comparison system. Its performance is given in the paper.

Downloads

Download data is not yet available.

Evaluation and Implementation of n-Gram-Based Algorithm for Fast Text Comparison

Authors

Keywords:

Abstract

Downloads

Downloads

Published

How to Cite

Issue

Section

Most read articles by the same author(s)

Information

Make a Submission

Keywords