FingerPrint Based Duplicate Detection in Streamed Data

Authors

  • Amritpal Singh Department of Computer Science and Engineering, Thapar University, Patiala, Punjab
  • Shalini Batra Department of Computer Science and Engineering, Thapar University, Patiala, Punjab

Keywords:

Duplicate detection, stable Bloom filter, d-left hashing, FingerPrint bits, streaming dataing, Streaming Data

Abstract

In computing, duplicate data detection refers to identifying duplicate copies of repeating data. Identifying duplicate data items in streamed data and eliminating them before storing, is a complex job. This paper proposes a novel data structure for duplicate detection using a variant of stable Bloom filter named as FingerPrint Stable Bloom Filter (FP-SBF). The proposed approach uses counting Bloom filter with fingerprint bits along with an optimization mechanism for duplicate detection. FP-SBF uses d-left hashing which reduces the computational time and decreases the false positives as well as false negatives. FP-SBF can process unbounded data in single pass, using k hash functions, and successfully differentiate between duplicate and distinct elements in O(k+1) time, independent of the size of incoming data. The performance of FP-SBF has been compared with various Bloom Filters used for stream data duplication detection and it has been theoretically and experimentally proved that the proposed approach efficiently detects the duplicates in streaming data with less memory requirements.

Downloads

Download data is not yet available.

Downloads

Published

2019-02-04

How to Cite

Singh, A., & Batra, S. (2019). FingerPrint Based Duplicate Detection in Streamed Data. COMPUTING AND INFORMATICS, 37(6), 1313–1338. Retrieved from https://www.cai.sk/ojs/index.php/cai/article/view/2018_6_1313