Volume 3, Issue 1, 2025
By Muhammad Ali
DOI: 10.20547/aibd.253102
Keywords: Bloom filter, duplicate detection, data warehouse, textual data
Abstract: Data duplication is one of the core data quality issues in data warehouses. By detecting and removing duplicate data, the storage space required can be reduced. For smooth, efficient, and fast analysis, duplicated data needs to be filtered out before further processing. However, very little research has been done on deduplication of textual data. Recognizing this gap, this paper presents an analysis of Bloom filters for detecting duplicates in textual data. The paper discusses the challenges and the need for textual data deduplication and classifies the existing literature on the topic. A Bloom-filter-based textual deduplication pipeline is then presented. The proposed approach is evaluated with different parameters to investigate their impact on false positives and storage efficiency. It is found that an optimal number of hash functions can reduce duplication while maintaining low memory overhead. These findings also highlight the trade-off between accuracy, scalability, and computational cost. Hence, the proposed approach presents a promising solution for large-scale textual data deduplication in data warehouses.
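For readers unfamiliar with the mechanics, the sketch below illustrates how a Bloom filter can flag probable duplicates in a stream of text records. It is a minimal illustration, not the pipeline proposed in the paper; the class name, the use of MD5/SHA-1 double hashing, and the sizing parameters are assumptions chosen for clarity. The filter is sized with the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2, which relate the expected number of items n and the target false positive rate p to the bit-array size m and the number of hash functions k referred to in the abstract.

import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter for illustrating duplicate detection on text records."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        self.size = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round((self.size / expected_items) * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two digests via double hashing (an assumption for brevity).
        h1 = int.from_bytes(hashlib.md5(item.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(item.encode()).digest(), "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "probably seen before" (false positives possible); False is always certain.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(records):
    """Yield records that have (probably) not been seen before."""
    bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
    for record in records:
        if not bf.might_contain(record):
            bf.add(record)
            yield record


if __name__ == "__main__":
    rows = ["alpha", "beta", "alpha", "gamma", "beta"]
    print(list(deduplicate(rows)))  # -> ['alpha', 'beta', 'gamma']

As the sketch suggests, increasing k lowers the false positive rate up to the optimum (m/n) ln 2, after which the bit array saturates; this is the accuracy/memory/computation trade-off discussed in the abstract.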
