Volume 3, Issue 1, 2025
By Muhammad Ali
DOI: 10.20547/aibd.253102
Keywords: Bloom filter, duplicate detection, data warehouse, textual data
Abstract: Data duplication is one of the core data quality issues in data warehouses. By detecting and removing duplicate data, the storage space required can be reduced. For smooth, efficient, and fast analysis, duplicated data needs to be filtered out before further processing. However, very little research has been done on deduplication of textual data. Recognizing this gap, this paper presents an analysis of Bloom filters for detecting duplicates in textual data. The paper discusses the challenges and the need for textual data deduplication and classifies the existing literature on the topic. A Bloom-filter-based textual deduplication pipeline is then presented. The proposed approach is evaluated with different parameters to investigate their impact on false positives and storage efficiency. It is found that an optimal number of hash functions can reduce duplication while maintaining low memory overhead. These findings also highlight the trade-off between accuracy, scalability, and computational cost. Hence, the proposed approach presents a promising solution for large-scale textual data deduplication in data warehouses.
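For readers unfamiliar with the mechanics, the sketch below illustrates how a Bloom filter can flag probable duplicates in a stream of text records. It is a minimal illustration, not the pipeline proposed in the paper; the class name, the use of MD5/SHA-1 double hashing, and the sizing parameters are assumptions chosen for clarity. The filter is sized with the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2, which relate the expected number of items n and the target false positive rate p to the bit-array size m and the number of hash functions k referred to in the abstract.

import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter for illustrating duplicate detection on text records."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        self.size = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round((self.size / expected_items) * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two digests via double hashing (an assumption for brevity).
        h1 = int.from_bytes(hashlib.md5(item.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(item.encode()).digest(), "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "probably seen before" (false positives possible); False is always certain.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(records):
    """Yield records that have (probably) not been seen before."""
    bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
    for record in records:
        if not bf.might_contain(record):
            bf.add(record)
            yield record


if __name__ == "__main__":
    rows = ["alpha", "beta", "alpha", "gamma", "beta"]
    print(list(deduplicate(rows)))  # -> ['alpha', 'beta', 'gamma']

As the sketch suggests, increasing k lowers the false positive rate up to the optimum (m/n) ln 2, after which the bit array saturates; this is the accuracy/memory/computation trade-off discussed in the abstract.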
