When working with raw source files, one common problem we run into is duplicate data, and we need a quick, efficient way to eliminate duplicate files.
Here is one approach to quickly and efficiently eliminating duplicate documents using hashing and parallel processing. This article covers removing duplicate documents using hashing, in both a sequential and a parallel implementation.

The approach is:
1. Calculate the hash value of every file.
2. Identify the unique files using the hash values.
3. Delete the duplicate files.

The important functions are described below.

1. Calculating the hash value

This function takes a file path as input and returns the hash value of that file. I am using the MD5 hashing algorithm here; you can use any other hashing algorithm of your choice.

def calculate_hash_val(path, block_size=65536):

2. Adding unique files to a dictionary

This function takes an empty dictionary (dic_unique) and a dictionary whose keys are all the input file paths and whose values are their hash values (calculated with the calculate_hash_val function above). It returns dic_unique with all duplicates removed.

def find_unique_files(dic_unique, dict1):

3. Deleting duplicate files from the source

Once the unique files are identified, the final step is to delete the remaining duplicate files. The function below deletes the duplicates from the input folder. It takes two inputs, all_inps and unique_inps, which contain the file paths and the unique files' hash values respectively.

def remove_duplicate_files(all_inps, unique_inps):

Required imports and declarations

Assign the complete input folder path to the ‘input_files_path’ variable.

import datetime, os, sys, logging, hashlib

The same functions are used in both the sequential and parallel implementations. The code for both methods follows.
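Filled in, the three function stubs above might look like the following sketch. The chunked-read loop, the 65536-byte block size, and the exact duplicate test are my assumptions, not necessarily the author's original code:

```python
import hashlib
import os

def calculate_hash_val(path, block_size=65536):
    # Read the file in fixed-size chunks so large files do not
    # have to fit in memory, and return the MD5 hex digest.
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        chunk = f.read(block_size)
        while chunk:
            hasher.update(chunk)
            chunk = f.read(block_size)
    return hasher.hexdigest()

def find_unique_files(dic_unique, dict1):
    # dict1 maps file path -> hash value. Keep the first path seen
    # for each distinct hash; later paths with the same hash are duplicates.
    for file_path, hash_val in dict1.items():
        if hash_val not in dic_unique.values():
            dic_unique[file_path] = hash_val
    return dic_unique

def remove_duplicate_files(all_inps, unique_inps):
    # Delete every file whose path was not kept as unique.
    for file_path in all_inps:
        if file_path not in unique_inps:
            os.remove(file_path)
```

Note that the `hash_val not in dic_unique.values()` test is quadratic in the number of files; inverting the dictionary (hash as key, path as value) would make the lookup constant-time, at the cost of changing the signatures described above.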
METHOD 1 (Sequential Implementation)

def rmv_dup_process(input_files):

METHOD 2 (Parallel Implementation)

This can also be implemented with parallel processing to make it run faster. If you want to use parallel processing, just replace the sequential implementation logic above with the following code:

# If you are using multiprocessing, you can also update the number of
# processes in the file. The default value I used is 4.

That's it. The source code is available at the GitHub link below. I will also add a couple more parallel-processing implementations to the GitHub repo in the coming days.

https://github.com/KiranKumarChilla/Removing-Duplicate-Docs-Using-Hashing-in-Python
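As a concrete sketch of the two methods described earlier: the sequential driver hashes the files one by one, while the parallel driver hands the hashing step to a multiprocessing.Pool of 4 workers. The function bodies, the _delete_duplicates helper, and the Pool.map usage are my assumptions; only the names rmv_dup_process and the default of 4 processes come from the article.

```python
import hashlib
import os
from multiprocessing import Pool

def calculate_hash_val(path, block_size=65536):
    # MD5 hex digest of the file at `path`, read in fixed-size chunks.
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        chunk = f.read(block_size)
        while chunk:
            hasher.update(chunk)
            chunk = f.read(block_size)
    return hasher.hexdigest()

def _delete_duplicates(input_files, all_hashes):
    # Keep the first file seen for each distinct hash, delete the rest.
    dic_unique = {}
    for path, h in all_hashes.items():
        if h not in dic_unique.values():
            dic_unique[path] = h
    for path in input_files:
        if path not in dic_unique:
            os.remove(path)

def rmv_dup_process(input_files):
    # METHOD 1 (sequential): hash the files one after another.
    all_hashes = {path: calculate_hash_val(path) for path in input_files}
    _delete_duplicates(input_files, all_hashes)

def rmv_dup_parallel(input_files, processes=4):
    # METHOD 2 (parallel): hash the files in a pool of worker
    # processes; the deduplication step itself stays sequential.
    with Pool(processes) as pool:
        hashes = pool.map(calculate_hash_val, input_files)
    _delete_duplicates(input_files, dict(zip(input_files, hashes)))
```

To run either one, collect the paths from the input folder, e.g. `rmv_dup_process([os.path.join(input_files_path, f) for f in os.listdir(input_files_path)])`. Parallelism pays off mainly when hashing is I/O- or CPU-bound across many large files; for a handful of small files the pool's startup cost can outweigh the gain.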