The hash values of two duplicate files are exactly the same; for two different files, the hash is unpredictable and will not be even partially similar.

One important problem we run into with data, especially when working with raw source files, is duplicates, so we need a quick and efficient way to eliminate duplicate files.

Here is one such approach to quickly and efficiently eliminating duplicate documents using hashing and parallel processing. This article covers removing duplicate documents using hashing in both a sequential and a parallel implementation.

The Approach is:

1. Calculate the hash value for all files.
2. Identify the unique files using the hash values.
3. Delete the duplicate files.

Please find the details of the important functions below.

1. Calculating hash value

This function takes a file path as input and returns the hash value of that file. I am using the MD5 hashing algorithm here; you can use any other hashing algorithm of your choice.
You can also pass a block size as a parameter. For big files, for example, you may want to calculate the hash value for only the first few bytes of data rather than the entire file. In that case, use the block size: when setting a value for block_size, make sure to replace file.read() with file.read(block_size) in the function below.

def calculate_hash_val(path, block_size=None):
    # Returns the MD5 hex digest of the file at `path`.
    # To hash only the first block of a large file, pass block_size and
    # replace file.read() with file.read(block_size) as described above.
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read()
        while len(data) > 0:
            hasher.update(data)
            data = file.read()
    return hasher.hexdigest()
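For very large files, the block_size idea above can be turned into a separate helper. This is only a sketch and not part of the original code; calculate_partial_hash is a hypothetical name, and hashing only the first block means two different files that happen to share their first block would be treated as duplicates, so use it with care.

import hashlib

def calculate_partial_hash(path, block_size=1024 * 1024):
    # Hash only the first `block_size` bytes (here 1 MB) of the file.
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        hasher.update(file.read(block_size))
    return hasher.hexdigest()

# calculate_partial_hash(r'H:\files\input\big_video.mp4')  # hypothetical path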

2. Adding Unique files to Dictionary

This function takes an empty dictionary (dic_unique) and a dictionary (dict1) whose keys are hash values and whose values are the corresponding file paths (the hash values we calculated with the calculate_hash_val function above). It copies into dic_unique only the hashes it has not seen before, so dic_unique ends up holding exactly one file path per hash value, i.e., no duplicates.

def find_unique_files(dic_unique, dict1):
    # dict1 maps hash value -> file path; keep the first path seen for each hash
    for key in dict1.keys():
        if key not in dic_unique:
            dic_unique[key] = dict1[key]
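A tiny illustration of how it behaves (the hash strings and paths here are hypothetical placeholders, just for demonstration):

dic_unique = {}
batch1 = {'hash_a': r'H:\files\input\a.txt'}
batch2 = {'hash_a': r'H:\files\input\a_copy.txt', 'hash_b': r'H:\files\input\b.txt'}
find_unique_files(dic_unique, batch1)
find_unique_files(dic_unique, batch2)
print(dic_unique)
# {'hash_a': 'H:\\files\\input\\a.txt', 'hash_b': 'H:\\files\\input\\b.txt'}
# a_copy.txt is skipped because its hash is already in dic_unique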

3. Deleting Duplicate files from Source

Once the unique files are identified, the final step is to delete the remaining duplicate files. The function below deletes the duplicates from the input folder. It takes two inputs: all_inps, which maps each file path to its hash value, and unique_inps, which maps each hash value to the single file path we want to keep.

def remove_duplicate_files(all_inps, unique_inps):
    # all_inps: file path -> hash; unique_inps: hash -> file path to keep
    for file_name in all_inps.keys():
        if all_inps[file_name] in unique_inps and file_name != unique_inps[all_inps[file_name]]:
            # this path shares a hash with the kept file, so it is a duplicate
            os.remove(file_name)
        elif all_inps[file_name] not in unique_inps:
            os.remove(file_name)
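If you want to see what would be deleted before actually deleting anything, a dry-run variant can apply the same conditions and only print the paths. This is not part of the original code, and preview_duplicate_files is a hypothetical name:

def preview_duplicate_files(all_inps, unique_inps):
    # Same conditions as remove_duplicate_files, but prints instead of deleting.
    for file_name in all_inps.keys():
        file_hash = all_inps[file_name]
        if file_hash not in unique_inps or file_name != unique_inps[file_hash]:
            print('would remove:', file_name)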

Required imports and declarations

Please assign the complete input folder path to the input_files_path variable.

import datetime, os, sys, logging, hashlib, multiprocessing
from pathlib import Path
from os import listdir
from os.path import isfile, join

input_files_path = r'H:\files\input'
input_files = [f for f in listdir(input_files_path) if isfile(join(input_files_path, f))]
input_files = [os.path.join(input_files_path, x) for x in input_files]
inp_dups = {}
unique_inps = {}
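Note that listdir with isfile only picks up files directly inside input_files_path. If your duplicates may also sit in subfolders, one possible alternative (a sketch, assuming you want every file under the folder) is a recursive walk:

input_files = []
for root, _, files in os.walk(input_files_path):
    for name in files:
        input_files.append(os.path.join(root, name))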

We will use the same functions above for both the sequential and the parallel implementation. Please find the code for both methods below.

METHOD 1 (Sequential Implementation)

def rmv_dup_process(input_files):
    all_inps = {}
    for file_path in input_files:
        if Path(file_path).exists():
            files_hash = calculate_hash_val(file_path)
            inp_dups[files_hash] = file_path   # hash -> path (later duplicates overwrite)
            all_inps[file_path] = files_hash   # path -> hash
        else:
            print('%s is not a valid path, please verify' % file_path)
            sys.exit()

    find_unique_files(unique_inps, inp_dups)
    print(inp_dups)
    remove_duplicate_files(all_inps, unique_inps)


if __name__ == '__main__':
    datetime1 = datetime.datetime.now()
    rmv_dup_process(input_files)
    datetime2 = datetime.datetime.now()

    print("processed in", str(datetime2 - datetime1))

METHOD 2 (Parallel Implementation)

The same process can also be implemented using parallel processing to make it run faster. If you want to use parallel processing, just replace the sequential implementation logic above with the code below:

# If you are using multiprocessing, you can also update the number of
# processes here. The default value I used is 4.
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    keys_dict = pool.map(calculate_hash_val, input_files)
    pool.close()

    inp_dups = dict(zip(keys_dict, input_files))   # hash -> path
    all_inps = dict(zip(input_files, keys_dict))   # path -> hash

    find_unique_files(unique_inps, inp_dups)
    remove_duplicate_files(all_inps, unique_inps)
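As another option (a sketch, not from the repo), the same flow can be written with concurrent.futures.ProcessPoolExecutor from the standard library; max_workers=4 mirrors the pool size used above:

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        keys_dict = list(executor.map(calculate_hash_val, input_files))

    inp_dups = dict(zip(keys_dict, input_files))   # hash -> path
    all_inps = dict(zip(input_files, keys_dict))   # path -> hash

    find_unique_files(unique_inps, inp_dups)
    remove_duplicate_files(all_inps, unique_inps)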

That’s it. Please find the GitHub link below for the source code. I will also add a couple more parallel processing implementations to the GitHub repo in the coming days.

https://github.com/KiranKumarChilla/Removing-Duplicate-Docs-Using-Hashing-in-Python

Can 2 different files have the same hash?

A good primer on hashes can be found on Wikipedia. A hash is 'nearly' unique to a file's contents because it is possible for two different files to have identical hashes. This is known as a 'collision'. The probability of a collision occurring varies depending on the strength of the hash you generate.

Will a copy of a file have the same hash?

An exact, byte-for-byte copy of a file will have the same hash. However, a Word file and the PDF published from that Word file may contain the same content, yet their hash values will be different. Even copying the content from one file to another in the same software program can result in different hash values, or even different file sizes.

What is hash value of a file?

A hash value is a unique value that corresponds to the content of the file. Rather than identifying the contents of a file by its file name, extension, or other designation, a hash assigns a unique value to the contents of a file.

What does it mean if the hash you generated and the one on the website are the same?

It is deterministic, meaning that a specific input (or file) will always produce the same hash value (number string). This makes it easy to verify the authenticity of a file. If two people independently (and correctly) check the hash value of a file, they will always get the same answer.
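As a small illustration of that determinism, here is a hedged sketch that verifies a downloaded file against a published digest; the path and expected value are hypothetical placeholders:

import hashlib

def verify_file(path, expected_hex, algorithm='sha256'):
    # Hash the file in chunks and compare against the published digest.
    hasher = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_hex.lower()

# verify_file(r'H:\files\downloads\installer.exe', 'expected digest goes here')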