Thursday, January 08, 2009

Finding Duplicate Files

I take quite a few photos, admittedly not as many as I'd like, but that's another story. All these photos need to be stored somewhere, which leads to an escalating amount of disk and backup space being used. Over the years the inevitable copies have been made, along with re-importing the same photos from an SD card that wasn't wiped. All this adds up to quite a bit of disk usage that I'm pretty sure is just duplicate stuff.

Since I'm a developer I decided I could fix this, and so with my new-found love of Python I cracked open PyDev and set to work; the result is attached. It's a simple command that searches a given directory tree and reports any duplicate files it finds. Duplicates are identified not by name or date, but by computing an MD5 hash of each file and comparing the hashes. Usage is as follows:

     python FindDuplicates.py [directory]

If no directory is specified, the current directory is used. The script dumps out the duplicates it finds once it has finished searching the directory tree, or as soon as Ctrl+C is pressed. A sample of the output:

Searching: c:\temp\demo
Found Duplicates :
        Hash 37b170a11bc7a2f83ffe5fb5d39dd676 :
                c:\temp\demo\Autumn Leaves.jpg
                c:\temp\demo\DuplicateOfAutumnLeaves.jpg
                c:\temp\demo\ThisIsADuplicate.jpg
        Hash 6a83579dd334d128ffd15582e6006795 :
                c:\temp\demo\AnotherDuplicate.jpg
                c:\temp\demo\Toco Toucan.jpg
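
For anyone curious about the mechanics without opening the zip, here's a minimal sketch of the same approach (the attached script is the real version; names like md5_of below are mine, not necessarily what's in the zip): walk the tree with os.walk, hash each file with hashlib.md5, group paths by digest, and catch KeyboardInterrupt so Ctrl+C still produces a report.

import hashlib
import os
import sys
from collections import defaultdict


def md5_of(path, chunk_size=64 * 1024):
    """MD5 a file in chunks so large photos aren't read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main():
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"Searching: {root}")
    by_hash = defaultdict(list)  # digest -> list of paths sharing it
    try:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    by_hash[md5_of(path)].append(path)
                except OSError:
                    pass  # unreadable file; skip it and carry on
    except KeyboardInterrupt:
        pass  # Ctrl+C: fall through and report whatever was found so far
    print("Found Duplicates :")
    for digest, paths in sorted(by_hash.items()):
        if len(paths) > 1:
            print(f"\tHash {digest} :")
            for p in sorted(paths):
                print(f"\t\t{p}")


if __name__ == "__main__":
    main()

Hashing in chunks matters here: photos can be tens of megabytes each, so reading whole files into memory at once would be wasteful.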

I'm pretty pleased with the outcome and learnt some nice Python along the way. Feel free to use it in any way you see fit.

FindDuplicates.zip
