2

We need a Windows 7 program to find and remove duplicate files, but our situation differs from the standard case that most existing programs handle.

We have a fairly large static archive (collection) of photos spread across several disks; let's call them Disk A..M. We also have some disks (let's call them Disk 1..9) that contain some photos duplicated on disks A..M.

We want to add new disks (N, O, P and so on) to our collection. They will hold the photos from disks 1..9 but, of course, we don't want any photo to appear in the collection more than once.

In theory, a regular duplicate file remover can solve the task, but it would have to rescan the whole archive on every run, which would take far too long.

Ideally, as far as I can see, the real solution would be a program that scans disks A..M, stores the file sizes and hashes of the photos in an indexed database or file(s), and then checks the new disks (1..9) against this database.

However, I am having a hard time finding such a program (if one exists).

Other things to note:

  • we assume that disks A..M (the collection) don't contain any duplicates
  • the file names might have been changed
  • we aren't interested in the approximate (fuzzy) comparison offered by some photo-comparison programs; we are hunting for exact duplicate files
  • we aren't afraid of the command line. :-)
  • the program needs to work on Win7/XP
  • we would (of course) prefer freeware

2 Answers

4

Based on Dennis's solution, we decided to use the hashdeep suite, which is also available for Windows.

Basic usage:

Step 1: Generate the hashes (this should be done only once)

hashdeep64 -c tiger -r "D:\*" > Disk_D.hash

We use Tiger as the hash function. It was designed to be fast on 64-bit CPUs and, unlike SHA-1, has no known practical collisions.

Step 2: Hunt for duplicates (this must be executed for each drive / directory to check)

hashdeep64 -k Disk_D.hash -m -r "E:\My-Dir-To-Check\*" > Dupes.txt

All the duplicates are now listed in Dupes.txt.

To delete the files, you can use MS Word, LibreOffice, Notepad++ (or any other tool you know) to insert del (and any options you need) in front of each line of this text file. There are plenty of other variants, including a simple .bat file that walks through the file list and deletes every entry.

You can also review the file list and process it manually.
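
For readers who already have a bash shell (for example Cygwin, as set up in the other answer), the list could also be processed with a loop like the sketch below. The function name process_list is hypothetical, and the sketch assumes that Dupes.txt holds one full path per line, possibly with Windows CRLF line endings. Without a command it only prints the paths, so nothing is deleted by accident.

```bash
#!/bin/bash
# process_list LISTFILE [COMMAND...]
# Hypothetical helper: runs COMMAND (default: echo) on every path in LISTFILE.
# Assumes one path per line; carriage returns from CRLF files are stripped.
process_list() {
    list=$1
    shift
    if [ $# -eq 0 ]; then
        set -- echo                      # dry run unless a command is given
    fi
    tr -d '\r' < "$list" | while IFS= read -r path || [ -n "$path" ]; do
        if [ -n "$path" ]; then
            # stdin is redirected so COMMAND cannot swallow the rest of the list
            "$@" "$path" < /dev/null
        fi
    done
}
```

process_list Dupes.txt prints every entry; process_list Dupes.txt rm -- deletes them. Under Cygwin, Windows-style paths may first need converting, for instance with cygpath -u.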

2

Approach

  1. Choose a collision-resistant hash function.

    My example uses SHA-1, since the bottleneck is going to be the hard drive anyway.

    If that takes too long, it would be possible to hash only the first megabyte of each file. That should be enough to tell images apart.

  2. Read the files of interest on the disks A..M, compute their hashes and store them in a file specific to that disk (so you can add/remove disks later).

  3. Read the files of interest on the disks 1..9 and compute their hashes.

    If a file's hash is already known, perform the chosen action (list or delete).
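
The first-megabyte shortcut mentioned in step 1 could look like the sketch below. The function name partial_hash and the 1 MiB limit are illustrative assumptions; the scripts in the Setup section hash whole files. Files that differ only after the first megabyte would get the same partial hash, so a full comparison before deleting anything is a sensible safeguard.

```bash
#!/bin/bash
# partial_hash FILE
# Hypothetical helper: prints the SHA-1 of the first 1 MiB (1048576 bytes) of FILE.
# Files shorter than 1 MiB are hashed completely.
partial_hash() {
    head -c 1048576 -- "$1" | sha1sum | cut -b -40
}
```

If this were used, both the stored hashes and the hashes of the disks being checked would have to be computed the same way.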

Setup

  1. Download and install Cygwin, a collection of tools that provides a Linux-like environment on Windows.

  2. In Windows Explorer, open the folder %ProgramFiles(x86)%\Cygwin\home\%USERNAME%.

  3. Edit the file .bashrc and append the following line:

    export PATH=~:$PATH
    
  4. Create a file called hashdrive and save the following code into it:

    #!/bin/bash
    # Usage: hashdrive DRIVELETTER EXTENSIONS DRIVENAME
    
    DRIVELETTER=$(echo "$1" | tr '[:upper:]' '[:lower:]')
    EXTENSIONS=$(echo "$2" | sed 's/,/\\|/g')    # jpg,png -> jpg\|png for find's regex
    DRIVENAME=$(echo "$3" | tr '[:upper:]' '[:lower:]')
    
    # Abort if the drive is missing or a hash file with that name already exists.
    [ -d "/cygdrive/$DRIVELETTER" ] || { echo "Drive $DRIVELETTER: does not exist." ; exit 1 ; }
    [ -f ~/drives/"$DRIVENAME" ] && { echo "Hashfile for drive $DRIVENAME already exists." ; exit 1 ; }
    
    # Store one SHA-1 hash (40 hex characters) per matching file.
    mkdir -p ~/drives
    find "/cygdrive/$DRIVELETTER" -type f -iregex ".*\.\($EXTENSIONS\)" -exec sha1sum {} \; | cut -b -40 > ~/drives/"$DRIVENAME"
    
  5. Create a file called checkdrive and save the following code into it:

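    #!/bin/bash
    # Usage: checkdrive DRIVELETTER EXTENSIONS ACTION
    
    DRIVELETTER=$(echo "$1" | tr '[:upper:]' '[:lower:]')
    EXTENSIONS=$(echo "$2" | sed 's/,/\\|/g')    # jpg,png -> jpg\|png for find's regex
    ACTION=$(echo "$3" | tr '[:upper:]' '[:lower:]')
    
    [ -d "/cygdrive/$DRIVELETTER" ] || { echo "Drive $DRIVELETTER: does not exist." ; exit 1 ; }
    
    # Hash every matching file; run ACTION on it if the hash is in any stored hash file.
    # NUL-separated paths keep names with spaces or brackets intact.
    find "/cygdrive/$DRIVELETTER" -type f -iregex ".*\.\($EXTENSIONS\)" -print0 |
    while IFS= read -r -d '' FILE; do
        HASH=$(sha1sum "$FILE" | cut -b -40)
        # stdin is redirected so ACTION cannot consume the remaining file list
        grep -q -m 1 "$HASH" ~/drives/* && $ACTION "$FILE" < /dev/null
    done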
    

Usage

  • To save the hashes of all images of a certain disk to a file, start Cygwin and execute the following command:

    hashdrive DRIVELETTER EXTENSIONS DRIVENAME
    

    For example, if DiskA is mounted as drive D: and you want to hash all images with extensions jpg and png, use the following command:

    hashdrive d jpg,png diska
    

    There must be no space in jpg,png.

  • To check a disk for duplicate images, start Cygwin and execute the following command:

    checkdrive DRIVELETTER EXTENSIONS ACTION
    

    For example, if Disk1 is mounted as drive E: and you want to list all duplicate images with extensions jpg and png, use the following command:

    checkdrive e jpg,png echo
    

    If you want to remove the files directly, use rm instead of echo.

  • To remove a disk from the database, just delete the file DRIVENAME in the folder %ProgramFiles(x86)%\Cygwin\home\%USERNAME%\drives.
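
Because every disk has its own hash file, the same folder can also tell you which disk already holds a given photo. The sketch below is an illustration only: the function name whichdrive and the DRIVES_DIR variable are hypothetical, and it assumes the hash files were written by hashdrive (one SHA-1 per line).

```bash
#!/bin/bash
# whichdrive FILE
# Hypothetical helper: prints the hash files that contain the SHA-1 of FILE.
# DRIVES_DIR is an assumed override; by default the folder used by hashdrive is searched.
whichdrive() {
    local hash
    hash=$(sha1sum "$1" | cut -b -40)
    # -l prints only the names of the files with a match, -F treats the hash as a literal
    grep -l -F -- "$hash" "${DRIVES_DIR:-$HOME/drives}"/*
}
```

For example, whichdrive /cygdrive/e/photo.jpg would print the path of the diska hash file if that photo was hashed from DiskA.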

Caution

The rm command does not move files to the Recycle Bin; it deletes them directly.

While it should be possible to recover the files anyway, be careful when using the rm action and try echo before you use rm.
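
A gentler alternative to rm is to move duplicates into a holding folder and delete them only after a review. The following is a minimal sketch under stated assumptions: the name quarantine, the QUARANTINE_DIR variable and its default location are all hypothetical and not part of the scripts above.

```bash
#!/bin/bash
# quarantine FILE
# Hypothetical ACTION for checkdrive: moves FILE into a holding folder instead of deleting it.
# QUARANTINE_DIR is an assumed location; change it to suit your setup.
QUARANTINE_DIR=${QUARANTINE_DIR:-"$HOME/duplicates"}

quarantine() {
    # Recreate the original directory structure so equal file names cannot clash.
    local target
    target="$QUARANTINE_DIR/$(dirname -- "$1")"
    mkdir -p -- "$target" && mv -- "$1" "$target/"
}

# When run as a script with an argument, act on that file.
if [ $# -gt 0 ]; then
    quarantine "$1"
fi
```

Saved as an executable file named quarantine next to hashdrive and checkdrive, it could be used as checkdrive e jpg,png quarantine. Note that moving files to a different drive copies the data, so this is slower than rm.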

Dennis
  • 50,701