(archive 'newLISPer)

December 16, 2005

Finding duplicate files on MacOS X

Filed under: newLISP — newlisper @ 09:56

One of the problems that apparently only I encounter is that of duplicate files. It’s probably due to the disorganized way I manage my projects. But there’s no obvious way to find duplicate files. There are some applications that can find them, but I haven’t found a way using the command line. There are some interesting problems:

– similar files don’t always have the same names!

Surprisingly, identical files don’t always have similar names. When you duplicate a file using the Finder, the name of the copy is changed. When you download files using Safari, numbers are appended. And so on. And then you make a safe backup copy of a folder while working on a new project.

– resource-fork files look empty

There are some files that, when you list them at the command line, appear to be 0 bytes. For example:

$ ls -l /System/Library/Fonts/H*
-rw-r--r--   1 root  wheel  7501763 Mar 20  2005 /System/Library/Fonts/Hei.dfont
-rw-r--r--   1 root  wheel        0 Mar 28  2005 /System/Library/Fonts/HelveLTMM
-rw-r--r--   1 root  wheel        0 Mar 28  2005 /System/Library/Fonts/Helvetica LT MM
-rw-r--r--   1 root  wheel   991360 Mar 20  2005 /System/Library/Fonts/Helvetica.dfont

These are resource-fork files, a clever technology inherited from the classic (i.e. old) MacOS which occasionally causes a few headaches in the brave new world of classic (i.e. old) Unix. (The technique for finding them involves adding /rsrc to the file's path, which addresses the resource fork rather than the data fork.)

A newLISP tool for finding duplicates

I don’t recommend running this script on an entire disk. You specify one or two folders to examine. The script collects the filenames and their sizes in a simple Lisp list, then sorts the list. (This alone warns you not to tackle more than 20000 files!) Then, if two files appear to have the same size, the MD5 checksums are compared. I’m making the assumption that if two files have the same size and the same MD5 checksum, they’re probably duplicates. Finally, I’m not doing anything with the results except listing them – I’m never entirely happy with scripts that delete things for me.
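The size-first idea can be tried on its own, without touching the disk. Here is a minimal sketch, assuming a hypothetical helper called same-size-groups (it isn't part of the script below) that takes ((size path) ...) pairs, sorts them, and keeps only the runs of items whose sizes match. Only those runs would then need the slower MD5 comparison.

```newlisp
;; hypothetical helper, for illustration only
;; pairs is a list like ((size1 file1) (size2 file2) ...)
;; returns a list of groups, each group holding two or more same-size items
(define (same-size-groups pairs , groups run)
    (set 'groups '() 'run '())
    (dolist (p (sort pairs)) ; sorting puts equal sizes next to each other
        (if (and run (= (first p) (first (last run))))
            (push p run -1) ; same size as the current run: extend it
            (begin
                ; size changed: keep the finished run only if it has 2+ items
                (if (> (length run) 1) (push run groups -1))
                (set 'run (list p)))))
    ; don't forget the final run
    (if (> (length run) 1) (push run groups -1))
    groups)

(println (same-size-groups '((10 "a") (20 "b") (10 "c") (30 "d") (20 "e"))))
;; prints (((10 "a") (10 "c")) ((20 "b") (20 "e")))
```

The script itself takes a simpler route and just compares each item with the one before it, which reports the same candidates pair by pair.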

I’ve found newLISP to be an excellent tool for this sort of job. I’m not interested in performance – it’s just the sort of task to run while you do something more interesting!


;; finds duplicate files
;; ignores filenames but tests sizes and MacOS resource forks
;; usage: find-duplicate-files.lsp [folder1 folder2 ...]
;; needs version containing (real-path)

(define (set-spotlight-comment file comment)
    "set Spotlight comment of file to comment"
        (exec (format [text]osascript -e 'set pf to POSIX file "%s" ' -e 'tell application "Finder" to set comment of pf to "%s" ' [/text] file comment)))

(define (walk-tree folder , item-name )
" build a list of all files in the folder, with sizes:
    ((size1 file1) (size2 file2) ... )"
    (dolist (item (directory folder))
        (set 'item-name (string folder "/" item))
        (if (and
                    (directory? item-name)
                    (!= item ".")
                    (!= item ".."))
            (walk-tree item-name) ; recurse
            ; else process the item
            (and
                (not (starts-with item ".")) ; skip hidden files
                (set 'path-name (real-path item-name))
                (file-info path-name) ; skip symlinks...
                (set 'dataforksize (first (file-info path-name)))
                ; file? and file-info take a plain path, so no shell quotes here
                (if (file? (string path-name "/..namedfork/rsrc"))
                    ; add resource fork size if one exists at /..namedfork/rsrc
                    (set 'resourceforksize
                        (first (file-info (string path-name "/..namedfork/rsrc"))))
                    (set 'resourceforksize 0))
                ; put composite file size and file name into dupe-list
                (push (cons (+ dataforksize resourceforksize ) path-name ) dupe-list -1 )))))

; start
(if (> (length (main-args)) 2)
    (dolist (folder (2 (main-args)))
        (println "... gathering files in folder " (real-path folder) "\n")
        (walk-tree folder))
    ; no folders given: use the current directory
    (begin
        (println "... gathering files in folder " (real-path) "\n")
        (walk-tree (real-path))))

(println "... sorting " (length dupe-list) " items\n")
(set 'dupe-list (sort dupe-list )) ; sort by size - very important!
(println "... duplicates are: \n")

; see if two adjacent items have the same size

; this is a kludge to avoid an error
; If we start with item 1, we have no 'previous' pair for comparison
; I'd really like to start at item 2...
(set 'previous (last dupe-list))

(dolist (current dupe-list)
    (if (= (first current) (first previous)) ; current same size as previous?
        (begin
            ; same size, compare md5 checksums

            (set 'current-dataforkmd5
                (exec (format {md5 -q "%s"} (last current)))) ; fails if file contains double quote?

            (set 'current-resourceforkmd5
                (exec (format {md5 -q "%s/..namedfork/rsrc"} (last current))))

            (set 'previous-dataforkmd5
                (exec (format {md5 -q "%s"} (last previous))))

            (set 'previous-resourceforkmd5
                (exec (format {md5 -q "%s/..namedfork/rsrc"} (last previous))))

            (and
                (> (+ (first current) (first previous)) 0) ; not 0
                (!= (last current) (last previous)) ; not the same file (guards the kludge above)
                (= current-dataforkmd5 previous-dataforkmd5 )
                (= current-resourceforkmd5 previous-resourceforkmd5)
                (println (format "     %12d %s" (first previous) (last previous)))
                (println (format "   = %12d %s" (first current) (last current)))
                (set-spotlight-comment (last current) (string "duplicate " (last previous))))))
    ; remember this one for the next comparison
    (set 'previous current))

(println "... finished")

Update 2006-01-06

Every time I run this file, I find things that need improving slightly. I noticed recently that the /rsrc technique for looking at resource forks is now deprecated. I should be looking for ‘namedresource’ or something. I will change it one day.

The script also seems to not like a few files. I will investigate this one day. Luckily the script doesn’t do anything to the file system…

Update 2006-01-13

I’ve changed the resource-fork handling, and changed the way the filenames are quoted. The script was having problems with filenames that have single quotes in them. I found a whole bunch of font files that used the possessive apostrophe in their names. By using format with curly brackets, I hope that the quoted filenames can pass through the shell without damage.
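For anyone who wants something sturdier, here is a rough sketch of another way to quote a filename for the shell. It assumes a hypothetical helper called shell-quote, which is not in the script above: the path is wrapped in single quotes, and each single quote inside it is replaced by '\'' (close the quote, add an escaped quote, reopen the quote). Inside single quotes the shell leaves double quotes, spaces and dollar signs alone.

```newlisp
;; hypothetical helper, for illustration only
;; wraps path in single quotes and rewrites each embedded
;; single quote as '\'' so the shell sees one literal argument
(define (shell-quote path)
    (string "'" (replace "'" path {'\''}) "'"))

(println (shell-quote "Bob's Font"))
;; prints 'Bob'\''s Font'

;; it could then stand in for the double-quoted "%s" in the script, e.g.:
;; (exec (string "md5 -q " (shell-quote (last current))))
```

I haven't tried this against the font files that caused the trouble, so treat it as a starting point.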

Update 2006-04-20

More minor corrections, including calls to (real-path), and a call to Spotlight via osascript. This significantly slows the script down, so comment out the call to (set-spotlight-comment) if speed is important.


1 Comment »

  1. You’re not the only one who ran into this duplicate file problem. I ended up writing a Java application to solve the same thing. I doubt that it handles Macintosh resource forks properly, but the download web page is http://clubweb.interbaun.com/fenske/cporder.htm and see the entry for “Find Duplicate Files”. The documentation does note that this is basically a LISP problem! Signed: Keith.

    Comment by Anonymous — March 9, 2007 @ 21:29 | Reply
