DragonFly BSD
DragonFly users List (threaded) for 2011-05
Easy way to identify files which share some content/blocks

From: Thomas Keusch <fwd+usenet-spam2011q2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Date: 01 May 2011 18:39:48 GMT


Now that DragonFly's HAMMER has deduplication, I wonder whether there
is a simple way to identify "pairs" or groups of files which share a lot
of data, i.e. are mostly identical.

I have a rather large repository of downloaded pictures, which contains
a lot of dupes in multiple locations. I have no problem finding those,
given some time and a shell prompt.

What I'm interested in is identifying broken files. Broken in the sense
that A is an incomplete version of B (some bytes missing), or B is a
damaged version of A (some additional bytes at the end).

Is there a way to get to something like this:

"File A shares 1234 (98.3%) data blocks with file B"
"File A shares xxxx (xx.x%) data blocks with file C"

Getting a step closer helps too.
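
To make the kind of report above concrete, here is a rough user-space
sketch. It does not use HAMMER's dedup data at all. It simply hashes each
file in fixed-size blocks and counts how many blocks of A also occur in B.
The 64 KiB block size and the names block_hashes, shared_blocks and report
are assumptions made up for this illustration.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # assumed block size, not taken from HAMMER


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return the SHA-256 digest of each fixed-size block of a file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).digest())
    return digests


def shared_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Count the blocks of A whose content also occurs somewhere in B.

    Returns (shared_count, percent_of_A_blocks).
    """
    a = block_hashes(path_a, block_size)
    b = set(block_hashes(path_b, block_size))
    shared = sum(1 for digest in a if digest in b)
    percent = 100.0 * shared / len(a) if a else 0.0
    return shared, percent


def report(path_a, path_b, block_size=BLOCK_SIZE):
    """Format the result in the style asked for above."""
    shared, percent = shared_blocks(path_a, path_b, block_size)
    return "File %s shares %d (%.1f%%) data blocks with file %s" % (
        path_a, shared, percent, path_b)
```

Calling print(report("A.jpg", "B.jpg")) on two hypothetical files would
print one such line. Because the blocks are aligned to fixed offsets, this
only catches files that are identical from the start, which covers the
truncated and extra-bytes-at-the-end cases. The final partial block of a
truncated file will not match, and bytes missing in the middle of a file
shift every later block, so those would go undetected.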

Thanks for any insights.


