DragonFly BSD
DragonFly users List (threaded) for 2011-05

Re: Easy way to find identify files which share some content/blocks

From: Justin Sherrill <justin@xxxxxxxxxxxxxxxxxx>
Date: Mon, 2 May 2011 12:51:30 -0400

You could dump out the B-tree information.  I don't know how clear a
picture would come from that, and it may require some massaging of
the data anyway, since files that are not duplicates of each other may
still share some identical blocks, especially when dealing with larger
image files.

If you are sure that the corruption lies at the end of the files, you
could loop over the files, read the first x bytes of each, then MD5
that data.  Matching MD5 = matching first x bytes, so the files are
most likely versions of the same picture.
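
A rough sketch of that loop in Python, in case it helps.  The function
names and the 64 KB default for x are my own placeholders, not anything
HAMMER provides, and it assumes the damage really is confined to the
tail of each file:

```python
import hashlib
import os
from collections import defaultdict


def prefix_digest(path, nbytes):
    """Return the MD5 hex digest of the first nbytes of the file at path."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def group_by_prefix(paths, nbytes=65536):
    """Group files whose first nbytes are byte-for-byte identical.

    Returns a dict mapping digest -> sorted list of paths, keeping only
    digests shared by at least two files.
    """
    groups = defaultdict(list)
    for path in paths:
        # Files shorter than the prefix are skipped (assumption): their
        # hash would cover the whole file rather than a comparable prefix.
        if os.path.getsize(path) < nbytes:
            continue
        groups[prefix_digest(path, nbytes)].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

To use it, collect the paths with os.walk() and pass them to
group_by_prefix(); every list it returns is a set of candidates to
compare by size, the shortest one probably being the truncated copy.
Pick x smaller than the smallest broken file you expect, otherwise
that file is skipped.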

On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
<fwd+usenet-spam2011q2@bsd-solutions-duesseldorf.de> wrote:
> Hello,
> now that Dragonfly's HAMMER has got deduplication I ask myself if there
> is a simple way to identify "pairs" or groups of files which share a lot
> of data, i.e. are mostly identical.
> I have a rather large repository of downloaded pictures, which contain
> a lot of dupes in multiple locations. I have no problems finding those
> given some time and a shell prompt.
> I'm interested in identifying broken files. Broken in the sense that
> A is an incomplete version of B (some bytes missing), or B a damaged
> version of A (some additional bytes at the end).
> Is there a way to get to something like this:
> "File A shares 1234 (98.3%) data blocks with file B"
> "File A shares xxxx (xx.x%) data blocks with file C"
> Getting a step closer helps too.
> Thanks for any insights.
> Regards
> Thomas
