SAMEFILE(1) JS SAMEFILE(1)
NAME
samefile - find identical files
SYNOPSIS
samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]
DESCRIPTION
samefile reads a list of filenames (one filename per line) from stdin.
For each pair of files with identical contents, a line consisting of
six fields is output: the size in bytes, the two filenames, the character
``='' if the two files are on the same device or ``X'' otherwise, and the
link counts of the two files. The output is sorted in reverse order by
size as the primary key and by filename as the secondary key.
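The six-field record described above can be picked apart with awk(1). The following is a minimal sketch, assuming the default tab separator; the sample line is taken from the INTERNALS section, and the function name describe is only illustrative.

```shell
# Sample samefile output line (tab-separated, the default separator).
sample=$(printf '100\tfile1\tfile4\t=\t3\t2')

# Split one record into its six fields and label them.
describe() {
    awk -F '\t' '{
        printf "size=%s first=%s second=%s samedev=%s links=%s,%s\n",
               $1, $2, $3, ($4 == "=" ? "yes" : "no"), $5, $6
    }'
}

printf '%s\n' "$sample" | describe
# prints: size=100 first=file1 second=file4 samedev=yes links=3,2
```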
OPTIONS
-0 Indicates that the input list of file names is NUL terminated,
for example as generated by implementations of find(1) that
support the -print0 option. Without this option, the file names
are assumed to be newline terminated.
-a Do not sort files with same size alphabetically.
-g size
Compare only files with size greater than size bytes. Default is
0.
-i Allow files with the same device/i-node pair to be added to the
binary tree. This might be useful if output will be fed into
some other program. If this option is used, the statistics
displayed when using -v will not contain the ``You have a total
of x bytes in identical files'' line because -i prohibits proper
calculation of this value.
-l Do not check if files with identical contents are hard links
created by ln(1). By default, samefile checks if files with
identical contents are hard linked and, if they are, does not
write a name pair to stdout. A slight speedup is gained when
using this option. This option is incompatible with the -r
option.
-q Do not issue warning messages when open(2) fails. When you
encounter such a warning, open probably failed due to a
'permission denied' error on files or directories for which you
have no read permission. Useful if you are not root and want to
compare your files against files in a system directory like /etc.
-r Report whether identical files are hard linked. The separator
string followed by the [bracketed] link count is appended to
each name pair if they are hard links created with ln. This
option is incompatible with the -l option. Note that this kind
of output has only four fields and will appear unsorted before
the actual output of samefile.
-s sep Use string sep as the output field separator, defaults to a tab
character. Useful if filenames contain tab characters and output
must be processed by another program, say awk(1).
-V Print the version information and exit.
-v Verbose mode. Write some statistical messages about memory usage
and work reduction as well as the sum of the sizes of all
identical files to stderr.
-x Switch off intelligence. This option prevents samefile from
being smart. If files file1, file2 and file3 are identical, it
will do 3 comparisons instead of just the two needed and write
more output. See the discussion under INTERNALS for why this could
be useful. If this option is used, the statistics displayed
when using -v will not contain the ``You have a total of x bytes
in identical files'' line because -x prohibits proper
calculation of this value.
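The reason for the -0 option can be illustrated without samefile itself. The sketch below, which assumes a find(1) that supports -print0, creates one file whose name contains a newline and shows that a newline-terminated list splits it into two lines while a NUL-terminated list keeps it as one record. The temporary directory and file name are illustrative only.

```shell
# Illustration only: why newline-terminated name lists are ambiguous.
tmp=$(mktemp -d)
# One file whose name contains an embedded newline.
touch "$tmp/odd
name"

# Newline-terminated list: the single file occupies two lines.
lines=$(find "$tmp" -type f -print | wc -l)
# NUL-terminated list: exactly one terminator per file.
nuls=$(find "$tmp" -type f -print0 | tr -cd '\000' | wc -c)

echo "lines=$((lines)) nuls=$((nuls))"
rm -rf "$tmp"

# With samefile installed, the robust form would be:
#   find . -type f -print0 | samefile -0
```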
INTERNALS
samefile uses two stages to give optimum performance.
In the first stage, all non-plain files are skipped (directories,
devices, FIFOs, sockets, symbolic links) as well as files for which
stat(2) fails and files whose size is less than or equal to the size
given with -g.
Output of the first stage (the filenames) is written into a binary tree
with one node for every file size. It is also at this early stage
where checks for hard links are done. If hard links are found, and -r
is requested, the name pairs are output immediately. The whole list of
hard linked name pairs will therefore appear before any output of the
second stage.
For any i-node only one filename will be added to the binary tree
(unless -i was requested.)
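The filtering and grouping of the first stage can be approximated in shell. This is a rough sketch under stated assumptions, not the program's actual implementation: it uses wc(1) instead of stat(2), a sorted list instead of a binary tree, and it omits the i-node and hard link checks. The function names size_list and candidates are hypothetical.

```shell
# Stage-one sketch: keep plain files larger than the given minimum and
# emit "size<TAB>name" for each of them. Reads names from stdin.
size_list() {
    min=${1:-0}
    while IFS= read -r f; do
        # Skip symbolic links and anything that is not a plain file.
        [ -L "$f" ] && continue
        [ -f "$f" ] || continue
        size=$(wc -c < "$f") || continue   # unreadable: skip
        size=$((size))                     # strip padding some wc versions add
        [ "$size" -gt "$min" ] && printf '%s\t%s\n' "$size" "$f"
    done
    return 0
}

# Only sizes shared by at least two files need a content comparison.
candidates() {
    tab=$(printf '\t')
    sort -t "$tab" -k1,1nr -k2,2 | awk -F '\t' '
        { n[$1]++; rows[NR] = $0; sz[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (n[sz[i]] > 1) print rows[i] }'
}

# Possible use:
#   find . -type f | size_list 0 | candidates
```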
In the second stage all files having the same size are compared against
each other. The rules of mathematical logic are applied to reduce work
and output noise (unless -x is requested): if files a, b, and c have
the same size and samefile finds that a = b and a = c then it will not
compare b against c (and will not output a line for b and c) but only
for a = b and a = c. Note, however, that because only the first filename
per i-node gets into the second stage, the output for a group of
identical files with different i-node numbers is also minimized.
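The work reduction of the second stage can be sketched with cmp(1): each file is compared only against one representative per set of identical files, so once a = b and a = c are known, b is never compared against c. This is an illustration of the idea, not the program's actual code; the function name compare_group is hypothetical, and the input is assumed to be a list of names (one per line) already known to share a size.

```shell
# Stage-two sketch: compare same-size files, one representative per class.
compare_group() {
    nl='
'
    reps=""                       # newline-separated class representatives
    while IFS= read -r f; do
        matched=0
        # A here-document keeps the inner loop in the current shell,
        # so the value of "matched" survives it.
        while IFS= read -r r; do
            [ -n "$r" ] || continue
            if cmp -s "$r" "$f"; then
                printf '%s\t%s\n' "$r" "$f"
                matched=1
                break
            fi
        done <<EOF
$reps
EOF
        # No representative matched: f starts a new class.
        [ "$matched" -eq 1 ] || reps="$reps$f$nl"
    done
}

# Possible use, for files a, b, c and d of equal size:
#   printf '%s\n' a b c d | compare_group
```

For three identical files a, b and c this writes the pairs a/b and a/c only, which mirrors the default behaviour described above; comparing every file against every other one would correspond to -x.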
Suppose you have six identical files of size 100 in an i-node group
consisting of the three i-nodes with numbers 10, 20 and 30 (the term
'i-node group' has nothing to do with the i-node group notion of some
file systems - it merely refers to a set of i-nodes addressing files
with identical contents):
$ ls -i
10 file1 20 file4 30 file6
10 file2 20 file5
10 file3
$ ls | samefile
100 file1 file4 = 3 2
100 file1 file6 = 3 1
The sum of the sizes in the first column is the amount of disk space
you could gain by making all six files links to only one file or by
removing all but one of the files. To be precise, disk space is allocated in
blocks - you will probably gain two blocks here, rather than 200 bytes.
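The sum mentioned above can be computed from the first column with awk(1). A minimal sketch, assuming tab-separated output; the function name reclaimable_bytes is illustrative, and the two input lines are the sample output shown above.

```shell
# Sum the size column (field 1) of samefile output read from stdin.
reclaimable_bytes() {
    awk -F '\t' '{ total += $1 } END { print total + 0 }'
}

printf '100\tfile1\tfile4\t=\t3\t2\n100\tfile1\tfile6\t=\t3\t1\n' |
    reclaimable_bytes
# prints: 200
```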
Note that it is not enough to just remove file4 and file6 (you would
gain only 100 bytes because file5 still exists). The proper way is to
use the -i option. The output will look like:
100 file1 file2 = 3 3
100 file1 file3 = 3 3
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file1 file6 = 3 1
Removing all files listed in the third field will leave only file1.
Making all files hard links to file1 is easy: if the fourth field is
``='', do a forced hard link (ln -f). If you need to know about all
combinations of identical files, use both the -i and -x
options. This produces:
$ ls | samefile -ix
100 file1 file2 = 3 3
100 file1 file3 = 3 3
100 file1 file4 = 3 2
100 file1 file5 = 3 2
100 file1 file6 = 3 1
100 file2 file3 = 3 3
100 file2 file4 = 3 2
100 file2 file5 = 3 2
100 file2 file6 = 3 1
100 file3 file4 = 3 2
100 file3 file5 = 3 2
100 file3 file6 = 3 1
100 file4 file5 = 2 2
100 file4 file6 = 2 1
100 file5 file6 = 2 1
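The relinking step described for the -i output (a forced hard link whenever the fourth field is ``='') might be scripted as follows. This is a sketch under stated assumptions: tab-separated input, no tab or newline characters in file names, and a hypothetical function name relink. It replaces files, so try it on a copy of your data first.

```shell
# Sketch: replace the second file of each pair by a hard link to the first.
# Reads samefile output from stdin. Only pairs on the same device
# (fourth field "=") can be hard linked.
relink() {
    tab=$(printf '\t')
    while IFS="$tab" read -r size first second dev links1 links2; do
        if [ "$dev" = "=" ]; then
            ln -f -- "$first" "$second"
        else
            printf 'skipped (different device): %s\n' "$second" >&2
        fi
    done
}

# Possible use:
#   ls | samefile -i | relink
```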
EXAMPLES
Find all identical files in the current working directory:
$ ls | samefile
Find all identical files in my HOME directory and subdirectories and
also tell me if there are hard links:
$ find "$HOME" -type f | samefile -r
Find all identical files in the /usr directory tree that are bigger
than 10000 bytes and write the result to usr.dups. This one is for
system administrators; because the command takes a few minutes, you may
want to run it in the background by appending an ampersand (&).
$ find /usr -type f | samefile -g 10000 >usr.dups
DIAGNOSTICS
You will see a short usage message if you use an invalid option.
malloc - free = xxxx
I didn't free the memory I've malloc(3)ed. You found a bug.
Please report it to the author.
Allocation failed for 'expr' ...
Oops! You ran out of virtual memory. You must have a really big
filename list. Try to use a smaller one or increase the resources
available to your processes. For more information see ulimit(1)
or the equivalent builtin of your shell.
SEE ALSO
ln(1), find(1), rm(1), df(1)
BUGS
There are no known bugs. The source has been lint(1)ed and all possible
care has been taken while coding. If you find a bug (or miss a feature)
please contact the author.
HOME
The official samefile home page, www.schweikhardt.net/samefile/, is
maintained by the author, Jens Schweikhardt (schweikh at schweikhardt
dot net).
7 AUGUST 2005 SAMEFILE(1)