TEXT_SIMILARITY(1) User Contributed Perl Documentation TEXT_SIMILARITY(1)
NAME
text_similarity.pl - Measure the pair-wise similarity between files or
strings
SYNOPSIS
text_similarity.pl --type Text::Similarity::Overlaps --normalize
--string '.......this is one' '????this is two'
text_similarity.pl --type Text::Similarity::Overlaps --no-normalize
--string '.......this is one' '????this is two'
text_similarity.pl --type Text::Similarity::Overlaps
--string 'sir winston churchill' 'Churchill, Winston Sir'
text_similarity.pl --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt
text_similarity.pl --verbose --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt
text_similarity.pl --verbose --stoplist stoplist.txt --type Text::Similarity::Overlaps
../GPL.txt ../FDL.txt
text_similarity.pl [[--verbose] [--stoplist=FILE] [--no-normalize] [--string]
--type=TYPE | --help | --version] FILE1 FILE2
DESCRIPTION
This script is a simple command-line interface to the Text::Similarity
Perl modules. A method for computing similarity must be specified via
the --type option, and then that method is used to measure the
similarity of two strings or two files.
Text::Similarity::Overlaps measures similarity by counting the number
of words that overlap (match) between the two inputs, without regard to
order. For example, any two of the following strings have the same
similarity: a raw score of 4, meaning that all 4 words overlap
(match):
winston churchill was here
here was winston churchill
winston was here churchill
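The order-independent count can be illustrated with a minimal Python sketch. It is not the module's implementation: it assumes words are whitespace-separated tokens compared case-insensitively, and it treats each input as a bag of words. The function name raw_overlap is hypothetical.

```python
from collections import Counter
from itertools import combinations

def raw_overlap(a: str, b: str) -> int:
    """Count words shared by a and b, ignoring word order.

    Assumption: a plain bag-of-words intersection; the real module may
    tokenize and match differently.
    """
    words_a = Counter(a.lower().split())
    words_b = Counter(b.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    return sum((words_a & words_b).values())

strings = [
    "winston churchill was here",
    "here was winston churchill",
    "winston was here churchill",
]

# Score every unordered pair of the three example strings.
pair_scores = [raw_overlap(a, b) for a, b in combinations(strings, 2)]
print(pair_scores)  # [4, 4, 4]
```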
By default, Text::Similarity::Overlaps returns a normalized F-measure
between 0 and 1. Normalization can be turned off by specifying
--no-normalize. With --verbose, the script also reports several other
overlap-based scores.
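As a rough illustration of how a raw overlap count can be normalized into an F-measure between 0 and 1, consider the sketch below. The definitions of precision and recall used here (raw score divided by the word count of the second and first input, respectively) are assumptions made for the example, as is the function name f_measure; the script's exact formulas may differ.

```python
def f_measure(raw_score: int, len1: int, len2: int) -> float:
    """Harmonic mean of precision and recall for a raw overlap count.

    Assumptions: precision = raw_score / len2 and recall = raw_score / len1,
    where len1 and len2 are the word counts of the two inputs.
    """
    if raw_score == 0 or len1 == 0 or len2 == 0:
        return 0.0
    precision = raw_score / len2
    recall = raw_score / len1
    return 2 * precision * recall / (precision + recall)

# 'this is one' vs 'this is two': 2 shared words, 3 words in each input.
print(round(f_measure(2, 3, 3), 4))  # 0.6667

# Inputs containing exactly the same 4 words reach the maximum score.
print(f_measure(4, 4, 4))  # 1.0
```

With --no-normalize, the corresponding output would be the raw count itself (2 and 4 in these two cases) rather than the normalized value.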
OPTIONS
--type=TYPE
The type of text similarity measure. Valid values include:
Text::Similarity::Overlaps
--stoplist=FILE
The name of a file containing stop words. The ./sample directory
provides an example of each supported format: one word per line
(stoplist.txt) and one regular expression per line
(stoplist-nsp.regex). A stop word file may also mix the two
formats.
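The following Python sketch shows one way a mixed stop word list could be read and applied. It assumes that a regular-expression entry is written between slashes (for example /\b(of|a)\b/) and that any other non-empty line is a literal word; that delimiter convention, the function names, and the whole-token matching rule are assumptions for illustration, not a description of what the script does internally.

```python
import re

def load_stoplist(lines):
    """Split stoplist lines into literal words and compiled regexes.

    Assumption: entries of the form /.../ are regular expressions;
    every other non-empty line is a plain word.
    """
    words, patterns = set(), []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if len(entry) >= 2 and entry.startswith("/") and entry.endswith("/"):
            patterns.append(re.compile(entry[1:-1]))
        else:
            words.add(entry.lower())
    return words, patterns

def remove_stop_words(text, words, patterns):
    """Drop tokens that equal a stop word or fully match a stop regex."""
    kept = []
    for token in text.split():
        is_literal = token.lower() in words
        is_regex = any(p.fullmatch(token) for p in patterns)
        if not (is_literal or is_regex):
            kept.append(token)
    return kept

# A mixed list: two plain words and one regular-expression entry.
stop_words, stop_patterns = load_stoplist(["the", r"/\b(of|a)\b/", "and"])
print(remove_stop_words("the history of a war and peace",
                        stop_words, stop_patterns))
# ['history', 'war', 'peace']
```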
--no-normalize
Do not normalize scores. Normally, scores are normalized so that
they range from 0 to 1. Using this option will give you a raw
score instead.
--string
Input will be provided on the command line as strings, not files.
--verbose
Show all the matches that are found between the files, their length
and frequency, as well as precision, recall, F-measure, E-measure,
Cosine, and the Dice Coefficient.
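To show how the scores listed above relate to one another, the sketch below derives them from a raw overlap count and the word counts of the two inputs. The formulas are common textbook definitions and are assumptions here (in particular, which input serves as the denominator for precision and recall, and E-measure taken as 1 minus F-measure); the function name overlap_measures is hypothetical and the script's own output may be computed differently.

```python
import math

def overlap_measures(raw_score: int, len1: int, len2: int) -> dict:
    """Derive several overlap-based scores from a raw overlap count.

    Assumed definitions (not taken from the script's source):
      precision = raw / len2, recall = raw / len1,
      F = harmonic mean of precision and recall, E = 1 - F,
      cosine = raw / sqrt(len1 * len2), dice = 2 * raw / (len1 + len2).
    """
    if raw_score == 0 or len1 == 0 or len2 == 0:
        return {"precision": 0.0, "recall": 0.0, "f": 0.0,
                "e": 1.0, "cosine": 0.0, "dice": 0.0}
    precision = raw_score / len2
    recall = raw_score / len1
    f = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f": f,
        "e": 1.0 - f,
        "cosine": raw_score / math.sqrt(len1 * len2),
        "dice": 2 * raw_score / (len1 + len2),
    }

# 2 overlapping words between a 3-word input and a 5-word input.
for name, value in overlap_measures(2, 3, 5).items():
    print(f"{name}: {value:.4f}")
```

Under these assumed definitions the Dice coefficient and the F-measure are algebraically identical, so the two values agree up to floating-point rounding.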
--help
Show a detailed help message.
--version
Show version information.
AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Jason Michelizzi
Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu
Last modified by: $Id: text_similarity.pl,v 1.4 2010/06/10 21:31:24
liux0395 Exp $
BUGS
--compfile is not working, seems to cause hang (tdp 3/21/08)
COPYRIGHT AND LICENSE
Copyright (C) 2004-2010, Jason Michelizzi, Ted Pedersen and Ying Liu
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
perl v5.20.2 2011-09-29 TEXT_SIMILARITY(1)