TEXT_SIMILARITY(1)    User Contributed Perl Documentation   TEXT_SIMILARITY(1)

NAME
       text_similarity.pl - Measure the pairwise similarity between files or
       strings

SYNOPSIS
        text_similarity.pl --type Text::Similarity::Overlaps --normalize
                                --string '.......this is one' '????this is two'

        text_similarity.pl --type Text::Similarity::Overlaps --no-normalize
                                --string '.......this is one' '????this is two'

        text_similarity.pl --type Text::Similarity::Overlaps
                                --string 'sir winston churchill' 'Churchill, Winston Sir'

        text_similarity.pl --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt

        text_similarity.pl --verbose --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt

        text_similarity.pl --verbose --stoplist stoplist.txt --type Text::Similarity::Overlaps
                               ../GPL.txt ../FDL.txt

        text_similarity.pl [[--verbose] [--stoplist=FILE] [--no-normalize] [--string]
                               --type=TYPE | --help | --version] FILE1 FILE2

DESCRIPTION
       This script is a simple command-line interface to the Text::Similarity
       Perl modules. A method for computing similarity must be specified via
       the --type option, and then that method is used to measure the
       similarity of two strings or two files.

       Text::Similarity::Overlaps measures similarity by counting the number
       of words that overlap (match) between the two inputs, without regard to
       order. So, all of the following strings would have the same pairwise
       similarity (they would each have a raw score of 4 relative to each
       other, meaning that 4 words are overlapping or matching).

        winston churchill was here
        here was winston churchill
        winston was here churchill
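
       The order-independent counting described above can be illustrated
       with a minimal Python sketch. It is not the Text::Similarity
       implementation: the function name raw_overlap is hypothetical, and
       the sketch simply treats each input as a bag of lowercased,
       whitespace-separated words, which is an assumption about
       tokenization.

```python
from collections import Counter


def raw_overlap(text1: str, text2: str) -> int:
    """Count the words shared by both texts, ignoring word order."""
    words1 = Counter(text1.lower().split())
    words2 = Counter(text2.lower().split())
    # Counter & Counter keeps each word with the smaller of its two counts,
    # so a word repeated in only one text is not counted twice.
    return sum((words1 & words2).values())


strings = [
    "winston churchill was here",
    "here was winston churchill",
    "winston was here churchill",
]

# Compare every string with every other string.
scores = [raw_overlap(a, b) for a in strings for b in strings if a is not b]
print(scores)  # each pair shares all 4 words
```

       Because only word counts are compared, reordering the words of
       either input leaves the score unchanged.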

       By default, Text::Similarity::Overlaps returns a normalized F-measure
       between 0 and 1. Normalization can be turned off by specifying
       --no-normalize, in which case the raw score is returned. Specifying
       --verbose reports several other overlap-based scores as well.
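
       As a rough illustration of how a raw overlap count can be turned
       into scores between 0 and 1, the Python sketch below derives the
       measures named under --verbose from a raw score and the two input
       lengths in words. The function name overlap_scores and the exact
       formulas (precision taken relative to the second input, recall
       relative to the first, E-measure as 1 minus F-measure) are
       assumptions made for this example and may not match the module in
       every detail.

```python
import math


def overlap_scores(raw_score: int, len1: int, len2: int) -> dict:
    """Derive normalized scores from a raw overlap count (assumed formulas)."""
    if raw_score == 0 or len1 == 0 or len2 == 0:
        # Nothing matched: every similarity score is 0, the E-measure is 1.
        return {"precision": 0.0, "recall": 0.0, "f_measure": 0.0,
                "e_measure": 1.0, "cosine": 0.0, "dice": 0.0}
    precision = raw_score / len2  # assumption: share of the second input matched
    recall = raw_score / len1     # assumption: share of the first input matched
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "e_measure": 1 - f_measure,
        "cosine": raw_score / math.sqrt(len1 * len2),
        "dice": 2 * raw_score / (len1 + len2),
    }


# Hypothetical inputs: 3 matching words, 4 words in input 1, 6 in input 2.
scores = overlap_scores(3, 4, 6)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

       With identical inputs (for example a raw score of 4 and two inputs
       of 4 words each) every similarity score in this sketch is 1 and the
       E-measure is 0.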

OPTIONS
       --type=TYPE
           The type of text similarity measure.  Valid values include:

               Text::Similarity::Overlaps

       --stoplist=FILE
           The name of a file containing stop words. The ./sample directory
           provides an example of each supported format: one word per line
           (stoplist.txt) and one regular expression per line
           (stoplist-nsp.regex). A stop word file may also mix the two
           formats.

       --no-normalize
           Do not normalize scores.  Normally, scores are normalized so that
           they range from 0 to 1.  Using this option will give you a raw
           score instead.

       --string
           Input will be provided on the command line as strings, not files.

       --verbose
           Show all the matches that are found between the files, their length
           and frequency, as well as precision, recall, F-measure, E-measure,
           Cosine, and the Dice Coefficient.

       --help
           Show a detailed help message.

       --version
           Show version information.
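
       The idea of a stop word file that mixes plain words and regular
       expressions, as described under --stoplist, can be sketched in
       Python as follows. This is an illustration only and not how the
       script reads its stop list: the helper names load_stoplist and
       is_stop_word are hypothetical, and the sketch assumes that regular
       expression entries are written between slashes (for example
       /\bof\b/) while every other non-empty line is a plain word.

```python
import re


def load_stoplist(lines):
    """Split stop-list lines into plain words and compiled regexes."""
    words, patterns = set(), []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        # Assumption: a regex entry is delimited by slashes, e.g. /\bof\b/
        if len(entry) >= 2 and entry.startswith("/") and entry.endswith("/"):
            patterns.append(re.compile(entry[1:-1]))
        else:
            words.add(entry.lower())
    return words, patterns


def is_stop_word(word, words, patterns):
    """A word is a stop word if it is listed or fully matches a regex."""
    return word.lower() in words or any(p.fullmatch(word) for p in patterns)


# A hypothetical mixed stop list: one plain word and two regex entries.
mixed = ["the", r"/\bof\b/", "", r"/an?/"]
words, patterns = load_stoplist(mixed)

kept = [w for w in "the fall of an empire".split()
        if not is_stop_word(w, words, patterns)]
print(kept)  # ['fall', 'empire']
```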

AUTHORS
        Ted Pedersen, University of Minnesota, Duluth
        tpederse at d.umn.edu

        Jason Michelizzi

        Ying Liu, University of Minnesota, Twin Cities
        liux0395 at umn.edu

       Last modified by: $Id: text_similarity.pl,v 1.4 2010/06/10 21:31:24
       liux0395 Exp $

BUGS
       The --compfile option does not work and appears to cause the program
       to hang (tdp 3/21/08).

COPYRIGHT AND LICENSE
       Copyright (C) 2004-2010, Jason Michelizzi, Ted Pedersen and Ying Liu

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with this program; if not, write to the Free Software Foundation, Inc.,
       59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

perl v5.20.2                      2011-09-29                TEXT_SIMILARITY(1)