ANNOYANCE-FILTER(1)    DragonFly General Commands Manual   ANNOYANCE-FILTER(1)

NAME
       annoyance-filter - automatically detect junk mail

SYNOPSIS
       annoyance-filter [ options ]

DESCRIPTION
       annoyance-filter uses Bayesian statistics to determine the probability
       an E-mail message is junk based on an analysis of its contents compared
       to collections of known junk and legitimate E-mail.

       The current version of this program is always posted at:
                      http://www.fourmilab.ch/annoyance-filter/
       Please visit this page for news about the program and to download the
       latest version.

       The project is hosted on SourceForge, where you will find the CVS
       source code repository and release archives:
                  http://sourceforge.net/projects/annoyancefilter/

USAGE
       annoyance-filter has a multitude of options which permit it to be used
       in many different ways, but the most common application involves
       training the program with collections of legitimate and junk mail in
       order to create a dictionary which indicates the probability that words
       identify a message as junk or non-junk (legitimate).  Training must be
       done before the program is used to classify incoming mail, but need be
       done subsequently only when adding messages to the training
       collections.  As long as the overall content of the mail, junk and
       legitimate, which you receive remains pretty much the same, there's no
       need to retrain, but the ability to do so allows the program to
       automatically adapt to evolving message content, which is particularly
       characteristic of junk mail.

       Suppose you have a collection of legitimate mail (in other words, mail
       you wish to read) in a file named m-good and a collection of junk mail
       (that which you don't wish to read) in file m-junk.  These collections
       may be in ``Unix mail folder'' format, which is simply the text of one
       or more E-mail messages concatenated together in a single text file, or
       may be the names of directories containing files, each of which may be
       a single E-mail message or a Unix mail folder.  In either case, if a
       message file is compressed with gzip, it will be automatically
       uncompressed on the fly.  Directories of messages may not, however,
       contain other directories of messages.

       To train annoyance-filter with these collections and create a
       dictionary, use a command like:

       annoyance-filter --mail m-good --junk m-junk --prune --write dict.bin

       where dict.bin is the name of the dictionary file you wish to create.

       Now that the dictionary has been created, you can use it on subsequent
       runs to compute the probability a message is junk and classify it
       accordingly.  Suppose you have an E-mail message in the file mail.txt.
       To compute its junk probability and display it on standard output, use the
       command:

       annoyance-filter --read dict.bin --test mail.txt

       To integrate annoyance-filter into a mail processing system such as
       procmail, you'll usually want to run it as a filter which reads
       incoming messages from standard input (piped there by the mail
       processing system), classifies them and adds annotations to the message
       header indicating the classification, then writes the message with
       header annotations to standard output.  The mail processing system may
       then examine the header annotations and route the message accordingly.
       To filter a message, again assuming the dictionary created by the
       training run is in the file dict.bin, use the command:

       annoyance-filter --read dict.bin --transcript - --test -

       Here the --transcript option is used to request the input message be
       copied to an output file, in this case standard output, specified by
       ``-'', with the message read from standard input, the ``-'' argument to
       the --test option.
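
       As a sketch of how a calling script might use the annotated
       transcript, the shell function below pipes a message through the
       filter command shown above and prints the value of the
       X-Annoyance-Filter-Classification header item.  The function name
       classification_of is hypothetical, and the sketch assumes the item is
       written in the usual ``Name: value'' header form.

```shell
# Hypothetical helper: print the classification annoyance-filter assigns
# to the message on standard input.  The dictionary defaults to dict.bin.
# Only the header is examined; it ends at the first blank line.
# Assumption: the header item is written as "Name: value".
classification_of() {
    dict=${1:-dict.bin}
    annoyance-filter --read "$dict" --transcript - --test - |
        sed -n -e '/^$/q' \
               -e 's/^X-Annoyance-Filter-Classification:[[:space:]]*//p'
}
```

       It might be called as:

       classification_of dict.bin < mail.txt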

OPTIONS
       Options are specified on the command line.  Options are treated as
       commands--most instruct the program to perform some specific action;
       consequently, the order in which they are specified is significant;
       they are processed left to right. Long options beginning with ``--''
       may be abbreviated to any unambiguous prefix; single-letter options
       introduced by a single ``-'' without arguments may be aggregated.

       --annotate options
                 Add the annotations requested by the characters in options to
                 the transcript generated by the --transcript option.  Upper
                 and lower case options are treated identically.  Available
                 annotations are:
                             d        Decoder diagnostics
                             p        Parser warnings and error messages
                             w        Most significant words and their
                                      probabilities

       --autoprune n
                 As the dictionary is being built by appending mail to it with
                 the --mail and --junk options, unique words will
                 automatically be pruned from it whenever the dictionary
                 exceeds approximately n bytes.  This is particularly handy
                 when loading large collections of messages with --phrasemax
                 set greater than one, as a very large number of unique
                 phrases may clutter the dictionary being built and exceed the
                 memory capacity of your computer.  You could split the mail
                 collection into multiple parts and explicitly --prune after
                 each part, but --autoprune is much more convenient.
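
                 A training run over a large collection might be wrapped as
                 in the sketch below.  The function name train_phrases and
                 the limit of 100000000 bytes are illustrative assumptions,
                 not recommended values.

```shell
# Hypothetical helper: train on phrases of up to two words while keeping
# memory use bounded.  Options are processed left to right, so
# --phrasemax and --autoprune must appear before the folders they are to
# affect.
train_phrases() {
    good=$1 junk=$2 dict=$3
    # 100000000 (about 100 MB) is an arbitrary illustrative limit.
    annoyance-filter --phrasemax 2 --autoprune 100000000 \
        --mail "$good" --junk "$junk" --prune --write "$dict"
}
```

                 For example: train_phrases m-good m-junk dict.bin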

       --biasmail n
                 The frequency of words appearing in legitimate mail is
                 inflated by the floating point factor n, which defaults to 2.
                 This biases the classification of messages in favour of
                 ``false negatives''--junk mail deemed legitimate, while
                 reducing the probability of ``false positives'' (legitimate
                 mail erroneously classified as junk, which is bad).  The
                 higher the setting of --biasmail, the greater the bias in
                 favour of false negatives will be.

       --binword n
                 Binary character streams (for example, attachments of
                 application-specific files, including the executable code of
                 worm and virus attachments) are scanned and contiguous
                 sequences of alphanumeric ASCII characters n characters or
                 longer are added to the list of words in the message.  The
                 dollar sign (``$'') is considered an alphanumeric character
                 for these purposes, and words may have embedded hyphens and
                 apostrophes, but may not begin or end with those characters.
                 If --binword is set to zero, scanning of binary attachments
                 is disabled entirely.  The default setting is 5 characters.

       --bsdfolder
                 The next --mail or --junk folder will be parsed using
                 ``classic BSD'' rules for identifying the start of individual
                 messages in the folder.  In BSD-style folders, the text
                 ``From '' as the leftmost characters of a line always denotes
                 the start of a new message: any appearance of this text in
                 any other context is always quoted, often by prefixing a
                 ``>'' character.  In the default Unix folder syntax,
                 ``From '' only marks the start of a new message if it appears
                 following one or more blank lines.  Note that you must
                 specify --bsdfolder before each folder to be read with BSD
                 rules; it is not a modal setting.
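
                 To judge which rule suits a given folder, the number of
                 message starts found under each can be compared.  The two
                 awk-based functions below (hypothetical names) only
                 approximate the rules as described here; the program's own
                 parser may apply further checks.

```shell
# BSD rule: every line beginning with "From " starts a message.
count_bsd() {
    awk '/^From / { n++ } END { print n + 0 }' "$1"
}

# Default Unix rule: "From " starts a message only after a blank line.
# Assumption: the first line of the file also counts as a start.
count_unix() {
    awk 'BEGIN { blank = 1 }
         /^From / && blank { n++ }
         { blank = ($0 == "") }
         END { print n + 0 }' "$1"
}
```

                 If count_bsd reports more messages than count_unix, some
                 messages in the folder are not preceded by a blank line and
                 --bsdfolder is probably needed.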

       --classify fname
                 Classify mail in fname.  If the message's junk probability
                 equals or exceeds the junk threshold (see --threshjunk),
                 ``JUNK'' is written to standard output and the program exits
                 with status code 3.  If the
                 message scores less than or equal to the mail threshold (see
                 --threshmail), ``MAIL'' is written to standard output and the
                 program exits with status 0.  If the message's score falls
                 between the two thresholds, its content is deemed
                 indeterminate; ``INDT'' is written to standard output and the
                 program exits with a status of 4.  The output can be used to
                 set an environment variable in Procmail to control the
                 disposition of the message.  If fname is ``-'' the message is
                 read from standard input.
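
                 A calling script can branch on the exit status rather than
                 parse the output.  The sketch below assumes a hypothetical
                 function name, route_message, and hypothetical routing
                 labels; only the status codes come from this manual.

```shell
# Hypothetical helper: print a routing decision for the message in $2
# using the dictionary in $1.
route_message() {
    # The word written to standard output (MAIL, JUNK or INDT) is
    # discarded; only the exit status is used.
    status=0
    annoyance-filter --read "$1" --classify "$2" >/dev/null || status=$?
    case $status in
        0) echo deliver ;;        # legitimate mail
        3) echo junk-folder ;;    # junk
        4) echo review ;;         # indeterminate
        *) echo "annoyance-filter failed with status $status" >&2
           return 1 ;;
    esac
}
```

                 Statuses 1 and 2 (see EXIT STATUS) fall through to the
                 final branch and are reported as failures.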

       --clearjunk
                 Clear appearances of words in junk mail from database.  Used
                 when preparing a database of legitimate mail.

       --clearmail
                 Clear appearances of words in legitimate mail from database.
                 Used when preparing a database of junk mail.

       --copyright
                 Print copyright information.

       --csvread fname
                 Import a dictionary from a comma-separated value (CSV) file
                 fname.  Records are assumed to be in the format written by
                 --csvwrite but need not be sorted in any particular order.
                 Words are added to those already in memory.

       --csvwrite fname
                 Export the dictionary as a comma-separated value (CSV) file
                 fname.  Such files can be loaded into spreadsheet or
                 database programs for further processing.  Words are
                 sorted first in ascending order of probability they denote
                 junk mail, then lexically.
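
                 Because of this ordering, the strongest junk indicators are
                 the last records in the file.  The sketch below (the
                 function name top_junk_words is hypothetical) relies only
                 on that ordering; the column layout of the CSV file is not
                 described here, so records are printed unmodified.

```shell
# Hypothetical helper: show the $2 records (default 10) of the
# dictionary in $1 that most strongly indicate junk.
top_junk_words() {
    csv=$(mktemp) || return 1
    annoyance-filter --read "$1" --csvwrite "$csv" || {
        rm -f "$csv"
        return 1
    }
    # Records are sorted by ascending junk probability, so the
    # strongest junk indicators are the last lines of the file.
    tail -n "${2:-10}" "$csv"
    rm -f "$csv"
}
```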

       --fread, -r fname
                 Load a fast dictionary (previously created with the --fwrite
                 option) from file fname.

       --fwrite fname
                 Write a dictionary to the file fname in fast dictionary
                 format.  Fast dictionaries are written in a binary format
                 which is not portable across machines with different byte
                 order conventions and cannot be added incrementally to
                 assemble a larger dictionary, but can be loaded in a small
                 fraction of the time required by the format created by the
                 --write command.  Using a fast dictionary for routine
                 classification of incoming mail drastically reduces the time
                 consumed in loading the dictionary for each message.
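
                 One possible arrangement, sketched below with hypothetical
                 function and file names, keeps the portable dictionary as
                 the master copy and derives a fast dictionary from it on
                 each machine that classifies mail.

```shell
# Hypothetical helper: derive a fast dictionary ($2) from a portable
# one ($1).  Rerun it after retraining, and on each machine, since fast
# dictionaries are not portable across byte orders.
make_fast_dictionary() {
    annoyance-filter --read "$1" --fwrite "$2"
}

# Hypothetical helper: classify the message in $2 using the fast
# dictionary in $1.  The exit status is that of --classify.
classify_fast() {
    annoyance-filter --fread "$1" --classify "$2"
}
```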

       --help, -u
                 Print how-to-call information including a list of options.

       --junk, -j fname
                 Add the mail in folder fname to the dictionary as junk mail.
                 These folders may be compressed by a utility the host system
                 can uncompress; specify the complete file name including the
                 extension denoting its form of compression.  If fname is
                 ``-'' the mail folder is read from standard input.

       --list    List the dictionary on standard output.

       --mail, -m fname
                 Add the mail in folder fname to the dictionary as legitimate
                 mail.  These folders may be compressed by a utility the host
                 system can uncompress; specify the complete file name
                 including the extension denoting its form of compression.  If
                 fname is ``-'' the mail folder is read from standard input.

       --newword n
                 The probability that a word seen in mail which does not
                 appear in the dictionary (or appeared too few times to assign
                 it a probability with acceptable confidence) is indicative of
                 junk is set to n.  The default is 0.2--the odds are that
                 novel words are more likely to appear in legitimate mail than
                 in junk.

       --pdiag fname
                 Write a diagnostic file to the specified fname containing the
                 actual lines the parser processed (after decoding of MIME
                 parts and exclusion of data deemed unparseable).  Use this
                 option when you suspect problems in decoding or pre-parser
                 filtering.

       --phraselimit n
                 Limit the length of phrases assembled according to the
                 --phrasemin and --phrasemax options to n characters.  This
                 permits ignoring ``phrases'' consisting of gibberish from
                 mail headers and un-decoded content.  In most cases these
                 items will be discarded by a --prune in any case, but
                 skipping them as they are generated keeps the dictionary from
                 bloating in the first place.  The default value is 48
                 characters.

       --phrasemin n
                 Calculate probabilities of phrases consisting of a minimum of
                 n words.  The default of 1 calculates probabilities for
                 single words.

       --phrasemax n
                 Calculate probabilities of phrases consisting of a maximum of
                 n words.  The default of 1 calculates probabilities for
                 single words.  If you set this too large, the dictionary may
                 grow to an absurd size.

       --plot fname
                 After loading the dictionary, create a plot in fname.png of
                 the histogram of words, binned by their probability of
                 appearance in junk mail.  In order to generate the histogram
                 the gnuplot and netpbm utilities must be installed on the
                 system; if they are absent, the --plot option will not be
                 available.

       --pop3port n
                 The POP3 proxy server activated by a subsequent --pop3server
                 option will listen for connections on port n.  If no
                 --pop3port is specified, the server will listen on the
                 default port of 9110.  On most systems, you'll have to run
                 the program as root if you wish the proxy server to listen on
                 a port numbered 1023 or less.

       --pop3server server[:port]
                 Activate a POP3 proxy server which relays requests made on
                 the previously specified --pop3port or the default of 9110 if
                 no port is specified, to the specified server, which may be
                 given either as an IP address in ``dotted quad'' notation such
                 as 10.89.11.131 or a fully-qualified domain name like
                 pop.someisp.tld.  The port on which the server listens for
                 POP3 connections may be specified after the server prefixed
                 by a colon (``:''); if no port is specified, the IANA
                 assigned POP3 port 110 will be used. The POP3 proxy server
                 will pass each message received on behalf of a requestor
                 through the classifier and return the annotated transcript to
                 the requestor, who may then filter it based on the
                 classification appended to the message header. You must load
                 a dictionary before activating the POP3 proxy server, and the
                 --pop3server option must be the last on the command line.
                 The server continues to run and service requests until
                 manually terminated.
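
                 The ordering constraints above can be captured in a small
                 wrapper.  This is a sketch under assumptions: the function
                 name start_pop3_proxy is hypothetical, and the server name
                 in the usage line is the placeholder used earlier in this
                 entry.

```shell
# Hypothetical helper: start the POP3 proxy in the foreground.
#   $1  dictionary written by --write
#   $2  upstream POP3 server, optionally followed by :port
#   $3  local port to listen on (default 9110)
start_pop3_proxy() {
    # The dictionary is loaded first, --pop3port precedes the server
    # it applies to, and --pop3server is the last option.
    annoyance-filter --read "$1" --pop3port "${3:-9110}" --pop3server "$2"
}
```

                 For example: start_pop3_proxy dict.bin pop.someisp.tld:110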

       --pop3trace
                 Write a trace of POP3 proxy server operations to standard
                 error.  Each trace message (apart from the dump of the body
                 of multi-line replies to clients) is prefixed with the label
                 ``POP3: ''.

       --prune   After loading the dictionary from --mail and --junk folders,
                 this option discards words which appear sufficiently
                 infrequently that their probability cannot be reliably
                 estimated.  The dictionary is usually pruned before using
                 --write to save it for subsequent runs.

       --ptrace  Include a token-by-token trace in the --pdiag output file.
                 This helps when adjusting the parser's criteria for
                 recognising tokens.  Setting this option without also
                 specifying a --pdiag file will have no effect other than
                 perhaps to exercise your fingers typing it on the command
                 line.

       --read, -r fname
                 Load a dictionary (previously created with the --write
                 option) from file fname.

       --sigwords n
                 The probability that a message is junk will be computed based
                 on the individual probabilities of the n words with extremal
                 probabilities; that is, probabilities most indicative of junk
                 or mail.  The default is 15, but there's no obvious optimal
                 setting for this parameter; it depends in part on the average
                 length of messages you receive.

       --sloppyheaders
                 To evade filtering programs, some junk mail is sent with MIME
                 part headers which violate the standard but which most mail
                 clients accept anyway.  This option causes such messages to
                 be parsed as a browser would, at the cost of standards
                 compliance.  If --sloppyheaders is used, it should be
                 specified both when building the dictionary and when testing
                 messages.

       --statistics
                 After loading the dictionary from --mail and --junk folders,
                 print statistics of the distribution of junk probabilities of
                 words in the dictionary.  The statistics are written to
                 standard output.

       --test, -t fname
                 Test mail in fname and write the estimated probability it is
                 junk to standard output unless the --transcript option is
                 also specified with standard output (``-'') as the
                 destination, in which case the inclusion of the probability
                 and classification in the transcript is adjudged sufficient.
                 If the --verbose option is specified, the individual
                 probabilities of the ``most interesting'' words in the
                 message will also be output.  If fname is ``-'' the message
                 is read from standard input.

       --threshjunk n
                 Set the threshold for classifying a message as junk to the
                 floating point probability value n.  The default threshold is
                 0.9; messages scored above --threshjunk are deemed junk.

       --threshmail n
                 Set the threshold for classifying a message as legitimate
                 mail to the floating point probability value n.  The default
                 threshold is 0.9, with messages scored below --threshmail
                 deemed legitimate.  Note that you may leave a gap between the
                 --threshmail and --threshjunk values (although it makes no
                 sense to set --threshmail higher).  Mail scored between the
                 two thresholds will then be judged of uncertain status.
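
                 The interaction of the two thresholds can be sketched as a
                 three-way comparison.  The function below is a hypothetical
                 illustration that follows the comparisons described under
                 --classify; it is not taken from the program itself, and
                 the order in which the tests are applied is an assumption.

```shell
# Hypothetical helper mirroring the three-way decision described for
# --classify.  $1 is the junk probability; $2 and $3 are the mail and
# junk thresholds (both default to 0.9 per the option descriptions).
verdict() {
    awk -v p="$1" -v mail="${2:-0.9}" -v junk="${3:-0.9}" 'BEGIN {
        # Assumption: the junk test is applied first, following the
        # order in which --classify describes the comparisons.
        if (p + 0 >= junk + 0)      print "JUNK"
        else if (p + 0 <= mail + 0) print "MAIL"
        else                        print "INDT"
    }'
}
```

                 With thresholds of 0.2 and 0.8, for instance, a message
                 scored 0.5 falls in the gap: verdict 0.5 0.2 0.8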

       --transcript fname
                 Write an annotated transcript of the original message to the
                 specified fname.  If fname is ``-'', the transcript is
                 written to standard output.  Two items are added at the end
                 of the message header: an
                 X-Annoyance-Filter-Junk-Probability item giving the computed
                 probability, and an X-Annoyance-Filter-Classification item
                 giving the classification of the message according to the
                 --threshmail and --threshjunk settings.  The classification
                 is given as ``Mail'', ``Junk'', or ``Indeterminate''.

       --verbose, -v
                 Print diagnostic information as the program performs various
                 operations.

       --version Print program version information.

       --write fname
                 Write a dictionary to the file fname.  The dictionary is
                 written in a binary format which may be loaded on subsequent
                 runs with the --read option.  Binary dictionary files are
                 portable among machines with different architectures and byte
                 order.

EXIT STATUS
       The program exits with a status of 0 when processing is successfully
       completed, 1 when an error (I/O or file access in most cases) occurs,
       and 2 to indicate a command line syntax error.  If the --classify
       option is specified, an exit status of 0 identifies the message tested
       as legitimate mail, 3 marks it as junk, and a status of 4 is returned
       for messages which cannot be confidently classified as either mail or
       junk.

FILES
       Files are read or written as requested by options on the command line;
       all options which read or write files take a fname argument which gives
       the file name.  The --classify, --junk, --mail, --test, and
       --transcript options interpret an argument of ``-'' as denoting
       standard input or output.

       On systems which provide the required services and utilities, arguments
       to the --junk and --mail options may be compressed files or the name of
       a directory containing one or more messages which will be read as if
       logically concatenated.  Messages in the directory may be compressed or
       uncompressed.

       Error messages and diagnostic output generated when the --verbose
       option is specified are written to standard error.

BUGS
       Millions, doubtless.  This is a program which must cope with whatever
       garbage is fed to it from mail folders, trying to make the best of it.
       When it messes up, your efforts in identifying the message which caused
       the problem and submitting a verbatim copy of it with your bug report
       are much appreciated.

       Please report bugs to bugs@fourmilab.ch and include annoyance-filter in
       the Subject line.  Thanks in advance.

AUTHOR
                                     John Walker
                              http://www.fourmilab.ch/

       This software is in the public domain. Permission to use, copy, modify,
       and distribute this software and its documentation for any purpose and
       without fee is hereby granted, without any conditions or restrictions.
       This software is provided ``as is'' without express or implied
       warranty.

SEE ALSO
       gnuplot(1), gs(1), gzip(1), netpbm(1), procmail(1), xpdf(1)

       annoyance-filter is written using the Literate Programming methodology
       (http://www.literateprogramming.com/); the user manual, program, and
       internal documentation are developed together, closely
       interlinked.  Whenever the program is modified, the documentation is
       automatically updated, reducing the risk of divergence between what the
       manual says and what the program does.

       This man page is intended as a reference for the command line options
       and most common applications of the program.  For comprehensive
       documentation, including details of how to integrate annoyance-filter
       with the procmail mail processing system, please refer to the complete
       documentation published in PDF format, available on the Web at:
            http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf

       If you have downloaded the annoyance-filter source distribution, the
       corresponding version of annoyance-filter.pdf is included in the
       archive.  You can read PDF files with Acrobat Reader (a free download
       from http://www.adobe.com/acrobat/readstep.html) or the xpdf or
       Ghostscript (gs) utilities.

4th Berkeley Distribution         4 AUG 2004               ANNOYANCE-FILTER(1)