DragonFly On-Line Manual Pages
CRAWL(1) DragonFly General Commands Manual CRAWL(1)
NAME
crawl - a small and efficient HTTP crawler
SYNOPSIS
crawl [-v level] [-u urlincl] [-e urlexcl] [-i imgincl] [-I imgexcl]
[-d imgdir] [-m depth] [-c state] [-t timeout] [-A agent] [-R]
[-E external] [url ...]
DESCRIPTION
The crawl utility starts a depth-first traversal of the web at the
specified URLs. It stores all JPEG images that match the configured
constraints.
The options are as follows:
-v level The verbosity level of crawl when printing information
about URL processing. The default is 1.
-u urlincl A regex(3) expression that a URL has to match in order to
be included in the traversal.
-e urlexcl A regex(3) expression that determines which URLs will be
excluded from the traversal.
-i imgincl A regex(3) expression that all image URLs have to match in
order to be stored on disk.
-I imgexcl A regex(3) expression that determines the images that will
not be stored.
-d imgdir Specifies the directory under which the images will be
stored.
-m depth Specifies the maximum depth of the traversal. A 0 means
that only the URLs specified on the command line will be
retrieved. A -1 stands for unlimited traversal and should be
used with caution.
-c state Continues a traversal that was interrupted previously. The
remaining URLs will be read from the file state.
-t timeout Specifies the time in seconds that needs to pass between
successive accesses of a single host. The parameter is a
floating point number. The default is five seconds.
-A agent Specifies the agent string that will be included in all HTTP
requests.
-R Specifies that the crawler should ignore the robots.txt
file.
-E external Specifies an external filter program that can refine which
URLs are to be included in the traversal. The filter
program reads URLs on stdin and outputs a single
character on stdout for each of them. An output of `y'
indicates that the URL may be included; `n' means that the
URL should be excluded.
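The following is a minimal sketch of such a filter written in C. The
substring it matches on ("/gallery/") is purely illustrative, and since
the exact reply format expected by crawl is not specified here, the
sketch writes only the single answer character and flushes it
immediately.

     #include <stdio.h>
     #include <string.h>

     /*
      * Illustrative external filter for the -E option.  It reads one
      * URL per line on stdin and answers with a single character on
      * stdout: `y' to include the URL, `n' to exclude it.  Here, only
      * URLs containing the arbitrary substring "/gallery/" are kept.
      */
     int
     main(void)
     {
             char url[8192];

             while (fgets(url, sizeof(url), stdin) != NULL) {
                     /* Strip the trailing newline, if any. */
                     url[strcspn(url, "\r\n")] = '\0';

                     putchar(strstr(url, "/gallery/") != NULL ? 'y' : 'n');

                     /* Answer immediately; crawl waits for the reply. */
                     fflush(stdout);
             }
             return 0;
     }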
The source code for existing web crawlers tends to be very complicated.
crawl has a very simple design and straightforward source code.
A configuration file can be used instead of the command line arguments.
The configuration file specifies the MIME type of the objects to be
downloaded; to retrieve objects other than images, the MIME type needs
to be adjusted accordingly. For more information, see crawl.conf.
EXAMPLES
crawl -m 0 http://www.w3.org/
Searches for images in the index page of the World Wide Web Consortium
without following any other links.
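The following invocation combines several of the options described
above; the directory, pattern, and URL are placeholders:

crawl -m 2 -t 10 -d /var/tmp/images -i '\.jpg$' http://www.example.com/
Traverses http://www.example.com/ to a depth of two links, waits ten
seconds between successive requests to the same host, and stores images
whose URLs end in .jpg under /var/tmp/images.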
ACKNOWLEDGEMENTS
This product includes software developed by Ericsson Radio Systems.
This product includes software developed by the University of California,
Berkeley and its contributors.
AUTHORS
The crawl utility was developed by Niels Provos.
DragonFly 6.5-DEVELOPMENT May 29, 2001 DragonFly 6.5-DEVELOPMENT