rwsplit(1) SiLK Tool Suite rwsplit(1)
NAME
rwsplit - Divide a SiLK file into a (sampled) collection of subfiles
SYNOPSIS
rwsplit --basename=BASENAME
{ --ip-limit=LIMIT | --flow-limit=LIMIT
| --packet-limit=LIMIT | --byte-limit=LIMIT }
[--seed=NUMBER] [--sample-ratio=SAMPLE_RATIO]
[--file-ratio=FILE_RATIO] [--max-outputs=MAX_OUTPUTS]
[--note-add=TEXT] [--note-file-add=FILE]
[--compression-method=COMP_METHOD]
[--print-filenames] [--site-config-file=FILENAME]
[--xargs[=FILE] | FILE [FILES...]]
rwsplit --help
rwsplit --version
DESCRIPTION
rwsplit reads SiLK Flow records from the standard input or from files
named on the command line and writes the flows into a set of subfiles
based on the splitting criterion. In its simplest form, rwsplit
partitions the file, meaning that each input flow will appear in one
(and only one) of the subfiles.
In addition to splitting the file, rwsplit can generate files
containing sample flows. Sampling is specified by using the
--sample-ratio and --file-ratio switches.
rwsplit reads SiLK Flow records from the files named on the command
line or from the standard input when no file names are specified and
--xargs is not present. To read the standard input in addition to the
named files, use "-" or "stdin" as a file name. If an input file name
ends in ".gz", the file will be uncompressed as it is read. When the
--xargs switch is provided, rwsplit will read the names of the files to
process from the named text file, or from the standard input if no file
name argument is provided to the switch. The input to --xargs must
contain one file name per line.
If you wish to use the size of the output files as the splitting
criterion, use the --flow-limit switch. The parameter to this switch
should be the size of the desired output files divided by the record
size. The record size can be determined by rwfileinfo(1). When the
output files are compressed (see the description of
--compression-method below), you should assume about a 50% compression
ratio.
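As a rough illustration of the arithmetic above, the following shell sketch derives a --flow-limit value from a target file size. The target size (10 MiB) and the record size (52 bytes) are assumed values chosen only for the example; take the real record size from rwfileinfo(1). The sketch ignores the file header, and it reads the 50% estimate as "compressed output is about half the uncompressed size", which is an interpretation rather than a documented guarantee.

```shell
# Assumed values for illustration only.
target_bytes=$((10 * 1024 * 1024))   # desired size of each output file: 10 MiB
record_size=52                       # hypothetical bytes per record

# Uncompressed output: records per file = file size / record size.
flow_limit=$((target_bytes / record_size))

# Compressed output: if compression halves the size, about twice as many
# records fit in a file of the same size.
flow_limit_compressed=$((flow_limit * 2))

# Print the command lines instead of running them.
echo "rwsplit --basename=result --flow-limit=$flow_limit source.rwf"
echo "rwsplit --basename=result --flow-limit=$flow_limit_compressed source.rwf"
```

Actual compressed sizes depend on the data and on the compression method, so treat the second figure as a starting point and adjust it after inspecting a few output files.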
OPTIONS
Option names may be abbreviated if the abbreviation is unique or is an
exact match for an option. A parameter to an option may be specified
as --arg=param or --arg param, though the first form is required for
options that take optional parameters.
The splitting criterion is defined using one of the limit specifiers;
one and only one must be specified. They are:
--ip-limit=LIMIT
Close the current subfile and begin a new subfile when the count of
unique source and destination IPs in the current subfile meets or
exceeds LIMIT. The next-hop-IP does not count toward LIMIT.
--flow-limit=LIMIT
Close the current subfile and begin a new subfile when the number
of SiLK Flow records in the current subfile meets LIMIT.
--packet-limit=LIMIT
Close the current subfile and begin a new subfile when the sum of
the packet counts across all SiLK Flow records in the current
subfile meets or exceeds LIMIT.
--byte-limit=LIMIT
Close the current subfile and begin a new subfile when the sum of
the byte counts across all SiLK Flow records in the current subfile
meets or exceeds LIMIT. This switch does not specify the size of
the subfiles.
The other switches are:
--basename=BASENAME
Specifies the basename of the output files; this switch is
required. The flows are written sequentially to a set of subfiles
whose names follow the format BASENAME.ORDER.rwf, where ORDER is an
8-digit zero-formatted sequence number (i.e., 00000000, 00000001,
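and so on). The sequence number will begin at zero and increase by
one for every file name generated. When --file-ratio is specified,
only one of every FILE_RATIO generated names is written to disk, so
the sequence numbers of the files on disk will not be consecutive.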
--seed=NUMBER
Use NUMBER to seed the pseudo-random number generator for the
--sample-ratio or --file-ratio switch. This can be used to put the
random number generator into a known state, which is useful for
testing.
--sample-ratio=SAMPLE_RATIO
Writes one flow record, chosen at random, from every SAMPLE_RATIO
flows that are read.
--file-ratio=FILE_RATIO
Picks one subfile, chosen at random, out of every FILE_RATIO
names generated, for writing to disk.
--max-outputs=MAX_OUTPUTS
Limits the number of files that are written to disk to MAX_OUTPUTS.
--note-add=TEXT
Add the specified TEXT to the header of the output file as an
annotation. This switch may be repeated to add multiple
annotations to a file. To view the annotations, use the
rwfileinfo(1) tool.
--note-file-add=FILENAME
Open FILENAME and add the contents of that file to the header of
the output file as an annotation. This switch may be repeated to
add multiple annotations. Currently the application makes no
effort to ensure that FILENAME contains text; be careful that you
do not attempt to add a SiLK data file as an annotation.
--compression-method=COMP_METHOD
Specify how to compress the output. When this switch is not given,
the output files are compressed using the default chosen when SiLK
was compiled. The valid values for COMP_METHOD are determined by
which external libraries were found when SiLK was compiled. To see
the available compression methods and the default method, use the
--help or --version switch. SiLK can support the following
COMP_METHOD values when the required libraries are available.
none
Do not compress the output using an external library.
zlib
Use the zlib(3) library for compressing the output. Using zlib
produces the smallest output files at the cost of speed.
lzo1x
Use the lzo1x algorithm from the LZO real time compression
library for compression. This method provides good compression
with less memory and CPU overhead than zlib.
best
Use lzo1x if available, otherwise use zlib.
--print-filenames
Print to the standard error the names of input files as they are
opened.
--site-config-file=FILENAME
Read the SiLK site configuration from the named file FILENAME.
When this switch is not provided, rwsplit searches for the site
configuration file in the locations specified in the "FILES"
section.
--xargs
--xargs=FILENAME
Causes rwsplit to read file names from FILENAME or from the
standard input if FILENAME is not provided. The input should have
one file name per line. rwsplit will open each file in turn and
read records from it, as if the files had been listed on the
command line.
--help
Print the available options and exit.
--version
Print the version number and information about how SiLK was
configured, then exit the application.
EXAMPLES
In the following examples, the dollar sign ("$") represents the shell
prompt. The text after the dollar sign represents the command line.
Lines have been wrapped for improved readability, and the backslash
("\") is used to indicate a wrapped line.
Assume a source file source.rwf; to split that file into files that
each contain about 100 unique IP addresses:
$ rwsplit --basename=result --ip-limit=100 source.rwf
To split source.rwf into files that each contain 100 flows:
$ rwsplit --basename=result --flow-limit=100 source.rwf
The following causes rwsplit to sample 1 out of every 10 records from
source.rwf; i.e., rwsplit will read 1000 flow records to produce each
subfile:
$ rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf
When --file-ratio is specified, the file names are generated as usual
(e.g., base-00000000, base-00000001, ...); however, one of these names
will be chosen randomly from each set of FILE_RATIO candidates, and
only that file will be written to disk.
$ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
$ ls
result-00000002.rwf
result-00000008.rwf
result-00000013.rwf
result-00000016.rwf
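The selection shown in the listing above can be modeled with a short sketch. This is an assumed model, not rwsplit's actual code: it walks the candidate sequence numbers in groups of FILE_RATIO and keeps one random member of each group. The variable names, the seed value, and the use of awk's random number generator are all choices made for the illustration.

```shell
file_ratio=5      # models --file-ratio=5
names=20          # number of candidate sequence numbers (a multiple of file_ratio)
seed=1            # arbitrary seed so that a run can be repeated

# Walk the candidates in groups of file_ratio and keep one random member
# of each group, printed as an 8-digit zero-padded sequence number.
chosen=$(awk -v r="$file_ratio" -v n="$names" -v s="$seed" 'BEGIN {
    srand(s)
    for (start = 0; start < n; start += r) {
        pick = start + int(rand() * r)
        printf "%08d\n", pick
    }
}')

echo "$chosen"
```

Each printed number falls in a different group of five (0 to 4, 5 to 9, and so on), which matches the pattern in the listing: the files on disk have gaps in their sequence numbers.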
LIMITATIONS
rwsplit accepts exactly one partitioning switch per invocation.
Partitioning is not exact: rwsplit keeps appending flow records to a
file until the file meets or exceeds the specified LIMIT. For example,
if you specify --ip-limit=100, rwsplit fills the file until it holds
at least 100 IP addresses; if the file has 99 addresses and a new
record with 2 previously unseen addresses is received, rwsplit puts
that record in the current file, resulting in a 101-address file.
Similarly, if you specify --byte-limit=2000 and rwsplit receives a
10-kilobyte flow record, that flow record is placed in the current
subfile.
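The overshoot described above can be sketched as a small shell loop. This models the stated behavior under --byte-limit=2000 and is not rwsplit's implementation; the per-record byte counts are made up for the example.

```shell
limit=2000      # models --byte-limit=2000
file=0          # sequence number of the current subfile
records=0       # records in the current subfile
total=0         # sum of byte counts in the current subfile
summary=""

# Hypothetical byte counts of five consecutive flow records.
for bytes in 500 900 10240 300 1800; do
    # The record always goes into the current subfile first...
    total=$((total + bytes))
    records=$((records + 1))
    # ...and only then is the running sum compared with the limit.
    if [ "$total" -ge "$limit" ]; then
        line="subfile $file: $records records, $total bytes"
        echo "$line"
        summary="$summary$line;"
        file=$((file + 1))
        records=0
        total=0
    fi
done

# Any records left over would form a final, smaller subfile.
if [ "$records" -gt 0 ]; then
    echo "subfile $file: $records records, $total bytes"
fi
```

The first subfile closes with 11640 bytes, far above the limit of 2000, because the 10240-byte record was added before the limit was checked.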
The switches --sample-ratio, --file-ratio, and --max-outputs are
processed in that order. So, when you specify
$ rwsplit --basename=result --sample-ratio=10 --ip-limit=100 \
--file-ratio=10 --max-outputs=20 source.rwf
rwsplit will pick 1 out of every 10 flow records, write those records
to a file until it has 100 IPs per file, pick 1 out of every 10 files
to write, and write up to 20 files. If there are 1000 records, each
with 2 unique IPs, then rwsplit will write at most 1 file (the sampled
records hold 200 unique IP addresses, enough for 2 candidate files, but
rwsplit may not pick either of those files to write).
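The arithmetic behind that upper bound can be checked with a few lines of shell. The sketch assumes that no IP address repeats across records and that sampling keeps exactly one record in ten; the variable names are chosen for the illustration.

```shell
records=1000
ips_per_record=2
sample_ratio=10
ip_limit=100
file_ratio=10
max_outputs=20

sampled=$((records / sample_ratio))         # records that survive sampling
unique_ips=$((sampled * ips_per_record))    # unique addresses among them
candidates=$((unique_ips / ip_limit))       # candidate subfiles that get filled

# At most one name is written per group of file_ratio names (round up).
written_max=$(( (candidates + file_ratio - 1) / file_ratio ))

# --max-outputs caps the result.
if [ "$written_max" -gt "$max_outputs" ]; then
    written_max=$max_outputs
fi

echo "sampled=$sampled unique_ips=$unique_ips candidates=$candidates written_max=$written_max"
```

The result is an upper bound only: with 2 candidates in a group of 10 names, the randomly chosen name may be one for which no subfile was ever filled, in which case nothing is written.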
ENVIRONMENT
SILK_CLOBBER
The SiLK tools normally refuse to overwrite existing files.
Setting SILK_CLOBBER to a non-empty value removes this restriction.
SILK_CONFIG_FILE
This environment variable is used as the value for the
--site-config-file switch when that switch is not provided.
SILK_DATA_ROOTDIR
This environment variable specifies the root directory of the data
repository. As described in the "FILES" section, rwsplit may use
this environment variable when searching for the SiLK site
configuration file.
SILK_PATH
This environment variable gives the root of the install tree. When
searching for configuration files, rwsplit may use this environment
variable. See the "FILES" section for details.
FILES
${SILK_CONFIG_FILE}
${SILK_DATA_ROOTDIR}/silk.conf
/data/silk.conf
${SILK_PATH}/share/silk/silk.conf
${SILK_PATH}/share/silk.conf
/usr/local/share/silk/silk.conf
/usr/local/share/silk.conf
Possible locations for the SiLK site configuration file which are
checked when the --site-config-file switch is not provided.
SEE ALSO
rwfileinfo(1), silk(7), zlib(3)
SiLK 3.11.0.1 2016-02-19 rwsplit(1)