BZIP(1) DragonFly General Commands Manual BZIP(1)
NAME
bzip, bunzip - a block-sorting file compressor, v0.21
SYNOPSIS
bzip [ -cdfkvVL123456789 ] [ filenames ... ]
bunzip [ -kvVL ] [ filenames ... ]
DESCRIPTION
Bzip compresses files using the Burrows-Wheeler-Fenwick block-sorting
text compression algorithm. Compression is generally considerably
better than that achieved by more conventional LZ77/LZ78-based
compressors, and competitive with all but the best of the PPM family of
statistical compressors.
The command-line options are deliberately very similar to those of GNU
Gzip, but they are not identical.
Bzip expects a list of file names to follow the command-line flags.
Each file is replaced by a compressed version of itself, with the name
"original_name.bz". Each compressed file has the same modification
date and permissions as the corresponding original, so that these
properties can be correctly restored at decompression time. File name
handling is naive in the sense that there is no mechanism for
preserving original file names, permissions and dates in filesystems
which lack these concepts, or have serious file name length
restrictions, such as MS-DOS.
Bzip and bunzip will not overwrite existing files; if you want this to
happen, you should delete them first.
If no file names are specified, bzip compresses from standard input to
standard output. In this case, bzip will decline to write compressed
output to a terminal, as this would be entirely incomprehensible and
therefore pointless.
Bunzip (or bzip -d) decompresses and restores all specified files
whose names end in ".bz". Files without this suffix are ignored.
Again, supplying no filenames causes decompression from standard input
to standard output.
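The naming convention described above can be sketched in Python. This
is an illustration of the ".bz" suffix rule only, not code from bzip;
the helper names compressed_name and restored_name are hypothetical.

```python
from typing import Optional

SUFFIX = ".bz"  # suffix named in this manual page


def compressed_name(original: str) -> str:
    """Name of the compressed file that replaces `original`."""
    return original + SUFFIX


def restored_name(name: str) -> Optional[str]:
    """Name restored on decompression, or None if the file is ignored."""
    if name.endswith(SUFFIX):
        return name[: -len(SUFFIX)]
    return None  # no ".bz" suffix: the file is skipped


print(compressed_name("notes.txt"))    # notes.txt.bz
print(restored_name("notes.txt.bz"))   # notes.txt
print(restored_name("notes.txt"))      # None
```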
You can also compress or decompress exactly one named file to the
standard output by giving the -c flag.
Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is for
files of zero length, which expand to seventeen bytes. Random data
(including the output of most file compressors) is coded at about 8.1
bits per byte, giving an expansion of around 1%.
As a self-check for your protection, bzip uses 32-bit CRCs to make sure
that the decompressed version of a file is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in bzip (hopefully very unlikely). The chance of data
corruption going undetected is microscopic, about one in four billion
for each file processed. Be aware, though, that the check occurs upon
decompression, so it can only tell you that something is wrong. It
can't help you recover the original uncompressed data.
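The idea of the self-check can be sketched as follows. This manual page
does not say which 32-bit CRC bzip computes or where it is stored, so
zlib.crc32 is used here purely as a stand-in, and the names crc32_of
and passes_check are hypothetical.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """32-bit CRC of data (zlib's CRC-32, a stand-in for bzip's own)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def passes_check(decompressed: bytes, stored_crc: int) -> bool:
    """True if the decompressed bytes match the CRC recorded earlier."""
    return crc32_of(decompressed) == stored_crc


original = b"a block-sorting file compressor"
stored_crc = crc32_of(original)      # recorded at compression time

damaged = bytearray(original)
damaged[0] ^= 0x01                   # flip a single bit
damaged = bytes(damaged)

print(passes_check(original, stored_crc))   # True
print(passes_check(damaged, stored_crc))    # False
print(2 ** 32)                              # 4294967296 possible CRC values
```

A 32-bit check has 2^32 possible values, which is where the figure of
about one in four billion comes from. As the text says, a failed check
reports the damage but cannot repair it.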
Return values: 1 for an abnormal exit, otherwise 0.
MEMORY MANAGEMENT
Bzip compresses large files in blocks. The block size affects both the
compression ratio achieved and the amount of memory needed for
compression and decompression. The flags -1 through -9 specify the
block size to be 100,000 bytes through 900,000 bytes (the default)
respectively. At decompression-time, the block size used for
compression is read from the header of the compressed file, and bunzip
then allocates itself just enough memory to decompress the file. Since
block sizes are stored in compressed files, it follows that the flags
-1 to -9 are irrelevant to and so ignored during decompression.
Compression and decompression requirements, in bytes, can be estimated
as:
      Compression:   300k + ( 8 x block size )
      Decompression: 6 x block size
The 300k constant is for a frequency-count table, used in the sorting
phase of compression.
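The two estimates can be evaluated for any flag with a few lines of
Python. This is a sketch of the formulas as stated above, not a
measurement. It assumes that "k" means 1000 bytes (since -1 is given as
100,000 bytes), and the function names are hypothetical.

```python
FREQ_TABLE_BYTES = 300_000   # the "300k" constant, assuming k = 1000


def block_size(flag: int) -> int:
    """Block size in bytes selected by the flags -1 through -9."""
    if not 1 <= flag <= 9:
        raise ValueError("flag must be between 1 and 9")
    return flag * 100_000


def compress_bytes(flag: int) -> int:
    """Estimated memory for compression: 300k + 8 x block size."""
    return FREQ_TABLE_BYTES + 8 * block_size(flag)


def decompress_bytes(flag: int) -> int:
    """Estimated memory for decompression: 6 x block size."""
    return 6 * block_size(flag)


print(compress_bytes(9))     # 7500000
print(decompress_bytes(9))   # 5400000
```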
Larger block sizes give rapidly diminishing marginal returns; most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using bzip on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression-time by the choice of block size.
So, for example, if you are compressing files which you think might
possibly be decompressed on a 4-megabyte machine, you might want to
select a block size of 200k or 300k, so the decompressor will draw 1200
kbytes or 1800 kbytes respectively, which is probably the limit of
what's comfortable on a 4-meg machine. In general, though, you should
try to use the largest block size memory constraints allow.
Compression and decompression speed is virtually unaffected by block
size.
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a
file 20,000 bytes long with the flag -9 will cause the compressor to
allocate [by the formula, in practice a little more] 7500k of memory,
but only touch 300k + 20000 * 8 = 460 kbytes of it. Similarly, the
decompressor will allocate 5400k but only touch 20000 * 6 = 120 kbytes.
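The arithmetic in this example can be reproduced with a short sketch.
It assumes, as the paragraph above implies, that the memory touched
depends on the smaller of the file size and the block size, and that
"k" means 1000 bytes. The function names are hypothetical.

```python
def bytes_in_use(file_bytes: int, flag: int) -> int:
    """Bytes of the block actually filled by a file of the given size."""
    return min(file_bytes, flag * 100_000)


def touched_compress(file_bytes: int, flag: int) -> int:
    """Memory touched while compressing: 300k + 8 x bytes in use."""
    return 300_000 + 8 * bytes_in_use(file_bytes, flag)


def touched_decompress(file_bytes: int, flag: int) -> int:
    """Memory touched while decompressing: 6 x bytes in use."""
    return 6 * bytes_in_use(file_bytes, flag)


# The 20,000-byte file compressed with -9 from the text above.
print(touched_compress(20_000, 9))     # 460000
print(touched_decompress(20_000, 9))   # 120000
```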
Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files
of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
              Compress   Decompress   Corpus
      Flag     usage       usage       Size

       -1      1100k        500k      905958
       -2      1900k       1000k      870646
       -3      2700k       1500k      853650
       -4      3500k       2000k      840140
       -5      4300k       2500k      838355
       -6      5100k       3000k      831695
       -7      5900k       3500k      827104
       -8      6700k       4000k      821652
       -9      7500k       4500k      821652
OPTIONS
-c Compress or decompress to standard output. -c requires you to
supply exactly one file name, and this file is compressed or
decompressed to standard out.
-d Force decompression. Bzip and bunzip are really the same
program, and the decision about whether to compress or
decompress is done on the basis of which name is used. This
flag overrides that mechanism, and forces bzip to decompress.
-f The complement to -d: forces compression, regardless of the
invocation name.
-k Keep (don't delete) input files during compression or
decompression.
-v Verbose mode -- show the compression ratio for each file
processed.
-V Be very verbose. This spews out lots of information during
compression which is primarily of interest for debugging
purposes.
-L Display the software license terms and conditions.
-1 to -9
Set the block size to 100 k, 200 k .. 900 k when compressing.
Has no effect when decompressing. See MEMORY MANAGEMENT above.
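The way -d and -f interact with the invocation name can be sketched as
below. This is an illustration only: how bzip actually inspects its
name is not described here, so the startswith test is an assumption,
as is the choice to let -d win if both flags are given. The function
name choose_mode is hypothetical.

```python
import os


def choose_mode(argv0: str, flags: str = "") -> str:
    """Return "compress" or "decompress" for an invocation.

    `flags` holds the option letters given, for example "dk".
    """
    if "d" in flags:      # -d forces decompression (assumed to win over -f)
        return "decompress"
    if "f" in flags:      # -f forces compression
        return "compress"
    # Otherwise decide from the name the program was invoked under.
    name = os.path.basename(argv0)
    return "decompress" if name.startswith("bunzip") else "compress"


print(choose_mode("bzip"))                    # compress
print(choose_mode("/usr/local/bin/bunzip"))   # decompress
print(choose_mode("bzip", "d"))               # decompress
print(choose_mode("bunzip", "f"))             # compress
```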
PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress extraordinarily slowly. You can use the -V option to monitor
progress in great detail, if you want. Decompression speed is
unaffected. Such pathological cases seem rare in practice.
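To see what "gathers together similar strings" means, here is the
textbook form of the block-sorting transform in Python. It is a toy,
not bzip's sorting code: it builds every rotation in memory, and the
NUL end marker is an assumption made so that the result is well
defined. The function name bwt is hypothetical.

```python
def bwt(text: str, end: str = "\0") -> str:
    """Naive block-sorting transform of `text`.

    Sorts all rotations of text + end marker and returns the last
    column. The marker must not occur in `text`.
    """
    if end in text:
        raise ValueError("end marker must not occur in the input")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(repr(bwt("banana")))   # 'annb\x00aa'
```

In a comparison-based sort like this one, two rotations of a highly
repetitive input share a long common prefix, so each comparison has to
scan a long way before it finds a difference. That is one way to
picture why such inputs sort slowly.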
Incompressible or virtually-incompressible data may decompress rather
more slowly than one would hope. This is due to naive implementation
of the move-to-front coder, and of the frequency tables for the
arithmetic coder.
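For reference, a deliberately naive move-to-front coder looks like the
sketch below. It is the textbook technique written in Python, not
bzip's implementation, and the function names are hypothetical. The
linear search and list shuffling cost time proportional to each
symbol's position, and data with no locality tends to produce large
positions.

```python
from typing import List


def mtf_encode(data: bytes) -> List[int]:
    """Replace each byte by its current position in a recency list."""
    table = list(range(256))
    out = []
    for byte in data:
        index = table.index(byte)   # linear search: the naive part
        out.append(index)
        table.pop(index)
        table.insert(0, byte)       # move the byte to the front
    return out


def mtf_decode(codes: List[int]) -> bytes:
    """Invert mtf_encode by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for index in codes:
        byte = table.pop(index)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)


print(mtf_encode(b"aaab"))                # [97, 0, 0, 98]
print(mtf_decode(mtf_encode(b"aaab")))    # b'aaab'
```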
Decompression on Sun Sparc 1's (and other low-range Sparcs) can be
slow, because of the lack of hardware implementations of integer
multiply and divide in the SPARC v7 instruction set. The situation is
much exacerbated if bzip is compiled for a full SPARC v8 instruction
set, since this causes the machine to trap on each multiply and divide
instruction. These traps take control to the relevant software
emulation of the offending instruction, but it is much quicker for the
compiler simply to plant a call to the emulation routine. Moral: be
careful how you compile bzip for a Sparc. If you use GNU C,
investigate the effects of the -msupersparc and -mcypress flags.
Wildcard expansion for Windows 95 and NT loses leading directory
information. For example, the pathspec "sources\*.c" is searched
correctly for matching files, but the "sources\" bit is ignored when
the files come to be processed, which means bzip won't be able to find
any of them. This is easy to fix; perhaps some enterprising soul will
send me a patch?
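The idea behind such a fix can be illustrated in Python. This is not a
patch for bzip. It assumes that wildcard expansion yields bare file
names and shows the directory part of the pathspec being joined back
onto each one. The function name rejoin_directory is hypothetical;
ntpath is the standard library's Windows path module and works on any
platform.

```python
import ntpath
from typing import List


def rejoin_directory(pathspec: str, matched_names: List[str]) -> List[str]:
    """Put the pathspec's leading directory back on each matched name."""
    directory, _pattern = ntpath.split(pathspec)
    return [ntpath.join(directory, name) for name in matched_names]


# Names as they might come back from expanding "sources\*.c".
print(rejoin_directory("sources\\*.c", ["a.c", "b.c"]))
```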
CAVEATS
I/O error messages are not as helpful as they could be. Bzip tries
hard to detect I/O errors and exit cleanly, but the details of what the
problem is sometimes seem rather misleading.
There is no -t option to test the integrity of a compressed file.
However, Unix folks can do the following:
      bzip -dcV file.bz > /dev/null
which causes bzip to do a trial decompression of file.bz, throwing away
the result. You'll be shown the computed and stored CRCs. If these
are identical, the file is almost certainly OK -- see the discussion
above on CRCs for a definition of "almost certainly". If they're not,
bzip will complain loudly. Note that file.bz is left unchanged
regardless of the outcome. Win95/NT folks can do the same, but
/dev/null will have to be replaced with something suitable, perhaps
NUL.
This manual page pertains to version 0.21 of bzip. It may well happen
that some future version will use a different compressed file format.
If you try to decompress, using 0.21, a .bz file created with some
future version which uses a different compressed file format, 0.21 will
complain that your file "is not a BZIP file". If that happens, you
should obtain a more recent version of bzip and use that to decompress
the file.
AUTHOR
Julian Seward, sewardj@cs.man.ac.uk.
The ideas embodied in bzip are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting
transformation), Peter Fenwick (for the structured coding model, and
many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder). I am much indebted for their help, support
and advice. See the file ALGORITHMS in the source distribution for
pointers to sources of documentation. Christian von Roques encouraged
me to look for faster sorting algorithms, so as to speed up
compression. Many people sent patches, helped with portability
problems, lent machines, gave advice and were generally helpful.
local BZIP(1)