DragonFly BSD
DragonFly kernel List (threaded) for 2013-06

[GSOC] HAMMER2 compression feature week1 report


From: Daniel Flores <daniel5555@xxxxxxxxx>
Date: Sat, 22 Jun 2013 21:14:06 +0200


Hello everyone,

Here is my report on the progress made during the first week of GSoC's
coding period. I'd be happy to receive your feedback and criticism.

It turned out that I really overestimated the effort needed for the
prototype application (and possibly underestimated the amount of time
needed for the work directly on HAMMER2). In any case, the prototype part
is done for now. The algorithm integrated into the prototype application is
LZ4, the algorithm suggested by Freddie.

This algorithm turned out to be very convenient to use, and I'm also very
satisfied with its performance. Using the prototype application I ran some
tests; in summary, the results are these:

1. Most of the files that we use, like images, documents, audio, etc., are
already compressed to some degree, so this algorithm can't really compress
them in most cases, since in our case it works only on small amounts of
data at a time. At best, perhaps 1-2 blocks per 100 are compressed enough
to save space.

2. Plain text files that contain natural-language text, like books or
mailing list archives, are generally not compressible at all with this
approach.

3. Some uncompressed files are very well compressible with this approach.
For example, some TIFF images that I tested had all of their blocks
compressed enough to save space, sometimes spectacularly, such as going
from 64KB to just 1404 bytes (a 2KB physical block in the actual file
system; see the code sketch after this list). That doesn't happen with all
files, though, and it also seems to never happen with uncompressed audio
(.wav).

4. Other types of file that compress really well with this approach are
source code files. For example, from DragonFly's own sources:
/usr/src/sbin/hammer/cmd_cleanup.c was compressed from 31502 bytes to 13977
bytes. Another example: /usr/src/sbin/md5/md5.c was compressed from 12950
bytes to 7867 bytes.

5. Because a certain interest in log files was expressed, I tested those
too. They are also very, very well compressed with this approach. I tested
some logs from my VPS, such as an access log and an error log, and all of
their blocks were compressed enough to save space, most of them to well
below 10000 bytes. I assume that compressing a whole file as one unit would
be even more efficient, but plain file-system compression is still very
beneficial.
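
To make point 3 concrete, here is a rough sketch of the per-block logic
described above: compress one 64KB logical block with LZ4 and round the
result up to the next power-of-2 physical block size. This is only an
illustration, not the prototype's exact code; it assumes the
LZ4_compress_default() and LZ4_compressBound() functions from Yann Collet's
lz4.h, and the constants and helper names are made up for the example.

#include <stddef.h>
#include "lz4.h"

#define LOGICAL_BLOCK   65536   /* 64KB logical block, as in the tests */
#define MIN_PHYS_BLOCK  1024    /* assumed smallest physical block size */

/* Round nbytes up to the next power-of-2 physical block size. */
static size_t
phys_block_size(size_t nbytes)
{
    size_t psize = MIN_PHYS_BLOCK;

    while (psize < nbytes)
        psize <<= 1;
    return (psize);
}

/*
 * Compress one logical block; dst must hold at least
 * LZ4_compressBound(LOGICAL_BLOCK) bytes.  Returns the physical block
 * size the data would occupy, or LOGICAL_BLOCK when compression does
 * not save a whole size class (the block is then stored uncompressed).
 */
static size_t
try_compress_block(const char *src, char *dst)
{
    int clen;

    clen = LZ4_compress_default(src, dst, LOGICAL_BLOCK,
        LZ4_compressBound(LOGICAL_BLOCK));
    if (clen <= 0 || phys_block_size((size_t)clen) >= LOGICAL_BLOCK)
        return (LOGICAL_BLOCK);
    return (phys_block_size((size_t)clen));
}

With the TIFF example above, clen = 1404 rounds up to a 2048-byte (2KB)
physical block.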

So, the conclusion is that for many of the file types we use, like images,
PDFs, and documents, this type of compression wouldn't make much
difference, even though some parts of those files can be compressed.
However, there is a huge difference for files like source code, logs, and
some uncompressed formats. Generally, any file that has an obvious pattern
or many repeated elements compresses well with this approach. Sadly, for
the same reason, it isn't possible to compress things like books or mailing
list archives with it.

So, right now I consider the prototype part done. I'll probably return to
it later to test other algorithms (DEFLATE and LZO), but for now I'll move
on to HAMMER2. The reason is that I'm very satisfied with the performance
of LZ4, and it's very unlikely that the other algorithms would outperform
it significantly. I'd like to implement at least one algorithm in HAMMER2;
once that is done, it will make more sense to consider other algorithms and
test them.

Right now I'm working on the hammer2 utility, implementing a new command to
set the compression mode on a specified directory. When this is done, I'll
implement LZ4 in HAMMER2 and run tests on real workloads.
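
Just to illustrate the intent (the syntax here is hypothetical, since the
command isn't written yet), the invocation might end up looking something
like this:

    hammer2 setcomp lz4 /mnt/mydir

i.e., the utility would record the desired compression mode on the
directory so that files written under it afterwards are compressed with
that algorithm.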

Then, if there is time, I'll work on DEFLATE and LZO and ultimately let the
user choose which one to use.

If you wish to check out my prototype application, you can get it from my
repository, branch 'prototype' [1]. Alternatively, you can download it from
my VPS [2], if you prefer it that way. The code is a bit rough, but
hopefully it is possible to understand it.
To compile the application, just run 'make'. Then, to perform a test on a
file, run './prototype filename'. All the code in the "lz4" directory was
written by Yann Collet, the author of the LZ4 implementation; I didn't
modify anything in it.

Then there is also "zero_blocks.c", an application that simply creates a
file with several blocks that contain only zeros. It is used to check
zero-block detection for algorithm #1 (zero-checking).
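
The zero-check itself is trivial; a minimal sketch of the idea (again, an
illustration rather than the prototype's exact code) looks like this:

#include <stddef.h>

/* Return 1 if the block contains only zero bytes, 0 otherwise. */
static int
block_is_zero(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        if (buf[i] != 0)
            return (0);
    }
    return (1);
}

A block that passes this check needs no storage at all, so it can be
recorded as a zero block instead of being written out.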

So, that's all I can report for now.

Thank you for your attention!


Daniel

[1] git://leaf.dragonflybsd.org/~iostream/dragonfly.git
[2] http://project5555.com/dragonflybsd/prototype.tar.gz



