Re: Compression and dictionaries

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: linux@horizon.com
To: jonsmirl@gmail.com
Cc: git@vger.kernel.org, linux@horizon.com
Subject: Re: Compression and dictionaries
Date: 15 Aug 2006 20:37:12 -0400	[thread overview]
Message-ID: <20060816003712.32000.qmail@science.horizon.com> (raw)
In-Reply-To: <9e4733910608150755q54757386n13c705b0043e8308@mail.gmail.com>

> I explained our situation to Mark Adler, zlib author, and he recommend
> checking out this book for techniques that can be used.
> 
> http://www.cs.mu.oz.au/mg/

I was about to suggest the same thing!  It's an excellent book.

It's about a piece of software and the choices made there, but it
explains in detail many alternatives and why they weren't chosen for
the mg software, but what they might be useful for.

The mg software itself is designed for human-language text, and does
indexing as well.  So it does a fixed breakdown into alternate word and
non-word tokens, builds a lexicon with frequency tables, then uses the
frequencies to build Huffman trees, and Huffman-compresses each token.
The "word" dictionary is also used as part of the search index, as each
word has a (very cleverly compressed to less than one byte on average)
pointer to each document it appears in with a count of the number of
appearances.

The word encoding is straight zero-order Huffman, so inter-word
correlations are not used.

For software, with lots of punctuation and multi-word idioms
("for (i = 0; i < n; i++) {"), the basic design is not very well suited.
(To say nothing of binary data, like some people who want to
check images or multi-megabyte video files into git.)

But there is a good description of many possible algorithms, with
lots of emphasis on practicalities.

And the software *does* support dynamic collections, via auxiliary indexes
and escape codes for any new words not found in the main dictionary.
In fact, generating the Huffman tree from as little as 1/8 of the material
to be compressed only loses you 5.7% of the compression.  (Table 9011,
p. 360.)

However, that is for English text, with a pretty Zipf-like distribution.
In code, we generate new words (new function names) frequently, and
proceed to use them heavily.

It's worth noting the similarity between generating a good base dictionary
with finding a good base version of a file for delta-encoding.
You may end up having to divide the documents into classes (different
source languages for example - C vs. asm vs. perl vs. python vs.
docs vs. GIFs), and use a different dictionary per class.  But that
drives up the cost of compression (you have to see which class the
file to be compressed falls into).

Anyway, highly recommended.

next prev parent reply	other threads:[~2006-08-16  0:37 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-15  8:33 Compression and dictionaries linux
2006-08-15 13:29 ` Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux [this message]
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux
  -- strict thread matches above, loose matches on Subject: below --
2006-08-14  3:37 Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060816003712.32000.qmail@science.horizon.com \
    --to=linux@horizon.com \
    --cc=git@vger.kernel.org \
    --cc=jonsmirl@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).