git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
Date: Thu, 14 Feb 2019 11:02:52 +0100	[thread overview]
Message-ID: <87bm3ek7qr.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20190213120807.25326-1-pclouds@gmail.com>


On Wed, Feb 13 2019, Nguyễn Thái Ngọc Duy wrote:

> Index file size more or less translates to write time because we hash
> the entire file every time we update the index. And we update the index
> quite often (automatically index refresh is done everywhere). This means
> smaller index files are faster, especially true for very large
> worktrees.
>
> Index v4 attempts to reduce file size by "prefix compressing"
> paths. This reduces file size from 17% (git.git) to 41% (webkit.git,
> deep hierarchy).
>
> Index v5 takes the same idea to the next level. Instead of compressing
> just paths, based on the previous entry, we "compress" a lot more
> fields.
>
> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
> same most of the time. ctime should often be the same (or differs just
> slightly). And sometimes mtime is the same as well. st_ino is also
> always zero on Windows. We're storing a lot of duplicate values.
>
> Index v5 handles this

This looks really promising.

>  - by adding a "same mask" per entry. If st_dev is the same as previous
>    entry, for instance, we set "st_dev is the same" flag and will not
>    store it at all, saving 31 bits per entry.
>
>  - even when we store it, "varint" encoding is used. We should rarely
>    need to write out 4 bytes
>
>  - for ctime and mtime, even if we have to store it, we store the offset
>    instead of absolute numbers. This often leads to smaller numbers,
>    which also means fewer bytes to encode.

Sounds good. I wonder if you've thought about/considered a couple of
optimizations on top of this, or if they're possible. Both share the
same theme:

* Instead of adding a "same as last mask" adding "same as Nth
  mask". Something similar exists in the Sereal format (which also has
  other techniques you use, e.g. varint
  https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod#the-copy-tag)

  So instead of:

      <mask1><same><mask2><same><mask1><same> etc.

   You'd have:

      <mask1 (mark1)><same><mask2 (mark2)><same><insert: mark1><same> etc.

   I.e. when you have data that flip-flops a lot you can save space by
   saying "it's the same as existing earlier value at offset N". Maybe
   it doesn't make sense for this data, I don't know...

* For ctime/mtime presumably for dir paths, are these paths tolerant to
  or already out of glob() order? Then perhaps they can be pre-sorted so
  the compression or ctime/mtime offset compression is more effective.

> As a result of this, v5 reduces file size from 30% (git.git) to
> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
> size is reduced by 63%! A 8.4MB index file is _almost_ acceptable.
>
> Of course we trade off storage with cpu. We now need to spend more
> cycles writing or even reading (but still plenty fast compared to
> zlib). For reading, I'm counting on multi thread to hide away all this
> even if it becomes significant.

This would be a bigger change, but have we/you ever done a POC
experiment to see how much of this time is eaten up by zlib that
wouldn't be eaten up with some of the newer "fast but good enough"
compression algorithms, e.g. Snappy and Zstandard?

  parent reply	other threads:[~2019-02-14 10:02 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-13 12:08 [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4 Nguyễn Thái Ngọc Duy
2019-02-13 22:27 ` Junio C Hamano
2019-02-14 10:02 ` Ævar Arnfjörð Bjarmason [this message]
2019-02-14 10:14   ` Duy Nguyen
2019-02-15 20:22     ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bm3ek7qr.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).