From: Ævar Arnfjörð Bjarmason <firstname.lastname@example.org> To: Nguyễn Thái Ngọc Duy <email@example.com> Cc: firstname.lastname@example.org Subject: Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4 Date: Thu, 14 Feb 2019 11:02:52 +0100 Message-ID: <email@example.com> (raw) In-Reply-To: <firstname.lastname@example.org> On Wed, Feb 13 2019, Nguyễn Thái Ngọc Duy wrote: > Index file size more or less translates to write time because we hash > the entire file every time we update the index. And we update the index > quite often (automatically index refresh is done everywhere). This means > smaller index files are faster, especially true for very large > worktrees. > > Index v4 attempts to reduce file size by "prefix compressing" > paths. This reduces file size from 17% (git.git) to 41% (webkit.git, > deep hierarchy). > > Index v5 takes the same idea to the next level. Instead of compressing > just paths, based on the previous entry, we "compress" a lot more > fields. > > Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the > same most of the time. ctime should often be the same (or differs just > slightly). And sometimes mtime is the same as well. st_ino is also > always zero on Windows. We're storing a lot of duplicate values. > > Index v5 handles this This looks really promising. > - by adding a "same mask" per entry. If st_dev is the same as previous > entry, for instance, we set "st_dev is the same" flag and will not > store it at all, saving 31 bits per entry. > > - even when we store it, "varint" encoding is used. We should rarely > need to write out 4 bytes > > - for ctime and mtime, even if we have to store it, we store the offset > instead of absolute numbers. This often leads to smaller numbers, > which also means fewer bytes to encode. Sounds good. I wonder if you've thought about/considered a couple of optimizations on top of this, or if they're possible. Both share the same theme: * Instead of adding a "same as last mask" adding "same as Nth mask". Something similar exists in the Sereal format (which also has other techniques you use, e.g. varint https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod#the-copy-tag) So instead of: <mask1><same><mask2><same><mask1><same> etc. You'd have: <mask1 (mark1)><same><mask2 (mark2)><same><insert: mark1><same> etc. I.e. when you have data that flip-flops a lot you can save space by saying "it's the same as existing earlier value at offset N". Maybe it doesn't make sense for this data, I don't know... * For ctime/mtime presumably for dir paths, are these paths tolerant to or already out of glob() order? Then perhaps they can be pre-sorted so the compression or ctime/mtime offset compression is more effective. > As a result of this, v5 reduces file size from 30% (git.git) to > 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file > size is reduced by 63%! A 8.4MB index file is _almost_ acceptable. > > Of course we trade off storage with cpu. We now need to spend more > cycles writing or even reading (but still plenty fast compared to > zlib). For reading, I'm counting on multi thread to hide away all this > even if it becomes significant. This would be a bigger change, but have we/you ever done a POC experiment to see how much of this time is eaten up by zlib that wouldn't be eaten up with some of the newer "fast but good enough" compression algorithms, e.g. Snappy and Zstandard?
next prev parent reply index Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-02-13 12:08 Nguyễn Thái Ngọc Duy 2019-02-13 22:27 ` Junio C Hamano 2019-02-14 10:02 ` Ævar Arnfjörð Bjarmason [this message] 2019-02-14 10:14 ` Duy Nguyen 2019-02-15 20:22 ` Ben Peart
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style List information: http://vger.kernel.org/majordomo-info.html * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
email@example.com list mirror (unofficial, one of many) Archives are clonable: git clone --mirror https://public-inbox.org/git git clone --mirror http://ou63pmih66umazou.onion/git git clone --mirror http://czquwvybam4bgbro.onion/git git clone --mirror http://hjrcffqmbrq6wope.onion/git Newsgroups are available over NNTP: nntp://news.public-inbox.org/inbox.comp.version-control.git nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git nntp://news.gmane.org/gmane.comp.version-control.git note: .onion URLs require Tor: https://www.torproject.org/ or Tor2web: https://www.tor2web.org/ AGPL code for this site: git clone https://public-inbox.org/ public-inbox