git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Ben Peart <peartben@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>,
	Ævar Arnfjörð Bjarmason  <avarab@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
Date: Fri, 15 Feb 2019 15:22:12 -0500
Message-ID: <11875ebb-5a40-4c87-dce7-b337cc922100@gmail.com> (raw)
In-Reply-To: <CACsJy8DWXcBk3f3heZp5J7dhTM3JL4MeVco56j4WtJNeskz9pw@mail.gmail.com>



On 2/14/2019 5:14 AM, Duy Nguyen wrote:
> On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
>>> same most of the time. ctime should often be the same (or differs just
>>> slightly). And sometimes mtime is the same as well. st_ino is also
>>> always zero on Windows. We're storing a lot of duplicate values.
>>>
>>> Index v5 handles this
>>
>> This looks really promising.
> 
> I was going to reply to Junio. But it turns out I underestimated
> "varint" encoding overhead and it increases read time too much. I
> might get back and try some optimization when I'm bored, but until
> then this is yet another failed experiment.
> 
>>> As a result of this, v5 reduces file size from 30% (git.git) to
>>> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
>>> size is reduced by 63%! A 8.4MB index file is _almost_ acceptable.
>>>

Just for kicks, I tried this out on a couple of repos I have handy.

files	version	index size	%savings
200k	2	25,033,758	0.00%
	3	25,033,758	0.00%
	4	15,269,923	39.00%
	5	9,759,844	61.01%
			
3m	2	446,123,848	0.00%
	3	446,123,848	0.00%
	4	249,631,640	44.04%
	5	82,147,981	81.59%

The 81% savings is very impressive.  I didn't measure performance but 
not writing out an extra 167MB to disk has to help.

I'm definitely also interested in your 'sparse index' format ideas as in 
our 3M repos, there are typically only a few thousand that don't have 
the skip-worktree bit set.  I'm not sure if that is the same 'sparse' 
you had in mind but it would sure be nice!



I've also contemplated multi-threading the index write code path.  My 
thought was in the primary thread to allocate a buffer and when it is 
full have a background thread compute the SHA and write it to disk while 
the primary thread fills the next buffer.

I'm not sure how much it will buy us as I don't know the relative cost 
of computing the SHA/writing to disk vs filling the buffer.  I've 
suspected the filling the buffer thread would end up blocked on the 
background thread most of the time which is why I haven't tried it yet.

>>> Of course we trade off storage with cpu. We now need to spend more
>>> cycles writing or even reading (but still plenty fast compared to
>>> zlib). For reading, I'm counting on multi thread to hide away all this
>>> even if it becomes significant.
>>
>> This would be a bigger change, but have we/you ever done a POC
>> experiment to see how much of this time is eaten up by zlib that
>> wouldn't be eaten up with some of the newer "fast but good enough"
>> compression algorithms, e.g. Snappy and Zstandard?
> 
> I'm quite sure I tried zlib at some point, the only lasting impression
> I have is "not good enough". Other algorithms might improve a bit,
> perhaps on the uncompress/read side, but I find it unlikely we could
> reasonably compress like a hundred megabytes in a few dozen
> milliseconds (a quick google says Snappy compresses 250MB/s, so about
> 400ms per 100MB, too long). Splitting the files and compressing in
> parallel might help. But I will probably focus on "sparse index"
> approach before going that direction.
> 

      reply index

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-13 12:08 Nguyễn Thái Ngọc Duy
2019-02-13 22:27 ` Junio C Hamano
2019-02-14 10:02 ` Ævar Arnfjörð Bjarmason
2019-02-14 10:14   ` Duy Nguyen
2019-02-15 20:22     ` Ben Peart [this message]

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=11875ebb-5a40-4c87-dce7-b337cc922100@gmail.com \
    --to=peartben@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git