git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Han-Wen Nienhuys <hanwen@google.com>
Cc: "Jakub Narębski" <jnareb@gmail.com>,
	"Abhradeep Chakraborty" <chakrabortyabhradeep79@gmail.com>,
	git@vger.kernel.org
Subject: Re: What is the status of GSoC 2022 work on making Git use roaring bitmaps?
Date: Mon, 31 Jul 2023 16:18:13 -0400	[thread overview]
Message-ID: <ZMgXBc5idN+sR3o1@nand.local> (raw)
In-Reply-To: <CAFQ2z_P2HkT8grAFk=6Mr05rJRfsh_sXypVFPyHr0v5xkcjYTA@mail.gmail.com>

On Mon, Jul 31, 2023 at 07:46:18PM +0200, Han-Wen Nienhuys wrote:
> On Sat, Mar 25, 2023 at 6:40 PM Jakub Narębski <jnareb@gmail.com> wrote:
> > >>> Abhradeep promised[1] that he'd include some performance work in his
> > >>> next version of that series. I think the main things we'd be interested
> > >>> in are:
> > >>>
> > >>>   - Does using Roaring provide a file-size advantage over
> > >>>     EWAH-compressed bitmaps?
>
> I modified JGit to write Roaring bitmaps instead of EWAH bitmaps. The
> resulting difference in file sizes are small, and actually favor EWAH:
>
> $ ls -l {ewah-repos,roaring-repos}/*.git/objects/pack/*.bitmap
> -r--r----- 1 hanwen primarygroup 26257386 Jul 31 15:04
> ewah-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap
> -r--r----- 1 hanwen primarygroup 27621579 Jul 31 15:20
> roaring-repos/android-pfb.git/objects/pack/pack-b14c35ec7fc3bb20884abe51a81c832be5983fdc.bitmap
>
> -r--r----- 1 hanwen primarygroup  1037356 Jul 31 14:46
> ewah-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap
> -r--r----- 1 hanwen primarygroup  1242608 Jul 31 14:45
> roaring-repos/gerrit.git/objects/pack/pack-fe46c7f96a2910f5775a2ff3bef7e4fa0e779f91.bitmap

This was one of my hopes with Roaring+Run, too, but I think that it's a
non-starter with our current object ordering for reachability bitmaps.

I did the same experiment a few months ago in my fork of git.git, and
consistently was able to produce smaller bitmaps when EWAH compressed as
compared to Roaring+Run. I think the reason is that our bitmaps are
pretty sparse, and so often have a lot of 0's, interspersed with a few
1's.

Depending on container alignment, a single 1's bit surrounded by zeros
will often get encoded as a length 0 "run" container. This means that
instead of storing that information in a single bit like we would with
EWAH, we use two 16-bit unsigned values to store (a) the position[^1],
and (b) length of the run.

The length is entirely wasted space, since the bytes are zeros, and
storing the position is significantly less efficient than storing a
sparse bit position in EWAH.

I haven't proved conclusively one way or the other where Roaring+Run is
significantly faster than EWAH or vice-versa. There are some cases where
the former is a clear winner, and other cases where it's the latter.

In any event, my extremely WIP patches to make this mostly work are
available here:

  https://github.com/ttaylorr/git/compare/tb/roaring-bitmaps

One thing that I was able to do to produce slightly smaller Roaring+Run
encoded .bitmap files is to store some of the bitmaps as XORs against
earlier bitmaps, similar to what we do with EWAH. Often XORing the raw
bits, and then compressing that with Roaring+Run can be significantly
smaller than storing another full Roaring+Run bitset.

That's the only part that I've had trouble getting to make it
consistently work, and I haven't had time to get back to it, so it's
been collecting dust since the end of May.

Thanks,
Taylor

[^1]: Not the bit position, exactly, but rather the 16 least-significant
  bits of the bit position, since all values in the same container share
  the 16 most-significant bits being a multiple of 2^16.

  reply	other threads:[~2023-07-31 20:18 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-23 19:26 What is the status of GSoC 2022 work on making Git use roaring bitmaps? Jakub Narębski
2023-03-23 20:26 ` Taylor Blau
2023-03-23 22:01   ` Jakub Narębski
2023-03-24  3:48     ` Abhradeep Chakraborty
2023-03-25 17:40       ` Jakub Narębski
2023-07-31 17:46         ` Han-Wen Nienhuys
2023-07-31 20:18           ` Taylor Blau [this message]
2023-08-01 11:26             ` Han-Wen Nienhuys
2023-08-01 11:34               ` Jakub Narębski
2023-08-01 11:54                 ` Han-Wen Nienhuys
2023-08-01 13:17                   ` Jakub Narębski
2023-08-01 17:33                 ` Taylor Blau
2023-08-01 17:43                   ` Jakub Narębski
2023-08-01 17:31               ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZMgXBc5idN+sR3o1@nand.local \
    --to=me@ttaylorr.com \
    --cc=chakrabortyabhradeep79@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=hanwen@google.com \
    --cc=jnareb@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).