Re: [RFC] Metadata vs Generation Data Chunk

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Derrick Stolee <stolee@gmail.com>
To: Abhishek Kumar <abhishekkumar8222@gmail.com>, git@vger.kernel.org
Cc: jnareb@gmail.com
Subject: Re: [RFC] Metadata vs Generation Data Chunk
Date: Mon, 22 Jun 2020 10:46:27 -0400	[thread overview]
Message-ID: <7bd59f07-39e7-82c4-51c4-0ceba9f45999@gmail.com> (raw)
In-Reply-To: <20200622093451.GA1719@Abhishek-Arch>

On 6/22/2020 5:34 AM, Abhishek Kumar wrote:
> One of the remaining pre-requisites for implementing generation number
> v2 was distinguishing between corrected commit dates with monotonically
> increasing offsets and topological level without incrementing generation
> number version.
> 
> Two approaches were proposed [1]:
> 1. New chunk for commit data (generation data chunk, "GDAT")
> 2. Metadata/versioning chunk
> 
> Since both approaches have their advantages and disadvantages, I wrote
> up a prototype [2] to investigate their performance.
> 
> [1]: https://lore.kernel.org/git/86mu87qj92.fsf@gmail.com/ 
> [2]: https://github.com/abhishekkumar2718/git/pull/1
> 
> TL;DR: I recommend we should use generation data chunk approach.
> 
> Generation Data Chunk
> =====================
> 
> We could move the generation number v2 into a separate chunk, storing
> topological levels in CDAT and the corrected commit date into a new,
> "GDAT" chunk.  Thus, old Git would use generation number v1, and
> new Git would use corrected commit dates from GDAT.
> 
> Using generation data chunk has the advantage that we would no longer
> be restricted to using 30 bits for generation number.

It will be important to detect these 30-bit overflow and replace the
value with GENERATION_NUMBER_MAX and treat those commits as having
GENERATION_NUMBER_INFINITY(_V2) when parsing them.

This 30-bit overflow _is_ more likely in the approach we've considered
because the commit-date offsets will accumulate through the history.
Please test a check for this against repos with particularly nasty
commit-date problems like the Linux kernel repo. Perhaps it would be
worthwhile to output the maximum stored offset in the trace2 API so
we can see how close we are to overflowing, even if we do not.

In order to fully commit to this plan, we need to know how problematic
this overflow is. Generation number v1 required hundreds of millions
of commits in a row to overflow, but here we could overflow with a few
appropriate commit date problems. How common are they in the wild?

> It also works
> well for commit-graph chains with a mix of v1 and v2 generation numbers.

We do not want to mix commit-graph chains with v1 and v2 numbers,
because we want the comparison between two commits to be independent
of which commit-graph layer they came from. This will require us to
squash the layers if we are asked to write one version and the existing
chain does not match that version.

> However, it increases the time required for I/O as commit data and
> generation numbers are no longer contiguous.

I'm glad you were able to measure this, which makes the locality of
the data chunk a valuable feature.

> Note: While it also increases disk space required for storing
> commit-graph files by 8 bytes per commit, I don't consider it relevant,
> especially on modern systems. A repo of the size of Linux repo would be
> larger by a mere 7.2 Mb.

...relative to what? What's the percentage increase? It still is not
going to be too bad. I'm _guessing_ ~10-15%.

> Metadata / Versioning Chunk
> ===========================
> 
> We could also introduce an optional metadata chunk to store generation
> number version and store corrected date offsets in CDAT. Since the
> offsets are backward compatible, Old Git would still yield correct
> results by assuming the offsets to be topological levels. New Git would
> correctly use the offsets to create corrected commit dates.
> 
> It works just as well as generation number v1 in parsing and writing
> commit-graph files.
> 
> However, the generation numbers are still restricted to 30 bits in CDAT
> chunk and it does not work well with commit-graph chains with a mix of
> v1 and v2 generation numbers.
> 
> Performance
> ===========
> 
> | Command                        | Master | Metadata | Generation Data |
> |--------------------------------|--------|----------|-----------------|
> | git commit-graph write         | 14.45s | 14.28s   | 14.63s          |
> | git log --topo-order -10000    | 0.211s | 0.206s   | 0.208s          |
> | git log --topo-order -100 A..B | 0.019s | 0.015s   | 0.015s          |
> | git merge-base A..B            | 0.137s | 0.137s   | 0.137s          |
> 
> - Metadata and generation data chunks perform better than master on
>   using commit-graph files since they use corrected commit dates.
> 
> - The increased I/O time for parsing GDAT does not affect performance as
>   much as expected.
> 
> - Generation data commit-graph takes longer to write since more
>   information is written into the file.
>> As using the commit-graph is much more frequent than writing, we can
> consider both approaches to perform equally well.
> 
> I prefer generation data chunk approach as it also removes 30-bit length
> restriction on generation numbers.

Thanks for testing these approaches.

The other thing to keep in mind with the new data chunk is that we will
probably still want a 32-bit value, but it no longer needs to be
backwards compatible! We would keep generation number v1 in the data
chunk, but could place _any_ definition in the v2 chunk.

This means that the "offset" from the commit-date does not need to be
strictly increasing! This could help with the overflow issue if someone
decided to do a nasty commit-date order issue (say, a commit with
Unix epoch "0" pointing to a recent commit with "real" date).

Recall that a commit C has a commit-date (date(C)) given by the
commit date, and we are storing an additional offset (off(C)) with
the following requirement for all parents P:

	date(C) + off(C) > date(P) + off(P)

The backwards-compatible requirement is this:

		off(C) > off(P)

but we can drop that if we are using the new chunk. That allows
off(C) to be zero if we already have date(C) > date(P) + off(P),
which will naturally happen as time passes (barring malicious
users).

Since you have an initial prototype of these two approaches,
could you extend your Generation Data prototype to use this
non-backwards-compatible model and report the performance data?

I found that using different pairs of tags in the Linux kernel
repo to be particularly fruitful for A..B performance tests.
In particular, "git merge-base v4.8 v4.9" demonstrated the problems
with generation number v1 [1]. Please also update the
paint_down_to_common() method to use generation numbers so
we can see the benefit or drawback there. Perhaps use config or
an environment variable to change this:

	if (!min_generation)
		queue.compare = compare_commits_by_commit_date;

to this:

	if (!min_genertion && !git_env(...))
		queue.compare = compare_commits_by_commit_date;

Without comparing commit-date heuristics against your generation
number values, we will not be sure that we are moving in the
right direction here.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thanks,
-Stolee

next prev parent reply	other threads:[~2020-06-22 14:46 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-22  9:34 [RFC] Metadata vs Generation Data Chunk Abhishek Kumar
2020-06-22 13:40 ` Jakub Narębski
2020-06-22 14:46 ` Derrick Stolee [this message]
  -- strict thread matches above, loose matches on Subject: below --
2020-06-26 13:44 Abhishek Kumar
2020-06-26 14:40 ` Derrick Stolee
2020-06-30 15:00 Abhishek Kumar
2020-06-30 20:46 ` Derrick Stolee
2020-07-03  8:29   ` Abhishek Kumar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7bd59f07-39e7-82c4-51c4-0ceba9f45999@gmail.com \
    --to=stolee@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jnareb@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).