[RFC] Metadata vs Generation Data Chunk

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC] Metadata vs Generation Data Chunk
@ 2020-06-22  9:34 Abhishek Kumar
  2020-06-22 13:40 ` Jakub Narębski
  2020-06-22 14:46 ` Derrick Stolee
  0 siblings, 2 replies; 8+ messages in thread
From: Abhishek Kumar @ 2020-06-22  9:34 UTC (permalink / raw)
  To: git; +Cc: stolee, jnareb

One of the remaining pre-requisites for implementing generation number
v2 was distinguishing between corrected commit dates with monotonically
increasing offsets and topological level without incrementing generation
number version.

Two approaches were proposed [1]:
1. New chunk for commit data (generation data chunk, "GDAT")
2. Metadata/versioning chunk

Since both approaches have their advantages and disadvantages, I wrote
up a prototype [2] to investigate their performance.

[1]: https://lore.kernel.org/git/86mu87qj92.fsf@gmail.com/ 
[2]: https://github.com/abhishekkumar2718/git/pull/1

TL;DR: I recommend we should use generation data chunk approach.

Generation Data Chunk
=====================

We could move the generation number v2 into a separate chunk, storing
topological levels in CDAT and the corrected commit date into a new,
"GDAT" chunk.  Thus, old Git would use generation number v1, and
new Git would use corrected commit dates from GDAT.

Using generation data chunk has the advantage that we would no longer
be restricted to using 30 bits for generation number. It also works
well for commit-graph chains with a mix of v1 and v2 generation numbers.

However, it increases the time required for I/O as commit data and
generation numbers are no longer contiguous.

Note: While it also increases disk space required for storing
commit-graph files by 8 bytes per commit, I don't consider it relevant,
especially on modern systems. A repo of the size of Linux repo would be
larger by a mere 7.2 Mb.

Metadata / Versioning Chunk
===========================

We could also introduce an optional metadata chunk to store generation
number version and store corrected date offsets in CDAT. Since the
offsets are backward compatible, Old Git would still yield correct
results by assuming the offsets to be topological levels. New Git would
correctly use the offsets to create corrected commit dates.

It works just as well as generation number v1 in parsing and writing
commit-graph files.

However, the generation numbers are still restricted to 30 bits in CDAT
chunk and it does not work well with commit-graph chains with a mix of
v1 and v2 generation numbers.

Performance
===========

| Command                        | Master | Metadata | Generation Data |
|--------------------------------|--------|----------|-----------------|
| git commit-graph write         | 14.45s | 14.28s   | 14.63s          |
| git log --topo-order -10000    | 0.211s | 0.206s   | 0.208s          |
| git log --topo-order -100 A..B | 0.019s | 0.015s   | 0.015s          |
| git merge-base A..B            | 0.137s | 0.137s   | 0.137s          |

- Metadata and generation data chunks perform better than master on
  using commit-graph files since they use corrected commit dates.

- The increased I/O time for parsing GDAT does not affect performance as
  much as expected.

- Generation data commit-graph takes longer to write since more
  information is written into the file.

As using the commit-graph is much more frequent than writing, we can
consider both approaches to perform equally well.

I prefer generation data chunk approach as it also removes 30-bit length
restriction on generation numbers.

Thanks
Abhishek

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
  2020-06-22  9:34 [RFC] Metadata vs Generation Data Chunk Abhishek Kumar
@ 2020-06-22 13:40 ` Jakub Narębski
  2020-06-22 14:46 ` Derrick Stolee
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Narębski @ 2020-06-22 13:40 UTC (permalink / raw)
  To: Abhishek Kumar, git; +Cc: Jakub Narębski, Derrick Stolee

On 22.06.2020 at 11:34, Abhishek Kumar wrote:

> One of the remaining pre-requisites for implementing generation number
> v2 was distinguishing between corrected commit dates with monotonically
> increasing offsets and topological level without incrementing generation
> number version.
> 
> Two approaches were proposed [1]:
> 1. New chunk for commit data (generation data chunk, "GDAT")
> 2. Metadata/versioning chunk

Actually in [1] there was also proposed another distinct approach,
namely to 'rename' the "CDAT" chunk to something else, like "CDA2"
(or proposed here "GDAT").

If I read the code correctly, with old Git if one of required chunks
is missing then Git would continue work as if commit-graph was not
present -- as opposed to current handling of unknown commit-graph
file format version number, where Git would stop working with an
error message.

> 
> Since both approaches have their advantages and disadvantages, I wrote
> up a prototype [2] to investigate their performance.
> 
> [1]: https://lore.kernel.org/git/86mu87qj92.fsf@gmail.com/
> [2]: https://github.com/abhishekkumar2718/git/pull/1

That's very nice.  Thanks for investigating that.

> 
> TL;DR: I recommend we should use generation data chunk approach.
> 
> Generation Data Chunk
> =====================
> 
> We could move the generation number v2 into a separate chunk, storing
> topological levels in CDAT and the corrected commit date into a new,
> "GDAT" chunk.  Thus, old Git would use generation number v1, and
> new Git would use corrected commit dates from GDAT.
> 
> Using generation data chunk has the advantage that we would no longer
> be restricted to using 30 bits for generation number. It also works
> well for commit-graph chains with a mix of v1 and v2 generation numbers.
> 
> However, it increases the time required for I/O as commit data and
> generation numbers are no longer contiguous.
> 
> Note: While it also increases disk space required for storing
> commit-graph files by 8 bytes per commit, I don't consider it relevant,
> especially on modern systems. A repo of the size of Linux repo would be
> larger by a mere 7.2 Mb.

All right.

Another advantage: we don't have to store the corrected committer date
_offset_, we can store the date (as epoch) directly.  This saves some
calculation, though a minuscule amount.

Yet another advantage: we don't need backward-compatibility for
generation number v2, i.e. corrected commit date.

Another disadvantage: we need to compute both topological levels and
corrected commit date when creating a commit-graph file or a chunk of
it.  This means that `git commit-graph write` could be slightly more
expensive.

> 
> Metadata / Versioning Chunk
> ===========================
> 
> We could also introduce an optional metadata chunk to store generation
> number version and store corrected date offsets in CDAT. Since the
> offsets are backward compatible, Old Git would still yield correct
> results by assuming the offsets to be topological levels. New Git would
> correctly use the offsets to create corrected commit dates.

This also means that we need to use backward-compatible generation
number v2, that is corrected commit date with strictly monotonic
offsets.  Which may lead to more cases where 30 bits for label is not
enough, and thus worse performance (no detailed reachability
information for newest commits).

> 
> It works just as well as generation number v1 in parsing and writing
> commit-graph files.
> 
> However, the generation numbers are still restricted to 30 bits in CDAT
> chunk and it does not work well with commit-graph chains with a mix of
> v1 and v2 generation numbers.

CDAT chunk replaced with another chunk
======================================

In this approach the "CDAT" chunk is missing from the commit-graph file.
We can craft the replacement ("CDA2") however we like, but it can also
look like the "CDAT" chunk, only with the offsets for corrected commit
date rather than topological level for generation number part (30 bits).

If we choose to follow the "CDAT" format (modified), then the file
size would not change, and neither would change the amount of I/O
needed -- there would be the same access structure as in current
version.

There should be no confusion with a mix of v1 and v2 generation numbers.

The disadvantage is of course that older version of Git would no longer
be able to make use of serialized commit-graph if the file was written
by newer version of Git.

> 
> Performance
> ===========

What is the repository where those benchmarks took place?

> | Command                        | Master | Metadata | Generation Data |
> |--------------------------------|--------|----------|-----------------|
> | git commit-graph write         | 14.45s | 14.28s   | 14.63s          |
> | git log --topo-order -10000    | 0.211s | 0.206s   | 0.208s          |
> | git log --topo-order -100 A..B | 0.019s | 0.015s   | 0.015s          |
> | git merge-base A..B            | 0.137s | 0.137s   | 0.137s          |

Nice.

How those two (or three) approaches work on gen-test [3] test cases,
both in terms of commits walked (extracted via trace2 mechanism) and
actual wall-time clock, like in the result above?

[3]: https://github.com/derrickstolee/gen-test

> - Metadata and generation data chunks perform better than master on
>    using commit-graph files since they use corrected commit dates.
> 
> - The increased I/O time for parsing GDAT does not affect performance as
>    much as expected.
> 
> - Generation data commit-graph takes longer to write since more
>    information is written into the file.
> 
> As using the commit-graph is much more frequent than writing, we can
> consider both approaches to perform equally well.
> 
> I prefer generation data chunk approach as it also removes 30-bit length
> restriction on generation numbers.

Thank you for your work.

Best,

P.S. I hope I haven't sent it twice...
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
  2020-06-22  9:34 [RFC] Metadata vs Generation Data Chunk Abhishek Kumar
  2020-06-22 13:40 ` Jakub Narębski
@ 2020-06-22 14:46 ` Derrick Stolee
  1 sibling, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-06-22 14:46 UTC (permalink / raw)
  To: Abhishek Kumar, git; +Cc: jnareb

On 6/22/2020 5:34 AM, Abhishek Kumar wrote:
> One of the remaining pre-requisites for implementing generation number
> v2 was distinguishing between corrected commit dates with monotonically
> increasing offsets and topological level without incrementing generation
> number version.
> 
> Two approaches were proposed [1]:
> 1. New chunk for commit data (generation data chunk, "GDAT")
> 2. Metadata/versioning chunk
> 
> Since both approaches have their advantages and disadvantages, I wrote
> up a prototype [2] to investigate their performance.
> 
> [1]: https://lore.kernel.org/git/86mu87qj92.fsf@gmail.com/ 
> [2]: https://github.com/abhishekkumar2718/git/pull/1
> 
> TL;DR: I recommend we should use generation data chunk approach.
> 
> Generation Data Chunk
> =====================
> 
> We could move the generation number v2 into a separate chunk, storing
> topological levels in CDAT and the corrected commit date into a new,
> "GDAT" chunk.  Thus, old Git would use generation number v1, and
> new Git would use corrected commit dates from GDAT.
> 
> Using generation data chunk has the advantage that we would no longer
> be restricted to using 30 bits for generation number.

It will be important to detect these 30-bit overflow and replace the
value with GENERATION_NUMBER_MAX and treat those commits as having
GENERATION_NUMBER_INFINITY(_V2) when parsing them.

This 30-bit overflow _is_ more likely in the approach we've considered
because the commit-date offsets will accumulate through the history.
Please test a check for this against repos with particularly nasty
commit-date problems like the Linux kernel repo. Perhaps it would be
worthwhile to output the maximum stored offset in the trace2 API so
we can see how close we are to overflowing, even if we do not.

In order to fully commit to this plan, we need to know how problematic
this overflow is. Generation number v1 required hundreds of millions
of commits in a row to overflow, but here we could overflow with a few
appropriate commit date problems. How common are they in the wild?

> It also works
> well for commit-graph chains with a mix of v1 and v2 generation numbers.

We do not want to mix commit-graph chains with v1 and v2 numbers,
because we want the comparison between two commits to be independent
of which commit-graph layer they came from. This will require us to
squash the layers if we are asked to write one version and the existing
chain does not match that version.

> However, it increases the time required for I/O as commit data and
> generation numbers are no longer contiguous.

I'm glad you were able to measure this, which makes the locality of
the data chunk a valuable feature.

> Note: While it also increases disk space required for storing
> commit-graph files by 8 bytes per commit, I don't consider it relevant,
> especially on modern systems. A repo of the size of Linux repo would be
> larger by a mere 7.2 Mb.

...relative to what? What's the percentage increase? It still is not
going to be too bad. I'm _guessing_ ~10-15%.

> Metadata / Versioning Chunk
> ===========================
> 
> We could also introduce an optional metadata chunk to store generation
> number version and store corrected date offsets in CDAT. Since the
> offsets are backward compatible, Old Git would still yield correct
> results by assuming the offsets to be topological levels. New Git would
> correctly use the offsets to create corrected commit dates.
> 
> It works just as well as generation number v1 in parsing and writing
> commit-graph files.
> 
> However, the generation numbers are still restricted to 30 bits in CDAT
> chunk and it does not work well with commit-graph chains with a mix of
> v1 and v2 generation numbers.
> 
> Performance
> ===========
> 
> | Command                        | Master | Metadata | Generation Data |
> |--------------------------------|--------|----------|-----------------|
> | git commit-graph write         | 14.45s | 14.28s   | 14.63s          |
> | git log --topo-order -10000    | 0.211s | 0.206s   | 0.208s          |
> | git log --topo-order -100 A..B | 0.019s | 0.015s   | 0.015s          |
> | git merge-base A..B            | 0.137s | 0.137s   | 0.137s          |
> 
> - Metadata and generation data chunks perform better than master on
>   using commit-graph files since they use corrected commit dates.
> 
> - The increased I/O time for parsing GDAT does not affect performance as
>   much as expected.
> 
> - Generation data commit-graph takes longer to write since more
>   information is written into the file.
>> As using the commit-graph is much more frequent than writing, we can
> consider both approaches to perform equally well.
> 
> I prefer generation data chunk approach as it also removes 30-bit length
> restriction on generation numbers.

Thanks for testing these approaches.

The other thing to keep in mind with the new data chunk is that we will
probably still want a 32-bit value, but it no longer needs to be
backwards compatible! We would keep generation number v1 in the data
chunk, but could place _any_ definition in the v2 chunk.

This means that the "offset" from the commit-date does not need to be
strictly increasing! This could help with the overflow issue if someone
decided to do a nasty commit-date order issue (say, a commit with
Unix epoch "0" pointing to a recent commit with "real" date).

Recall that a commit C has a commit-date (date(C)) given by the
commit date, and we are storing an additional offset (off(C)) with
the following requirement for all parents P:

	date(C) + off(C) > date(P) + off(P)

The backwards-compatible requirement is this:

		off(C) > off(P)

but we can drop that if we are using the new chunk. That allows
off(C) to be zero if we already have date(C) > date(P) + off(P),
which will naturally happen as time passes (barring malicious
users).

Since you have an initial prototype of these two approaches,
could you extend your Generation Data prototype to use this
non-backwards-compatible model and report the performance data?

I found that using different pairs of tags in the Linux kernel
repo to be particularly fruitful for A..B performance tests.
In particular, "git merge-base v4.8 v4.9" demonstrated the problems
with generation number v1 [1]. Please also update the
paint_down_to_common() method to use generation numbers so
we can see the benefit or drawback there. Perhaps use config or
an environment variable to change this:

	if (!min_generation)
		queue.compare = compare_commits_by_commit_date;

to this:

	if (!min_genertion && !git_env(...))
		queue.compare = compare_commits_by_commit_date;

Without comparing commit-date heuristics against your generation
number values, we will not be sure that we are moving in the
right direction here.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
@ 2020-06-26 13:44 Abhishek Kumar
  2020-06-26 14:40 ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Abhishek Kumar @ 2020-06-26 13:44 UTC (permalink / raw)
  To: jnareb; +Cc: stolee, git

On 22.06.2020 at 13:40, Jakub Narębski wrote:
> On 22.06.2020 at 11:34, Abhishek Kumar wrote:
> 
> > One of the remaining pre-requisites for implementing generation number
> > v2 was distinguishing between corrected commit dates with monotonically
> > increasing offsets and topological level without incrementing generation
> > number version.
> > 
> > Two approaches were proposed [1]:
> > 1. New chunk for commit data (generation data chunk, "GDAT")
> > 2. Metadata/versioning chunk
> 
> Actually in [1] there was also proposed another distinct approach,
> namely to 'rename' the "CDAT" chunk to something else, like "CDA2"
> (or proposed here "GDAT").
> 
> If I read the code correctly, with old Git if one of required chunks
> is missing then Git would continue work as if commit-graph was not
> present -- as opposed to current handling of unknown commit-graph
> file format version number, where Git would stop working with an
> error message.
> 

Actually, v2.21.0 (and possibly others) segfault when they encounter a
commit-graph without CDAT chunk.

Here's what I did:
- Create a commit-graph, renaming "CDAT" to "CDA2" and otherwise
  identical.
- Call `git log`.
- Git segfaults after calling fill_commit_graph_info().

Diff for the hexdump of commit-graph files:

3c3
< 0000020 4443 3241 0000 0000 0000 a807 0000 0000
---
> 0000020 4443 5441 0000 0000 0000 a807 0000 0000
209,210c209,210
< 0000dd0 0000 0c00 cd5c d1f7 a34d d713 70d3 114e
< 0000de0 7b92 131c f605 3b92 856a 2318          
---
> 0000dd0 0000 0c00 cd5c d1f7 b474 d280 6b17 2eb2
> 0000de0 a660 9ed6 04f0 7bed 8cb0 3712          

(They also differ in the hash checksum, but I don't suppose that would
explain the segfault.)

With this, I presume "CDAT Chunk Replaced With Another Chunk" is no
longer feasible?

The commit-graph files:
"CDA2" - https://github.com/abhishekkumar2718/GSoC20/blob/master/commit-graph-cda2
"CDAT" - https://github.com/abhishekkumar2718/GSoC20/blob/master/commit-graph-cdat

> 
> > Since both approaches have their advantages and disadvantages, I wrote
> > up a prototype [2] to investigate their performance.
> > 
> > [1]: https://lore.kernel.org/git/86mu87qj92.fsf@gmail.com/
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> 
> That's very nice.  Thanks for investigating that.
> 
> > 
> > TL;DR: I recommend we should use generation data chunk approach.
> > 
> > Generation Data Chunk
> > =====================
> > 
> > We could move the generation number v2 into a separate chunk, storing
> > topological levels in CDAT and the corrected commit date into a new,
> > "GDAT" chunk.  Thus, old Git would use generation number v1, and
> > new Git would use corrected commit dates from GDAT.
> > 
> > Using generation data chunk has the advantage that we would no longer
> > be restricted to using 30 bits for generation number. It also works
> > well for commit-graph chains with a mix of v1 and v2 generation numbers.
> > 
> > However, it increases the time required for I/O as commit data and
> > generation numbers are no longer contiguous.
> > 
> > Note: While it also increases disk space required for storing
> > commit-graph files by 8 bytes per commit, I don't consider it relevant,
> > especially on modern systems. A repo of the size of Linux repo would be
> > larger by a mere 7.2 Mb.
> 
> All right.
> 
> Another advantage: we don't have to store the corrected committer date
> _offset_, we can store the date (as epoch) directly.  This saves some
> calculation, though a minuscule amount.
> 
> Yet another advantage: we don't need backward-compatibility for
> generation number v2, i.e. corrected commit date.
> 

That's actually a great point. This might save us some calculation as we
don't need to calculate the "correction offsets". 

> Another disadvantage: we need to compute both topological levels and
> corrected commit date when creating a commit-graph file or a chunk of
> it.  This means that `git commit-graph write` could be slightly more
> expensive.
> 
> > 
> > Metadata / Versioning Chunk
> > ===========================
> > 
> > We could also introduce an optional metadata chunk to store generation
> > number version and store corrected date offsets in CDAT. Since the
> > offsets are backward compatible, Old Git would still yield correct
> > results by assuming the offsets to be topological levels. New Git would
> > correctly use the offsets to create corrected commit dates.
> 
> This also means that we need to use backward-compatible generation
> number v2, that is corrected commit date with strictly monotonic
> offsets.  Which may lead to more cases where 30 bits for label is not
> enough, and thus worse performance (no detailed reachability
> information for newest commits).
> 

Yes, that's definitely possible. Dr. Stolee raised the same point too,
and I am working on using trace2 API to find the largest offset and the
number of commits for the tests.

> > 
> > It works just as well as generation number v1 in parsing and writing
> > commit-graph files.
> > 
> > However, the generation numbers are still restricted to 30 bits in CDAT
> > chunk and it does not work well with commit-graph chains with a mix of
> > v1 and v2 generation numbers.
> 
> 
> CDAT chunk replaced with another chunk
> ======================================
> 
> In this approach the "CDAT" chunk is missing from the commit-graph file.
> We can craft the replacement ("CDA2") however we like, but it can also
> look like the "CDAT" chunk, only with the offsets for corrected commit
> date rather than topological level for generation number part (30 bits).
> 
> If we choose to follow the "CDAT" format (modified), then the file
> size would not change, and neither would change the amount of I/O
> needed -- there would be the same access structure as in current
> version.
> 
> There should be no confusion with a mix of v1 and v2 generation numbers.
> 
> The disadvantage is of course that older version of Git would no longer
> be able to make use of serialized commit-graph if the file was written
> by newer version of Git.
> 

Another disadvantage is that the CDA2 chunk yet again limits the size of
generation number. Future generation numbers will be restricted to 64
bits by the design of CDA2.

> > 
> > Performance
> > ===========
> 
> What is the repository where those benchmarks took place?

It's based on the linux repository.

> 
> > | Command                        | Master | Metadata | Generation Data |
> > |--------------------------------|--------|----------|-----------------|
> > | git commit-graph write         | 14.45s | 14.28s   | 14.63s          |
> > | git log --topo-order -10000    | 0.211s | 0.206s   | 0.208s          |
> > | git log --topo-order -100 A..B | 0.019s | 0.015s   | 0.015s          |
> > | git merge-base A..B            | 0.137s | 0.137s   | 0.137s          |
> 
> Nice.
> 
> How those two (or three) approaches work on gen-test [3] test cases,
> both in terms of commits walked (extracted via trace2 mechanism) and
> actual wall-time clock, like in the result above?

Sure, will do.

> 
> [3]: https://github.com/derrickstolee/gen-test
> 
> > - Metadata and generation data chunks perform better than master on
> >    using commit-graph files since they use corrected commit dates.
> > 
> > - The increased I/O time for parsing GDAT does not affect performance as
> >    much as expected.
> > 
> > - Generation data commit-graph takes longer to write since more
> >    information is written into the file.
> > 
> > As using the commit-graph is much more frequent than writing, we can
> > consider both approaches to perform equally well.
> > 
> > I prefer generation data chunk approach as it also removes 30-bit length
> > restriction on generation numbers.
> 
> Thank you for your work.
> 
> Best,
> 
> P.S. I hope I haven't sent it twice...
> -- 
> Jakub Narębski

Thanks
- Abhishek

----------

Mutt has been acting weirdly for replies sent using "mail" option. This
mail is in reply to the following post:

https://lore.kernel.org/git/10d99e37-8870-6401-d9aa-21a359b30835@gmail.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
  2020-06-26 13:44 Abhishek Kumar
@ 2020-06-26 14:40 ` Derrick Stolee
  0 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-06-26 14:40 UTC (permalink / raw)
  To: 10d99e37-8870-6401-d9aa-21a359b30835, jnareb; +Cc: git

On 6/26/2020 9:44 AM, Abhishek Kumar wrote:
> On 22.06.2020 at 13:40, Jakub Narębski wrote:
>> On 22.06.2020 at 11:34, Abhishek Kumar wrote:
>>
>>> One of the remaining pre-requisites for implementing generation number
>>> v2 was distinguishing between corrected commit dates with monotonically
>>> increasing offsets and topological level without incrementing generation
>>> number version.
>>>
>>> Two approaches were proposed [1]:
>>> 1. New chunk for commit data (generation data chunk, "GDAT")
>>> 2. Metadata/versioning chunk
>>
>> Actually in [1] there was also proposed another distinct approach,
>> namely to 'rename' the "CDAT" chunk to something else, like "CDA2"
>> (or proposed here "GDAT").
>>
>> If I read the code correctly, with old Git if one of required chunks
>> is missing then Git would continue work as if commit-graph was not
>> present -- as opposed to current handling of unknown commit-graph
>> file format version number, where Git would stop working with an
>> error message.
>>
> 
> Actually, v2.21.0 (and possibly others) segfault when they encounter a
> commit-graph without CDAT chunk.

Yes, the CDAT chunk is absolutely necessary. It also includes data such as
the commit-date, root tree id, and parent information.

> With this, I presume "CDAT Chunk Replaced With Another Chunk" is no
> longer feasible?

"Replace" was never on the table (in my mind). Instead, we can
consider _adding_ a new chunk that contains the generation number
v2 data.


Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
@ 2020-06-30 15:00 Abhishek Kumar
  2020-06-30 20:46 ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Abhishek Kumar @ 2020-06-30 15:00 UTC (permalink / raw)
  To: abhishekkumar8222; +Cc: jnareb, stolee, git

Hello everyone,

I have finished all tests that I can think of and instead of throwing
performance numbers at you, I am going to try and tell you a story.

WRITING COMMIT GRAPH
====================

The run time of writing a commit-graph is affected by time taken to
compute generation number(s) and write them to the file.

We had the following possible ideas for writing commit-graph:

MO: Calculate generation number V5 and write the offset into CDAT
    (along with a metadata chunk).

G5O: Calculate generation number V5 and write the corrected date into
    Generation Data chunk.

G5D: Calculate generation number V5 and write the offset into Generation
    Data chunk.

G3IO: Calculate generation number V3 and write GENERATION_NUMBER_MAX into
    CDAT and the offset into Generation Data chunk.

G3ID: Calculate generation number V3 and write GENERATION_NUMBER_MAX into
    CDAT and the corrected date into Generation Data chunk.

G3TO: Calculate generation number V3 and generation number V0, write
    topological levels into CDAT and the offset into Generation Data
    chunk.

G3TD: Calculate generation number V3 and generation number V0, write
    topological levels into CDAT and the offset into Generation Data
    chunk.

Key:
- M -> Metadata, G -> Generation
- 5 -> Generation Number V5, 3 -> Generation Number V3,
- T -> Topological level, I (for infinity) -> Generation Number Max
- O -> Offset, D -> Date

On the Linux repo with HEAD at 08bf1a27, time is taken by different
versions:

| Version | time Taken      |
|---------|-----------------|
| master  | 14.200s         |
| MO      | 14.094s         |
| G5D     | 14.406s         |
| G5O     | 14.323s         |
| G3IO    | 14.175s         |
| G3ID    | 14.258s         |
| G3TO    | 14.331s         |
| G3TD    | 14.419s         |

Observations:

- Writing offset (i.e., 32-bits) is faster than writing corrected date 
  (which are 64-bit long). However, we would have to deal with overflow
  issues.

  The largest offset value observed was of the order 2 ** 29. Assuming
  no further "skips" in offset, we can store around ~635 million
  commits without overflows. This, however, might not be true for other
  repositories.

  A repository with a commit with Unix epoch '0' will overflow.
  If we choose some sane default like commit date must be higher than
  the release date of git, we can store around 110 million commits even
  with '0' epoch.

- Calculating topological levels with generation number V3 takes nearly
  the same time as calculating generation number V5.

- The commit-graph file is larger by 4 Mb (+8%) and 8 Mb (+16%) when
  storing offsets and dates.

Reading Commit Graph
====================

The time taken in using commit-graph to answer queries depends on:
- The number of commits walked.
- The time taken to load commit data from commit-graph.

Number of commits walked for different commands:

| Command                         | Master |   V3   |   V5   |
|---------------------------------|--------|--------|--------|
| log --topo-order -10000         |  48011 |  49954 |  49933 |
| log --topo-order -100 v5.4 v5.5 |   3794 |   3314 |   3312 |
| log --topo-order -100 v4.8 v4.9 |   5736 |   3887 |   3625 |
| merge-base v5.4 v5.5            |  55053 |  57097 |  55109 |
| merge-base v4.8 v4.9            | 635579 | 167468 | 506577 |

- For log --topo-order, V3, and V5 walk 35% fewer commits than
  the topological level when comparing v4.8 and v4.9.

- V3 walks far fewer commits than V5 when comparing by generation
  numbers and then date for paint_down_to_common(). This is unusual
  and is probably due to my implementation. Based on log --topo-order,
  we might expect V5 to perform better than V3.
 
- V3 walks the same number of commits when compared using commit
  date for merge-base.

The time taken for each command rough corresponds to the number of
commits walked, with no difference between Metadata chunk and Generation
Data chunk approaches.

| Command                         | Master  |    V3   |   V5   |
|---------------------------------|---------|---------|--------|
| log --topo-order -10000         |  0.210s |  0.209s | 0.207s |
| log --topo-order -100 v5.4 v5.5 |  0.013s |  0.013s | 0.013s |
| log --topo-order -100 v4.8 v4.9 |  0.015s |  0.013s | 0.013s |
| merge-base v5.4 v5.5            |  0.048s |  0.047s | 0.047s |
| merge-base v4.8 v4.9            |  0.135s |  0.133s | 0.134s |

CONCLUSION
==========

With all this, my inital recommendation becomes more specific as:

- If overflows are unlikely to happen and not accounted for, implement
  generation number V5 using metadata chunk. It has the lowest write
  time and walks just as few commits as V3.

- If overflows are a concern, implement generation number V5 with
  generation data chunk, storing offsets in generation data chunk.

Thanks
- Abhishek

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
  2020-06-30 15:00 Abhishek Kumar
@ 2020-06-30 20:46 ` Derrick Stolee
  2020-07-03  8:29   ` Abhishek Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2020-06-30 20:46 UTC (permalink / raw)
  To: 20200622093451.GA1719, abhishekkumar8222; +Cc: jnareb, git

On 6/30/2020 11:00 AM, Abhishek Kumar wrote:
> Hello everyone,
> 
> I have finished all tests that I can think of and instead of throwing
> performance numbers at you, I am going to try and tell you a story.
> 
> WRITING COMMIT GRAPH
> ====================
> 
> The run time of writing a commit-graph is affected by time taken to
> compute generation number(s) and write them to the file.

As you continue, you use V0, V3, and V5 without (re)introducing
them. Please ensure these are defined well. This message has "Re:"
in the title, but was removed from the thread somehow in my
inbox. I don't see a definition of these terms in the thread [1],
either.

[1] https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/T/#u

IIRC, we have these modes:

V0: The current generation number gen(C) defined as one more than the
    maximum gen(P) among parents P of C.

V3: The corrected commit-date given by ccd_3(C) = offset_3(C) + date(C)
    where offset_3(C) is the minimum non-negative integer such that
    offset_3(C) + date(C) > offset_3(P) + date(P) for all parents P of C.

V5: The backwards-compatible corrected commit-date is given by
    ccd_5(C) = offset_5(C) + date(C) where the inequality
    offset_5(C) + date(C) > offset_5(P) + date(P) holds, but also
    offset_5(C) > offset_(P) for all parents P of C.

Please correct me if I am wrong about these definitions.

> We had the following possible ideas for writing commit-graph:

The options here seem to be a little bit of overkill. Here are the
ones that I would seriously consider:

> MO: Calculate generation number V5 and write the offset into CDAT
>     (along with a metadata chunk).
>
> G3TO: Calculate generation number V3 and generation number V0, write
>     topological levels into CDAT and the offset into Generation Data
>     chunk.
> 
> G3TD: Calculate generation number V3 and generation number V0, write
>     topological levels into CDAT and the offset into Generation Data
>     chunk.

I think this last option you meant s/the offset/the corrected date/.
This means you are storing the date+offset in the file instead of just
the offset.

> | Version | time Taken      |
> |---------|-----------------|
> | master  | 14.200s         |
> | MO      | 14.094s         |
> | G3TO    | 14.331s         |
> | G3TD    | 14.419s         |

I don't consider write _time_ to be a huge issue, especially not with
these relative values. The SPACE taken by the commit-graph is more
important, and is probably a higher relative change than this time.

> Observations:
> 
> - Writing offset (i.e., 32-bits) is faster than writing corrected date 
>   (which are 64-bit long). However, we would have to deal with overflow
>   issues.

Please keep in mind that the CDAT chunk has a 64-bit "column" that
is split between commit date (34 bits) and generation number (30 bits).
So having 32-bit offsets in a new chunk adds some extra flexibility.

>   The largest offset value observed was of the order 2 ** 29. Assuming
>   no further "skips" in offset, we can store around ~635 million
>   commits without overflows. This, however, might not be true for other
>   repositories.

Was this the largest required offset for a single commit->parent skew,
or the cumulative effect of many offsets needing to grow, along with
the natural generation number growth? I'm guessing this is just the
largest offset that you computed in your write.

Since V3 doesn't require offset_3(C) > offset_3(P), the offsets should
naturally "fix" themselves over time by decreasing as time moves ahead.
So in that sense, I'm rather surprised to see that a 2^29 offset is
necessary.

This does make it seem like V5 should be abandoned because the CDAT
is running out of room for these offsets, so repos that grow enough
to overflow will lose any performance benefits from this approach.

>   A repository with a commit with Unix epoch '0' will overflow.
>   If we choose some sane default like commit date must be higher than
>   the release date of git, we can store around 110 million commits even
>   with '0' epoch.

This really depends on a commit with date '0' having a parent with a
~30-bit timestamp, OR having many of these kinds of shifts with more
reasonable dates.

> - Calculating topological levels with generation number V3 takes nearly
>   the same time as calculating generation number V5.
> 
> - The commit-graph file is larger by 4 Mb (+8%) and 8 Mb (+16%) when
>   storing offsets and dates.

8-16% is noticeable, but also not too bad. Thanks for reporting
these stats!

> Number of commits walked for different commands:
> 
> | Command                         | Master |   V3   |   V5   |
> |---------------------------------|--------|--------|--------|
> | log --topo-order -10000         |  48011 |  49954 |  49933 |
> | log --topo-order -100 v5.4 v5.5 |   3794 |   3314 |   3312 |
> | log --topo-order -100 v4.8 v4.9 |   5736 |   3887 |   3625 |
> | merge-base v5.4 v5.5            |  55053 |  57097 |  55109 |
> | merge-base v4.8 v4.9            | 635579 | 167468 | 506577 |

This last data point is extremely valuable. I think the fact
that V5 is closer to V0 here speaks to the negative effect
of requiring offset_3(C) > offset_3(P).

> - For log --topo-order, V3, and V5 walk 35% fewer commits than
>   the topological level when comparing v4.8 and v4.9.
> 
> - V3 walks far fewer commits than V5 when comparing by generation
>   numbers and then date for paint_down_to_common(). This is unusual
>   and is probably due to my implementation. Based on log --topo-order,
>   we might expect V5 to perform better than V3.

The v4.8 to v4.9 case is particularly nasty in a way that
doesn't cause issues with topo-order walks. This is the
big reason we are pursuing this at all!

> - V3 walks the same number of commits when compared using commit
>   date for merge-base.
> 
> The time taken for each command rough corresponds to the number of
> commits walked, with no difference between Metadata chunk and Generation
> Data chunk approaches.
> 
> | Command                         | Master  |    V3   |   V5   |
> |---------------------------------|---------|---------|--------|
> | log --topo-order -10000         |  0.210s |  0.209s | 0.207s |
> | log --topo-order -100 v5.4 v5.5 |  0.013s |  0.013s | 0.013s |
> | log --topo-order -100 v4.8 v4.9 |  0.015s |  0.013s | 0.013s |
> | merge-base v5.4 v5.5            |  0.048s |  0.047s | 0.047s |
> | merge-base v4.8 v4.9            |  0.135s |  0.133s | 0.134s |

Are you sure these timings are correct for the last row? It
seems that the times would be more significantly different
for the number of commits that were walked. Keep in mind that
we should be reporting these merge-base timings when
paint_down_to_common() has its priority queue set to pull
the maximum generation number (V0, V3, or V5) in order to
really compare. It seems like that was done correctly in
the commit count, but this data seems to imply something
else.

> - If overflows are unlikely to happen and not accounted for, implement
>   generation number V5 using metadata chunk. It has the lowest write
>   time and walks just as few commits as V3.

(For the "merge-base v4.8 v4.9" case, this is untrue.)

> - If overflows are a concern, implement generation number V5 with
>   generation data chunk, storing offsets in generation data chunk.

Here is an excellent time to be REALLY clear about what you mean
about V3 and V5.

Do you have these prototypes available somewhere so others can
test? That would also help in verifying these results. It
would be enough to push the code to a public Git hosting service
so we can fetch your commits, build, and test. These don't need
to be submission-ready patches.

Thanks for doing all this work! We are getting closer and closer
to a final direction.

-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Metadata vs Generation Data Chunk
  2020-06-30 20:46 ` Derrick Stolee
@ 2020-07-03  8:29   ` Abhishek Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Abhishek Kumar @ 2020-07-03  8:29 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: 20200622093451.GA1719, git, jnareb

On Tue, Jun 30, 2020 at 04:46:46PM -0400, Derrick Stolee wrote:
> On 6/30/2020 11:00 AM, Abhishek Kumar wrote:
> > Hello everyone,
> > 
> > I have finished all tests that I can think of and instead of throwing
> > performance numbers at you, I am going to try and tell you a story.
> > 
> > WRITING COMMIT GRAPH
> > ====================
> > 
> > The run time of writing a commit-graph is affected by time taken to
> > compute generation number(s) and write them to the file.
> 
> As you continue, you use V0, V3, and V5 without (re)introducing
> them. Please ensure these are defined well. This message has "Re:"
> in the title, but was removed from the thread somehow in my
> inbox. I don't see a definition of these terms in the thread [1],
> either.
> 
> [1] https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/T/#u
> 
> IIRC, we have these modes:
> 
> V0: The current generation number gen(C) defined as one more than the
>     maximum gen(P) among parents P of C.
> 
> V3: The corrected commit-date given by ccd_3(C) = offset_3(C) + date(C)
>     where offset_3(C) is the minimum non-negative integer such that
>     offset_3(C) + date(C) > offset_3(P) + date(P) for all parents P of C.
> 
> V5: The backwards-compatible corrected commit-date is given by
>     ccd_5(C) = offset_5(C) + date(C) where the inequality
>     offset_5(C) + date(C) > offset_5(P) + date(P) holds, but also
>     offset_5(C) > offset_(P) for all parents P of C.
> 
> Please correct me if I am wrong about these definitions.
> 

Yes, these are correct.

> > We had the following possible ideas for writing commit-graph:
> 
> The options here seem to be a little bit of overkill. Here are the
> ones that I would seriously consider:
>  
> > MO: Calculate generation number V5 and write the offset into CDAT
> >     (along with a metadata chunk).
> >
> > G3TO: Calculate generation number V3 and generation number V0, write
> >     topological levels into CDAT and the offset into Generation Data
> >     chunk.
> > 
> > G3TD: Calculate generation number V3 and generation number V0, write
> >     topological levels into CDAT and the offset into Generation Data
> >     chunk.
> 
> I think this last option you meant s/the offset/the corrected date/.
> This means you are storing the date+offset in the file instead of just
> the offset.

Right, thanks for correcting.

> 
> > | Version | time Taken      |
> > |---------|-----------------|
> > | master  | 14.200s         |
> > | MO      | 14.094s         |
> > | G3TO    | 14.331s         |
> > | G3TD    | 14.419s         |
> 
> I don't consider write _time_ to be a huge issue, especially not with
> these relative values. The SPACE taken by the commit-graph is more
> important, and is probably a higher relative change than this time.
>  
> > Observations:
> > 
> > - Writing offset (i.e., 32-bits) is faster than writing corrected date 
> >   (which are 64-bit long). However, we would have to deal with overflow
> >   issues.
> 
> Please keep in mind that the CDAT chunk has a 64-bit "column" that
> is split between commit date (34 bits) and generation number (30 bits).
> So having 32-bit offsets in a new chunk adds some extra flexibility.

Huh, I didn't think of that at all. Do we have some possible use in
mind?

> 
> >   The largest offset value observed was of the order 2 ** 29. Assuming
> >   no further "skips" in offset, we can store around ~635 million
> >   commits without overflows. This, however, might not be true for other
> >   repositories.
> 
> Was this the largest required offset for a single commit->parent skew,
> or the cumulative effect of many offsets needing to grow, along with
> the natural generation number growth? I'm guessing this is just the
> largest offset that you computed in your write.
> 
> Since V3 doesn't require offset_3(C) > offset_3(P), the offsets should
> naturally "fix" themselves over time by decreasing as time moves ahead.
> So in that sense, I'm rather surprised to see that a 2^29 offset is
> necessary.
> 

This was the cumulative effect of both of offsets and the generation
number growth. If we require just ccd(C) > ccd(P), the largest offset
was 25960236, merely of the order 2 ^ 25.

> This does make it seem like V5 should be abandoned because the CDAT
> is running out of room for these offsets, so repos that grow enough
> to overflow will lose any performance benefits from this approach.
> 
> >   A repository with a commit with Unix epoch '0' will overflow.
> >   If we choose some sane default like commit date must be higher than
> >   the release date of git, we can store around 110 million commits even
> >   with '0' epoch.
> 
> This really depends on a commit with date '0' having a parent with a
> ~30-bit timestamp, OR having many of these kinds of shifts with more
> reasonable dates.
> 
> > - Calculating topological levels with generation number V3 takes nearly
> >   the same time as calculating generation number V5.
> > 
> > - The commit-graph file is larger by 4 Mb (+8%) and 8 Mb (+16%) when
> >   storing offsets and dates.
> 
> 8-16% is noticeable, but also not too bad. Thanks for reporting
> these stats!
> 
> > Number of commits walked for different commands:
> > 
> > | Command                         | Master |   V3   |   V5   |
> > |---------------------------------|--------|--------|--------|
> > | log --topo-order -10000         |  48011 |  49954 |  49933 |
> > | log --topo-order -100 v5.4 v5.5 |   3794 |   3314 |   3312 |
> > | log --topo-order -100 v4.8 v4.9 |   5736 |   3887 |   3625 |
> > | merge-base v5.4 v5.5            |  55053 |  57097 |  55109 |
> > | merge-base v4.8 v4.9            | 635579 | 167468 | 506577 |
> 
> This last data point is extremely valuable. I think the fact
> that V5 is closer to V0 here speaks to the negative effect
> of requiring offset_3(C) > offset_3(P).
>  

Woah! I was working more or less with the assumption that V5 performs at
least as well as V3 and blamed my implementation of it.

> > - For log --topo-order, V3, and V5 walk 35% fewer commits than
> >   the topological level when comparing v4.8 and v4.9.
> > 
> > - V3 walks far fewer commits than V5 when comparing by generation
> >   numbers and then date for paint_down_to_common(). This is unusual
> >   and is probably due to my implementation. Based on log --topo-order,
> >   we might expect V5 to perform better than V3.
> 
> The v4.8 to v4.9 case is particularly nasty in a way that
> doesn't cause issues with topo-order walks. This is the
> big reason we are pursuing this at all!
> 
>
> > - V3 walks the same number of commits when compared using commit
> >   date for merge-base.
> > 
> > The time taken for each command rough corresponds to the number of
> > commits walked, with no difference between Metadata chunk and Generation
> > Data chunk approaches.
> > 
> > | Command                         | Master  |    V3   |   V5   |
> > |---------------------------------|---------|---------|--------|
> > | log --topo-order -10000         |  0.210s |  0.209s | 0.207s |
> > | log --topo-order -100 v5.4 v5.5 |  0.013s |  0.013s | 0.013s |
> > | log --topo-order -100 v4.8 v4.9 |  0.015s |  0.013s | 0.013s |
> > | merge-base v5.4 v5.5            |  0.048s |  0.047s | 0.047s |
> > | merge-base v4.8 v4.9            |  0.135s |  0.133s | 0.134s |
> 
> Are you sure these timings are correct for the last row? It
> seems that the times would be more significantly different
> for the number of commits that were walked. Keep in mind that
> we should be reporting these merge-base timings when
> paint_down_to_common() has its priority queue set to pull
> the maximum generation number (V0, V3, or V5) in order to
> really compare. It seems like that was done correctly in
> the commit count, but this data seems to imply something
> else.

I hadn't timed it correctly. Forgot to set the env variable for stric
comparision by generation numbers.

The number of commits walked:

| merge-base v5.4 v5.5 |  55053 |  55109 |  57098 |
| merge-base v4.8 v4.9 | 635579 | 506577 | 167496 |

The timings:

| merge-base v5.4 v5.5 | 0.046s | 0.046s | 0.049s |
| merge-base v4.8 v4.9 | 0.670s | 0.499s | 0.157s |

Which neatly corresponds to the number of commits walked.

> 
> > - If overflows are unlikely to happen and not accounted for, implement
> >   generation number V5 using metadata chunk. It has the lowest write
> >   time and walks just as few commits as V3.
> 
> (For the "merge-base v4.8 v4.9" case, this is untrue.)
>  
> > - If overflows are a concern, implement generation number V5 with
> >   generation data chunk, storing offsets in generation data chunk.
> 
> Here is an excellent time to be REALLY clear about what you mean
> about V3 and V5.

With the clarity that corrected commit dates with monotonic offset does
not perform better than corrected commit dates (and my implementation
was probably correct), we should implement store topological levels into
CDAT and the (not strictly monotonic) offsets into generation data
chunk.

> 
> Do you have these prototypes available somewhere so others can
> test? That would also help in verifying these results. It
> would be enough to push the code to a public Git hosting service
> so we can fetch your commits, build, and test. These don't need
> to be submission-ready patches.
> 

Yeah, I do. Added better instructions and more information now.

It should build, and you can replicate the tests by setting $linux_dir
and running gen_perf.sh. The script would print the timing results and
create tracer reports with filenames `master.perf`, `gen_v3.perf` and
`gen_v5.perf`.

[1]: https://github.com/abhishekkumar2718/git/pull/1

> Thanks for doing all this work! We are getting closer and closer
> to a final direction.
> 
> -Stolee

Thanks for the reviews and the discussion!
Abhishek

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-07-03  8:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22  9:34 [RFC] Metadata vs Generation Data Chunk Abhishek Kumar
2020-06-22 13:40 ` Jakub Narębski
2020-06-22 14:46 ` Derrick Stolee
  -- strict thread matches above, loose matches on Subject: below --
2020-06-26 13:44 Abhishek Kumar
2020-06-26 14:40 ` Derrick Stolee
2020-06-30 15:00 Abhishek Kumar
2020-06-30 20:46 ` Derrick Stolee
2020-07-03  8:29   ` Abhishek Kumar

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).