git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / Atom feed
From: jnareb@gmail.com (Jakub Narębski)
To: Abhishek Kumar <abhishekkumar8222@gmail.com>
Cc: git@vger.kernel.org,
	"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Taylor Blau" <me@ttaylorr.com>,
	"Jakub Narębski" <jnareb@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v3 11/11] doc: add corrected commit date info
Date: Thu, 27 Aug 2020 14:43:53 +0200
Message-ID: <85o8mwb6nq.fsf@gmail.com> (raw)
In-Reply-To: <20200827063951.GA16268@Abhishek-Arch> (Abhishek Kumar's message of "Thu, 27 Aug 2020 12:09:51 +0530")

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]

>>> +    * This list of 4-byte values store corrected commit date offsets for the
>>> +      commits, arranged in the same order as commit data chunk.
>>
>> I have just realized purely theoretical, but possible, problem with
>> storing non-monotinic generation number related values like corrected
>> commit date offset in constrained space.  There are problems with
>> clamping them.
>>
>> Say that somewhere in the ancestry chain there is a commit A with commit
>> date far in the future by mistake, for example 2120-08-22; it is
>> important for that date to be not able to be represented using uint32_t.
>> Say that a later descendant commit B is malformed, and has committer
>> date of 0, that is 1970-01-01. This means that the corrected commit date
>> for B must be larger than 2120-08-22 - which for this commit means that
>> corrected commit date offset do not fit in 32 bits, and must be clamped
>> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
>>
>> Say that we have commit C that is child of B, and it has correct commit
>> date.  Because of mistake in commit A, it has corrected commit date of
>> more than 2120-08-22 (corrected commit date degenerated into topological
>> level plus constant).
>>
>> Now C can reach B, and B can reach A.  However, if we recover corrected
>> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
>> we get a number that is smaller than correct corrected commit date.  We
>> will have
>>
>>    gen(A) > date(B) + offset(B) < gen(C)
>>
>> Which breaks reachability condition guarantee.
>>
>> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
>> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
>> we would get
>>
>>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
>>
>> And again reachability condition is broken.
>>
>> This is a very contrived but possible example.  This shouldn't happen,
>> but ufortunately it can happen.
>>
>
> Yes, that's very unfortunate.
>
> Here's a much simpler example:
>
> A commit P has an reasonable commit date (i.e. after release of Git to
> present) D and has a child commit C with committer date 0. Now, the
> corrected commiter date of C would D + 1 and the offset would be same too,
> as the committer date is zero. This overflows as reasonable dates are of
> the order 2 ^ 34.

No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value,
so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07),
which is related to storing date as a _signed_ 32-bit integer

The current-ish Unix epoch time is 1598524281 - let's use it for value
of D.  Then the offset for commit C would be 1598524282.  The current
proposal uses 32 bits to store commit date offsets (as unsigned value).
The maximum value of offset that we can store is therefore 2^32 - 1,
which is 4294967295.

   corrected commit date offset(C) = 1,598,524,282
   GENERATION_NUMBER_V2_MAX        = 4,294,967,295

As you can see there is no overflow in the simplified example.

>>
>> The question is how to deal with this issue.  Ignore it as unlikely?
>> Switch to storing corrected commit date, which is monotonic, so if there
>> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
>> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
>> larger commit-graph file?
>>
>
> To be honest, I would prefer storing corrected committer dates over
> storing offsets.
>
> While it is 7% of the size of commit-graph file, it is also *only* around
> ~3.5 MB for a repository of the size of linux kernel (and IIRC
> correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

It is up to 7% of per-commit data, and it doesn't take into account EDGE
chunk (for octopus merges), and it doesn't also take into account the
size of changed-paths Bloom filters data take in the commit-graph.

> Minimizing space and memory requirements are a top priority, but
> shouldn't making sure our program is correct and efficient to be a
> greater priority?

On the other hand the case where we would encounter offsets that do not
fit in uint32_t is extremply unlikely in sane repositories.

I can think of three solutions:

1. use 64-bit corrected commit dates
   - advantages:
     * simplest code,
     * no need for overflow handling, as we can store all possible values
       of timestamp_t
   - disadvantages:
     * commit-graph size increased by up to 7%

2. use 32-bit corrected commit date offsets,
   but simply do not store GDAT chunk if there is offset that would not
   fit in 32-bit wide field
   - advantages:
     * commit-graph is smaller
     * relatively simple overflow handling
   - disadvantages:
     * performance penalty (generation number v1 vs v2) for abnormal
       repositories (with overflow not fitting in uint32_t)
     * tests would be needed to exercise the overflow code

3. use 32-bit for corrected commit date offset,
   with oveflow handling, for example using most significant bit
   to denote that other bits store position into offset overflow
   with 64-bits for those offsets that do not fit in 31-bits
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
   - disadvantages:
     * most complex code of all proposed solutions
     * smaller overflow limit of 2^31 - 1
     * tests would be needed to exercise the overflow code

I think because the situation where we encounter overflow in 32-bit
corrected commit date offset is rare, we should go with either 1 or 2
solution.

> I would love to hear your and Dr. Stolee's opinions on this.

I have CC-ed Junio C Hamano to ask for his opinion.

>>> +    * This list can be later modified to store future generation number related
>>> +      data.
>>
>> How can it be later modified?  There is no header, no version number.
>> How would we add another generation number data?
>>
>
> We could modify the graph version in future. Here's how I think it would
> work:
>
> Graph Version 1, No GDAT -> Topological level
> Graph Version 2, GDAT    -> Corrected committer dates
> Graph Version 3, GDAT    -> Generation number v3
>
> and so on.
>
> Of course, we do not have to update generation number definition for
> each graph version.

So it was about generic mechanism, not something specific to the GDAT chunk.

> However, my statement could still be wrong for things that we do not
> foresee (similar to how we missed the hard die on different graph version),
> so I am removing the statement.

Good.

[...]
>>> +We also merge commit-graph chains when we try to write a commit graph with
>>> +two different generation number definitions as they cannot be compared directly.
>>> +We overwrite the existing chain and create a commit-graph with the newer or more
>>> +efficient defintion. For example, overwriting topological levels commit graph
>>> +chain to create a corrected commit dates commit graph chain.
>>> +
>>
>> This is more complicated than that.
>>
>> I think we should explicitly state that Git ensures that in split
>> commit-graph chain, if there are layers without the GDAT chunk (that
>> force Git to use topological levels for generation numbers), then they
>> are top layers.  So if there is commit-graph file created by "Old" Git,
>> then when addig new layer it would also be GDAT-less.
>>
>> Now how to write this...
>
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
>
>   ## Handling Mixed Generation Number Chains
>
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
>
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.
>
>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

Good idea, and nice writeup.

Best,
-- 
Jakub Narębski

  reply	other threads:[~2020-08-27 12:44 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-28  9:13 [PATCH 0/6] [GSoC] Implement Corrected Commit Date Abhishek Kumar via GitGitGadget
2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-07-28 15:28   ` Taylor Blau
2020-07-30  5:24     ` Abhishek Kumar
2020-08-04  0:46   ` Jakub Narębski
2020-08-04  0:56     ` Taylor Blau
2020-08-04 10:10       ` Jakub Narębski
2020-08-04  7:55     ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-07-28 13:00   ` Derrick Stolee
2020-07-28 15:30     ` Taylor Blau
2020-08-05 23:16   ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-07-28 13:14   ` Derrick Stolee
2020-07-28 15:19     ` René Scharfe
2020-07-28 15:58       ` Derrick Stolee
2020-07-28 16:01     ` Taylor Blau
2020-07-30  6:07     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-07-28 16:03   ` Taylor Blau
2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-07-28 16:12   ` Taylor Blau
2020-07-30  6:52     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
2020-07-28 15:55   ` Derrick Stolee
2020-07-28 16:23     ` Taylor Blau
2020-07-30  7:27     ` Abhishek Kumar
2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
2020-07-30  7:47   ` Abhishek Kumar
2020-07-28 16:35 ` Derrick Stolee
2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-10 16:28     ` Derrick Stolee
2020-08-11 11:03       ` Abhishek Kumar
2020-08-11 12:27         ` Derrick Stolee
2020-08-11 18:58           ` Taylor Blau
2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-10 14:23     ` Derrick Stolee
2020-08-14  4:59       ` Abhishek Kumar
2020-08-14 12:24         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
2020-08-10 16:42     ` Derrick Stolee
2020-08-11 11:36       ` Abhishek Kumar
2020-08-11 12:43         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-17 22:30       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-18 14:18       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-19 17:54       ` Jakub Narębski
2020-08-21  4:11         ` Abhishek Kumar
2020-08-25 11:11           ` Jakub Narębski
2020-09-01 11:35             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-17 13:22       ` Derrick Stolee
2020-08-21 11:05       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-21 13:14       ` Jakub Narębski
2020-08-25  5:04         ` Abhishek Kumar
2020-08-25 12:18           ` Jakub Narębski
2020-09-01 12:06             ` Abhishek Kumar
2020-09-03 13:42               ` Jakub Narębski
2020-09-05 17:21                 ` Abhishek Kumar
2020-09-13 15:39                   ` Jakub Narębski
2020-09-28 21:48                     ` Jakub Narębski
2020-10-05  5:25                       ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-08-21 18:43       ` Jakub Narębski
2020-08-25  6:14         ` Abhishek Kumar
2020-08-25  7:33           ` Jakub Narębski
2020-08-25  7:56             ` Jakub Narębski
2020-09-01 10:26               ` Abhishek Kumar
2020-09-03  9:25                 ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-22  0:05       ` Jakub Narębski
2020-08-25  6:49         ` Abhishek Kumar
2020-08-25 10:07           ` Jakub Narębski
2020-09-01 11:01             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-22 13:09       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-08-22 17:14       ` Jakub Narębski
2020-08-26  7:15         ` Abhishek Kumar
2020-08-26 10:38           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-22 19:09       ` Jakub Narębski
2020-09-01 10:08         ` Abhishek Kumar
2020-09-03 19:11           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-22 22:20       ` Jakub Narębski
2020-08-27  6:39         ` Abhishek Kumar
2020-08-27 12:43           ` Jakub Narębski [this message]
2020-08-27 13:15           ` Derrick Stolee
2020-09-01 13:01             ` Abhishek Kumar
2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
     [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
2020-08-17  1:32         ` Fwd: " Taylor Blau
2020-08-17  7:56           ` Jakub Narębski
2020-08-18  6:12       ` Abhishek Kumar
2020-08-23 15:27       ` Jakub Narębski
2020-08-24  2:49         ` Abhishek Kumar
2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-10-24 23:16         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-10-24 23:41         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=85o8mwb6nq.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for the project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git