From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_PASS, SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id B12211F66E for ; Thu, 27 Aug 2020 12:44:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729077AbgH0Moj (ORCPT ); Thu, 27 Aug 2020 08:44:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41456 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729120AbgH0Mn5 (ORCPT ); Thu, 27 Aug 2020 08:43:57 -0400 Received: from mail-ej1-x641.google.com (mail-ej1-x641.google.com [IPv6:2a00:1450:4864:20::641]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CC18FC061264 for ; Thu, 27 Aug 2020 05:43:56 -0700 (PDT) Received: by mail-ej1-x641.google.com with SMTP id d26so7488791ejr.1 for ; Thu, 27 Aug 2020 05:43:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version:content-transfer-encoding; bh=7ZH/v/829gGGMtlpMYvVr2fDTJndn/5i/iVxlZDQWAs=; b=rxYvL7eIBDmVGSdFlOih9Co5Agtwkm+x/S/x8F8Uo57gc1WzOYosjJMHMaXMGlJUtr L/T7S8lllR/RImcnVuxqm/+CELzk0cRxtzGD3fo3kzS+5SOLkEN57bDlW3Hy0MIheI8a h8xwe5cUDXOC8hVIxuupseZfI/Hl8/T4+vcZbSXrE85tPg1Tm3tmmBKs57tKoEIJAkXt sHDF801rrttKV/dWxsyQ41GIgC1XVN3swkr+M0NdaXlNPO9plIxnNCPinEXSIaoun76O 5OSTTGj6roPfdvMB4yDcth0Qqfyqk/31qPWVB+8+I2APlZXcM4hHQcZRSifXd3n/NwGp FjnA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version:content-transfer-encoding; bh=7ZH/v/829gGGMtlpMYvVr2fDTJndn/5i/iVxlZDQWAs=; b=sm3q/XkSWsD2EYETpK2xMIyqxaBbTtJRP0EEZdZ9QP79dcUi3OB2TrwY9GH7lGDoEq krRx3gZRy3gYLvlXmK7vaVWcJ6qhP9iQSjSiV3KGw+UYzWxxMVAh0qbn9bChaAHhCO82 275hfzBQdapEKKFc6sjzUXTp6urpl3J+6T6vEv8hytjVtpbLBxbofuF1lFO/sY91pqyR GOi1mu0qSRMxYXpcMrnOHih8wmhUTPGeRXf7kUfSUbUcArbw1kuq1gXuLNdHxZx9CXsG hT0SdH/nxoXKYsoC8y5eGsrGW/tft8tMuAXb5DX94spFd54oO+KO9W9DiRDJzEz1t2xR GP1w== X-Gm-Message-State: AOAM532qtAzWBWNemux8tWzcZeM/Fu2/W+bIKQgmKz99Wreg2nX6EOSp HvewysXJ70uNpD1M/fcgFvM= X-Google-Smtp-Source: ABdhPJy7sK+kRF7O3U63SKcsVDq6jf6PiueRiXf23eWhoD9I0gKt28cPepzCcQYbks3N6RuQ9W25Hg== X-Received: by 2002:a17:907:72cd:: with SMTP id du13mr13466420ejc.67.1598532235338; Thu, 27 Aug 2020 05:43:55 -0700 (PDT) Received: from LAPTOP-ACER-ASPIRE-F5 (host-89-229-7-83.dynamic.mm.pl. [89.229.7.83]) by smtp.gmail.com with ESMTPSA id yh29sm1858939ejb.0.2020.08.27.05.43.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Aug 2020 05:43:54 -0700 (PDT) From: jnareb@gmail.com (Jakub =?utf-8?Q?Nar=C4=99bski?=) To: Abhishek Kumar Cc: git@vger.kernel.org, Abhishek Kumar via GitGitGadget , Derrick Stolee , Taylor Blau , Jakub =?utf-8?Q?Nar=C4=99bski?= , Junio C Hamano Subject: Re: [PATCH v3 11/11] doc: add corrected commit date info References: <85y2m6fhkm.fsf@gmail.com> <20200827063951.GA16268@Abhishek-Arch> Date: Thu, 27 Aug 2020 14:43:53 +0200 In-Reply-To: <20200827063951.GA16268@Abhishek-Arch> (Abhishek Kumar's message of "Thu, 27 Aug 2020 12:09:51 +0530") Message-ID: <85o8mwb6nq.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (windows-nt) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Hello, Abhishek Kumar writes: > On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Nar=C4=99bski wrote: >> "Abhishek Kumar via GitGitGadget" writes: [...] >>> + * This list of 4-byte values store corrected commit date offsets f= or the >>> + commits, arranged in the same order as commit data chunk. >> >> I have just realized purely theoretical, but possible, problem with >> storing non-monotinic generation number related values like corrected >> commit date offset in constrained space. There are problems with >> clamping them. >> >> Say that somewhere in the ancestry chain there is a commit A with commit >> date far in the future by mistake, for example 2120-08-22; it is >> important for that date to be not able to be represented using uint32_t. >> Say that a later descendant commit B is malformed, and has committer >> date of 0, that is 1970-01-01. This means that the corrected commit date >> for B must be larger than 2120-08-22 - which for this commit means that >> corrected commit date offset do not fit in 32 bits, and must be clamped >> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX. >> >> Say that we have commit C that is child of B, and it has correct commit >> date. Because of mistake in commit A, it has corrected commit date of >> more than 2120-08-22 (corrected commit date degenerated into topological >> level plus constant). >> >> Now C can reach B, and B can reach A. However, if we recover corrected >> commit date of B out of its date=3D0 and offset=3DGENERATION_NUMBER_V2_O= FFSET_MAX >> we get a number that is smaller than correct corrected commit date. We >> will have >> >> gen(A) > date(B) + offset(B) < gen(C) >> >> Which breaks reachability condition guarantee. >> >> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped >> corrected commit date, that is offset=3DGENERATION_NUMBER_V2_OFFSET_MAX, >> we would get >> >> gen(A) < GENERATION_NUMBER_V2_MAX > gen(C) >> >> And again reachability condition is broken. >> >> This is a very contrived but possible example. This shouldn't happen, >> but ufortunately it can happen. >> > > Yes, that's very unfortunate. > > Here's a much simpler example: > > A commit P has an reasonable commit date (i.e. after release of Git to > present) D and has a child commit C with committer date 0. Now, the > corrected commiter date of C would D + 1 and the offset would be same too, > as the committer date is zero. This overflows as reasonable dates are of > the order 2 ^ 34. No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value, so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07), which is related to storing date as a _signed_ 32-bit integer The current-ish Unix epoch time is 1598524281 - let's use it for value of D. Then the offset for commit C would be 1598524282. The current proposal uses 32 bits to store commit date offsets (as unsigned value). The maximum value of offset that we can store is therefore 2^32 - 1, which is 4294967295. corrected commit date offset(C) =3D 1,598,524,282 GENERATION_NUMBER_V2_MAX =3D 4,294,967,295 As you can see there is no overflow in the simplified example. >> >> The question is how to deal with this issue. Ignore it as unlikely? >> Switch to storing corrected commit date, which is monotonic, so if there >> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant >> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7% >> larger commit-graph file? >> > > To be honest, I would prefer storing corrected committer dates over > storing offsets. > > While it is 7% of the size of commit-graph file, it is also *only* around > ~3.5 MB for a repository of the size of linux kernel (and IIRC > correctly, the Windows repo has ~2M commits, it amounts to ~8 MB). It is up to 7% of per-commit data, and it doesn't take into account EDGE chunk (for octopus merges), and it doesn't also take into account the size of changed-paths Bloom filters data take in the commit-graph. > Minimizing space and memory requirements are a top priority, but > shouldn't making sure our program is correct and efficient to be a > greater priority? On the other hand the case where we would encounter offsets that do not fit in uint32_t is extremply unlikely in sane repositories. I can think of three solutions: 1. use 64-bit corrected commit dates - advantages: * simplest code, * no need for overflow handling, as we can store all possible values of timestamp_t - disadvantages: * commit-graph size increased by up to 7% 2. use 32-bit corrected commit date offsets, but simply do not store GDAT chunk if there is offset that would not fit in 32-bit wide field - advantages: * commit-graph is smaller * relatively simple overflow handling - disadvantages: * performance penalty (generation number v1 vs v2) for abnormal repositories (with overflow not fitting in uint32_t) * tests would be needed to exercise the overflow code 3. use 32-bit for corrected commit date offset, with oveflow handling, for example using most significant bit to denote that other bits store position into offset overflow with 64-bits for those offsets that do not fit in 31-bits - advantages: * commit-graph is smaller, increasing for abnormal repos - disadvantages: * most complex code of all proposed solutions * smaller overflow limit of 2^31 - 1 * tests would be needed to exercise the overflow code I think because the situation where we encounter overflow in 32-bit corrected commit date offset is rare, we should go with either 1 or 2 solution. > I would love to hear your and Dr. Stolee's opinions on this. I have CC-ed Junio C Hamano to ask for his opinion. >>> + * This list can be later modified to store future generation numbe= r related >>> + data. >> >> How can it be later modified? There is no header, no version number. >> How would we add another generation number data? >> > > We could modify the graph version in future. Here's how I think it would > work: > > Graph Version 1, No GDAT -> Topological level > Graph Version 2, GDAT -> Corrected committer dates > Graph Version 3, GDAT -> Generation number v3 > > and so on. > > Of course, we do not have to update generation number definition for > each graph version. So it was about generic mechanism, not something specific to the GDAT chunk. > However, my statement could still be wrong for things that we do not > foresee (similar to how we missed the hard die on different graph version= ), > so I am removing the statement. Good. [...] >>> +We also merge commit-graph chains when we try to write a commit graph = with >>> +two different generation number definitions as they cannot be compared= directly. >>> +We overwrite the existing chain and create a commit-graph with the new= er or more >>> +efficient defintion. For example, overwriting topological levels commi= t graph >>> +chain to create a corrected commit dates commit graph chain. >>> + >> >> This is more complicated than that. >> >> I think we should explicitly state that Git ensures that in split >> commit-graph chain, if there are layers without the GDAT chunk (that >> force Git to use topological levels for generation numbers), then they >> are top layers. So if there is commit-graph file created by "Old" Git, >> then when addig new layer it would also be GDAT-less. >> >> Now how to write this... > > Thinking about this, I feel creating a new section called "Handling > Mixed Generation Number Chains" made more sense: > > ## Handling Mixed Generation Number Chains > > With the introduction of generation number v2 and generation data chunk, > the following scenario is possible: > > 1. "New" Git writes a commit-graph with a GDAT chunk. > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. > > The commits in the lower layer will be interpreted as having very large > generation values (commit date plus offset) compared to the generation > numbers in the top layer (toplogical level). This violates the > expectation that the generation of a parent is strictly smaller than the > generation of a child. In such cases, we revert to using topological > levels for all layers to maintain backwards compatability. > > When writing a new layer in split commit-graph, we write a GDAT chunk > only if the topmost layer has a GDAT chunk. This guarantees that if a > lyer has GDAT chunk, all lower layers must have a GDAT chunk as well. > > Rewriting layers follows similar approach: if the topmost layer below > set of layers being rewriteen (in the split commit-graph chain) exists, > and it does not contain GDAT chunk, then the result of rewrite does not > have GDAT chunks either. Good idea, and nice writeup. Best, --=20 Jakub Nar=C4=99bski