Re: [RFC] Generation Number v2

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Git List" <git@vger.kernel.org>, "Jeff King" <peff@peff.net>,
	"Derrick Stolee" <dstolee@microsoft.com>
Subject: Re: [RFC] Generation Number v2
Date: Fri, 02 Nov 2018 18:44:39 +0100	[thread overview]
Message-ID: <86a7mrxtko.fsf@gmail.com> (raw)
In-Reply-To: <6902dbff-d9f6-e897-2c20-d0cb47a50795@gmail.com> (Derrick Stolee's message of "Wed, 31 Oct 2018 09:04:12 -0400")

Derrick Stolee <stolee@gmail.com> writes:
> On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Oct 30 2018, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>>
>>>> In contrast, maximum generation numbers and corrected commit
>>>> dates both performed quite well. They are frequently the top
>>>> two performing indexes, and rarely significantly different.
>>>>
>>>> The trade-off here now seems to be: which _property_ is more important,
>>>> locally-computable or backwards-compatible?
>>>
>>> Nice summary.
>>>
>>> As I already said, I personally do not think being compatible with
>>> currently deployed clients is important at all (primarily because I
>>> still consider the whole thing experimental), and there is a clear
>>> way forward once we correct the mistake of not having a version
>>> number in the file format that tells the updated clients to ignore
>>> the generation numbers.  For longer term viability, we should pick
>>> something that is immutable, reproducible, computable with minimum
>>> input---all of which would lead to being incrementally computable, I
>>> would think.
>>
>> I think it depends on what we mean by backwards compatibility. None of
>> our docs are saying this is experimental right now, just that it's
>> opt-in like so many other git-config(1) options.
>>
>> So if we mean breaking backwards compatibility in that we'll write a new
>> file or clobber the existing one with a version older clients can't use
>> as an optimization, fine.
>>
>> But it would be bad to produce a hard error on older clients, but
>> avoiding that seems as easy as just creating a "commit-graph2" file in
>> .git/objects/info/.
>
> Well, we have a 1-byte version number following the "CGPH" header in
> the commit-graph file, and clients will ignore the commit-graph file
> if that number is not "1". My hope for backwards-compatibility was
> to avoid incrementing this value and instead use the unused 8th byte.

How?  Some of the considered new generation numbers were backwards
compatibile in the sense that for graph traversal the old code for
(minimum) generation numbers could be used.  But that is for reading.
If the old client tried to update new generation number using old
generation number algorithm (assuming locality), it would write some
chimaera that may or may not work.

> However, it appears that we are destined to increment that version
> number, anyway. Here is my list for what needs to be in the next
> version of the commit-graph file format:

As a reminder, here is how the commit-graph format looks now [1]

  HEADER:
    4-byte signature:                  CGPH
    1-byte version number:             1
    1-byte Hash Version:               1 = SHA-1
    1-byte number (C) of "chunks":     3 or 4
    1-byte (reserved for later use)    <ignored>
  CHUNK LOOKUP:
    (C + 1) * 12 bytes listing the table of contents for the chunks
  CHUNK DATA:
    OIDF
    OIDL
    CDAT
    EDGE [optional, depends on graph structure]
  TRAILER:
    checksum

[1]: Documentation/technical/commit-graph-format.txt

> 1. A four-byte hash version.

Why do you need a four-byte hash version?  255 possible hash functions
is not enough?

> 2. File incrementality (split commit-graph).

This would probably require a segment lookup part, and one chunk of each
type per segment (or even one chunk lookup table per segment).  It may
or may not need generation number which can support segmenting (here
maximal generation numbers have the advantage); but you can assume that
if segment(A) > segment(B) then reach.ix(A) > reach.ix(B),
i.e. lexicographical-like sort.

> 3. Reachability Index versioning

I assume that you mean here versioning for the reachability index that
is stored in the CDAT section (if it is in a separate chunk, it should
be easy to version).

The easiest thing to do when encountering unknown or not supported
generation number (perhaps the commit-graph file was created in the past
by earlier version of Git, perahs it came from server with newer Git, or
perhaps the same repository is sometimes accessed using newer and
sometimes older Git version), is to set commit->generation_number to
GENERATION_NUMBER_ZERO, as you wrote in [2].

We could encode backward-compatibility (for using) somewhat, but I
wonder how much it would be of use.

[2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec916e@gmail.com/

> Most of these changes will happen in the file header. The chunks
> themselves don't need to change, but some chunks may be added that
> only make sense in v2 commit-graphs.

What I would like to see in v2 commit-graph format would be some kind
convention for naming / categorizing chunks, so that Git would be able
to handle unknown chunks gracefully.

In PNG format there are various types of chunks.  Quoting Wikipedia:

  Chunk types [in PNG format] are given a four-letter case sensitive
  ASCII type/name; compare FourCC. The case of the different letters in
  the name (bit 5 of the numeric value of the character) is a bit field
  that provides the decoder with some information on the nature of
  chunks it does not recognize.

  The case of the first letter indicates whether the chunk is critical
  or not. If the first letter is uppercase, the chunk is critical; if
  not, the chunk is ancillary. Critical chunks contain information that
  is necessary to read the file. If a decoder encounters a critical
  chunk it does not recognize, it must abort reading the file or supply
  the user with an appropriate warning.

Currently this is solved with the format version.  For given format
version some chunks are necessary (critical); if there would be another
type of chunk that must be understood, we can simply increase format
version. 

  The case of the second letter indicates whether the chunk is "public"
  (either in the specification or the registry of special-purpose public
  chunks) or "private" (not standardised). Uppercase is public and
  lowercase is private. This ensures that public and private chunk names
  can never conflict with each other (although two private chunk names
  could conflict).

For Git this could be "official" and "experimental"... though I don't
see people sharing commit-graph files with experimental chunks that can
be used only by some local unpublished fork of Git.

  The third letter must be uppercase to conform to the PNG
  specification. It is reserved for future expansion. Decoders should
  treat a chunk with a lower case third letter the same as any other
  unrecognised chunk.

In short: reserved.  This could be the fate of all of those 4 flags that
we do not need.

  The case of the fourth letter indicates whether the chunk is safe to
  copy by editors that do not recognize it. If lowercase, the chunk may
  be safely copied regardless of the extent of modifications to the
  file. If uppercase, it may only be copied if the modifications have
  not touched any critical chunks.

If the data contained in the chunks is immutable, then it could be
copied when updating commit-graph file... but what to do with new
commits, with what to fill the data for a new commit for a copied
unknown chunk?  All zeros?  All ones?  Some value defined in the chunk
itself?

Best,
-- 
Jakub Narębski

next prev parent reply	other threads:[~2018-11-02 17:44 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
2018-10-29 19:22 ` Stefan Beller
2018-10-29 20:06   ` Derrick Stolee
2018-11-01 20:06   ` Jakub Narebski
2018-11-02  9:30     ` Jakub Narebski
2018-11-03 17:27       ` Jakub Narebski
2018-10-29 20:25 ` Derrick Stolee
2018-11-01 22:13   ` Jakub Narebski
2018-10-30  3:59 ` Junio C Hamano
2018-10-31 12:30   ` Derrick Stolee
2018-11-02 13:33     ` Jakub Narebski
2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
2018-10-31 13:04     ` Derrick Stolee
2018-11-02 17:44       ` Jakub Narebski [this message]
2018-11-01 12:27 ` Jakub Narebski
2018-11-01 13:29   ` Derrick Stolee
2018-11-03 12:33     ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86a7mrxtko.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).