From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Git List" <git@vger.kernel.org>, "Jeff King" <peff@peff.net>,
"Derrick Stolee" <dstolee@microsoft.com>
Subject: Re: [RFC] Generation Number v2
Date: Fri, 02 Nov 2018 18:44:39 +0100
Message-ID: <86a7mrxtko.fsf@gmail.com>
In-Reply-To: <6902dbff-d9f6-e897-2c20-d0cb47a50795@gmail.com> (Derrick Stolee's message of "Wed, 31 Oct 2018 09:04:12 -0400")
Derrick Stolee <stolee@gmail.com> writes:
> On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Oct 30 2018, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>>
>>>> In contrast, maximum generation numbers and corrected commit
>>>> dates both performed quite well. They are frequently the top
>>>> two performing indexes, and rarely significantly different.
>>>>
>>>> The trade-off here now seems to be: which _property_ is more important,
>>>> locally-computable or backwards-compatible?
>>>
>>> Nice summary.
>>>
>>> As I already said, I personally do not think being compatible with
>>> currently deployed clients is important at all (primarily because I
>>> still consider the whole thing experimental), and there is a clear
>>> way forward once we correct the mistake of not having a version
>>> number in the file format that tells the updated clients to ignore
>>> the generation numbers. For longer term viability, we should pick
>>> something that is immutable, reproducible, computable with minimum
>>> input---all of which would lead to being incrementally computable, I
>>> would think.
>>
>> I think it depends on what we mean by backwards compatibility. None of
>> our docs are saying this is experimental right now, just that it's
>> opt-in like so many other git-config(1) options.
>>
>> So if we mean breaking backwards compatibility in that we'll write a new
>> file or clobber the existing one with a version older clients can't use
>> as an optimization, fine.
>>
>> But it would be bad to produce a hard error on older clients, but
>> avoiding that seems as easy as just creating a "commit-graph2" file in
>> .git/objects/info/.
>
> Well, we have a 1-byte version number following the "CGPH" header in
> the commit-graph file, and clients will ignore the commit-graph file
> if that number is not "1". My hope for backwards-compatibility was
> to avoid incrementing this value and instead use the unused 8th byte.
How? Some of the new generation numbers under consideration were
backwards compatible in the sense that the old code for (minimum)
generation numbers could still be used for graph traversal. But that
covers only reading. If an old client tried to update the new generation
numbers using the old generation number algorithm (assuming locality),
it would write some chimaera that may or may not work.
> However, it appears that we are destined to increment that version
> number, anyway. Here is my list for what needs to be in the next
> version of the commit-graph file format:
As a reminder, here is how the commit-graph format looks now [1]:
  HEADER:
    4-byte signature: CGPH
    1-byte version number: 1
    1-byte Hash Version: 1 = SHA-1
    1-byte number (C) of "chunks": 3 or 4
    1-byte (reserved for later use) <ignored>
  CHUNK LOOKUP:
    (C + 1) * 12 bytes listing the table of contents for the chunks
  CHUNK DATA:
    OIDF
    OIDL
    CDAT
    EDGE [optional, depends on graph structure]
  TRAILER:
    checksum
[1]: Documentation/technical/commit-graph-format.txt
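A minimal sketch of how a reader might check the 8-byte header listed
above, written in Python for brevity rather than in Git's C. The
function names parse_header and chunk_lookup_size are made up for this
illustration and are not taken from Git's sources; the only behaviour
assumed is the one described in this thread, namely that a client
ignores the file when the version byte is not 1 and does not look at
the reserved byte.

```python
import struct

SIGNATURE = b"CGPH"
SUPPORTED_VERSION = 1  # clients ignore the file for any other value


def parse_header(data: bytes):
    """Return (hash_version, num_chunks), or None if the file should be ignored.

    Hypothetical helper for illustration only.
    """
    if len(data) < 8:
        return None
    # 4-byte signature, then four single bytes:
    # version, hash version, number of chunks, reserved.
    signature, version, hash_version, num_chunks, _reserved = struct.unpack(
        ">4sBBBB", data[:8]
    )
    if signature != SIGNATURE:
        return None
    if version != SUPPORTED_VERSION:
        return None
    # The reserved byte is deliberately not inspected.
    return hash_version, num_chunks


def chunk_lookup_size(num_chunks: int) -> int:
    """Size in bytes of the chunk lookup table: (C + 1) * 12."""
    return (num_chunks + 1) * 12
```

Because the reserved byte is never inspected, a writer that only
changes that byte would not be rejected by a reader like this one,
while bumping the version byte makes it skip the file entirely.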
> 1. A four-byte hash version.
Why do you need a four-byte hash version? Are 255 possible hash
functions not enough?
> 2. File incrementality (split commit-graph).
This would probably require a segment lookup part, and one chunk of each
type per segment (or even one chunk lookup table per segment). It may
or may not need a generation number that can support segmenting (here
maximal generation numbers have the advantage); but you can assume that
if segment(A) > segment(B) then reach.ix(A) > reach.ix(B),
i.e. a lexicographical-like sort.
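To illustrate the lexicographical-like ordering, here is a small
Python sketch. It assumes (this is not an existing Git structure) that
each commit carries a pair of a segment number and a reachability
index local to that segment, and that segments are ordered so that a
commit can only reach commits in the same or a lower segment. The
names ReachKey and definitely_cannot_reach are hypothetical.

```python
from typing import NamedTuple


class ReachKey(NamedTuple):
    """Hypothetical combined key; tuples compare lexicographically."""
    segment: int      # assumed: position of the segment holding the commit
    local_index: int  # assumed: reachability index within that segment


def definitely_cannot_reach(a: ReachKey, b: ReachKey) -> bool:
    """For two distinct commits: if key(a) <= key(b), a cannot reach b.

    This is the usual negative-cut property of a generation number,
    applied to the combined (segment, local_index) key.
    """
    return a <= b


older = ReachKey(segment=1, local_index=900)
newer = ReachKey(segment=2, local_index=5)
```

The segment number dominates the comparison, so newer sorts above
older even though its local index is much smaller.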
> 3. Reachability Index versioning
I assume that you mean here versioning for the reachability index that
is stored in the CDAT section (if it is in a separate chunk, it should
be easy to version).
The easiest thing to do when encountering an unknown or unsupported
generation number (perhaps the commit-graph file was created in the past
by an earlier version of Git, perhaps it came from a server with a newer
Git, or perhaps the same repository is sometimes accessed using a newer
and sometimes an older Git version), is to set commit->generation_number
to GENERATION_NUMBER_ZERO, as you wrote in [2].
We could encode some backward compatibility (for reading), but I
wonder how much use it would be.
[2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec916e@gmail.com/
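A minimal sketch of that fallback, in Python rather than Git's C. The
set KNOWN_REACH_INDEX_VERSIONS and the function load_generation are
hypothetical names for this illustration; where the reachability index
version would actually be stored is exactly what is still undecided
here.

```python
GENERATION_NUMBER_ZERO = 0

# Hypothetical: the reachability index versions this reader understands.
KNOWN_REACH_INDEX_VERSIONS = {1}


def load_generation(stored_value: int, reach_index_version: int) -> int:
    """Return the value to use for a commit read from the commit-graph.

    If the file was written with a reachability index this reader does
    not understand, treat the commit as having no generation number.
    """
    if reach_index_version not in KNOWN_REACH_INDEX_VERSIONS:
        return GENERATION_NUMBER_ZERO
    return stored_value
```

With this approach an old reader loses the speed-up from the index but
never acts on a value it would misinterpret.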
> Most of these changes will happen in the file header. The chunks
> themselves don't need to change, but some chunks may be added that
> only make sense in v2 commit-graphs.
What I would like to see in the v2 commit-graph format is some kind of
convention for naming / categorizing chunks, so that Git would be able
to handle unknown chunks gracefully.
In PNG format there are various types of chunks. Quoting Wikipedia:
  Chunk types [in PNG format] are given a four-letter case sensitive
  ASCII type/name; compare FourCC. The case of the different letters in
  the name (bit 5 of the numeric value of the character) is a bit field
  that provides the decoder with some information on the nature of
  chunks it does not recognize.

  The case of the first letter indicates whether the chunk is critical
  or not. If the first letter is uppercase, the chunk is critical; if
  not, the chunk is ancillary. Critical chunks contain information that
  is necessary to read the file. If a decoder encounters a critical
  chunk it does not recognize, it must abort reading the file or supply
  the user with an appropriate warning.
Currently this is solved with the format version. For a given format
version some chunks are necessary (critical); if another type of chunk
that must be understood were added, we could simply increase the format
version.
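For comparison, here is a rough Python sketch of how a reader could
apply the PNG first-letter rule to a commit-graph chunk table instead
of relying on the format version alone. KNOWN_CHUNKS lists the four
chunk names from the current format; is_critical and select_chunks are
hypothetical helpers, and b"xtra" / b"XTRA" are invented chunk names
used only to exercise the two cases.

```python
KNOWN_CHUNKS = {b"OIDF", b"OIDL", b"CDAT", b"EDGE"}


def is_critical(chunk_id: bytes) -> bool:
    """Uppercase first letter (bit 5 clear) marks a critical chunk."""
    return (chunk_id[0] & 0x20) == 0


def select_chunks(chunk_ids):
    """Return the chunks this reader will use.

    Unknown ancillary chunks are skipped; an unknown critical chunk
    makes the whole file unusable for this reader.
    """
    usable = []
    for chunk_id in chunk_ids:
        if chunk_id in KNOWN_CHUNKS:
            usable.append(chunk_id)
        elif is_critical(chunk_id):
            raise ValueError(f"unknown critical chunk {chunk_id!r}")
        # Otherwise: unknown ancillary chunk, ignore it.
    return usable
```

Under such a convention all four existing chunk names would count as
critical, since they are written in uppercase.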
  The case of the second letter indicates whether the chunk is "public"
  (either in the specification or the registry of special-purpose public
  chunks) or "private" (not standardised). Uppercase is public and
  lowercase is private. This ensures that public and private chunk names
  can never conflict with each other (although two private chunk names
  could conflict).
For Git this could be "official" and "experimental"... though I don't
see people sharing commit-graph files with experimental chunks that can
be used only by some local unpublished fork of Git.
  The third letter must be uppercase to conform to the PNG
  specification. It is reserved for future expansion. Decoders should
  treat a chunk with a lower case third letter the same as any other
  unrecognised chunk.
In short: reserved. This could be the fate of all of those 4 flags that
we do not need.
  The case of the fourth letter indicates whether the chunk is safe to
  copy by editors that do not recognize it. If lowercase, the chunk may
  be safely copied regardless of the extent of modifications to the
  file. If uppercase, it may only be copied if the modifications have
  not touched any critical chunks.
If the data contained in the chunks is immutable, then it could be
copied when updating the commit-graph file... but what to do with new
commits? What should the data for a new commit be filled with in a
copied unknown chunk? All zeros? All ones? Some value defined in the
chunk itself?
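The four case bits from the Wikipedia description can be decoded
mechanically. The following Python sketch shows one way to do it;
ChunkFlags and decode_chunk_flags are names invented for this example,
and nothing like this exists in the commit-graph code today.

```python
from typing import NamedTuple


class ChunkFlags(NamedTuple):
    critical: bool      # 1st letter uppercase
    public: bool        # 2nd letter uppercase
    reserved_ok: bool   # 3rd letter uppercase, as PNG requires
    safe_to_copy: bool  # 4th letter lowercase


def decode_chunk_flags(chunk_id: bytes) -> ChunkFlags:
    """Decode the PNG-style property bits of a four-letter chunk name."""
    if len(chunk_id) != 4:
        raise ValueError("chunk name must be exactly four bytes")
    # Bit 5 (0x20) of an ASCII letter is set for lowercase letters.
    lower = [bool(byte & 0x20) for byte in chunk_id]
    return ChunkFlags(
        critical=not lower[0],
        public=not lower[1],
        reserved_ok=not lower[2],
        safe_to_copy=lower[3],
    )
```

For example, the existing name CDAT would decode as critical, public
and not safe to copy, while PNG's own tEXt chunk decodes as ancillary,
public and safe to copy.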
Best,
--
Jakub Narębski