git@vger.kernel.org mailing list mirror (one of many)
From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Abhishek Kumar <abhishekkumar8222@gmail.com>,
	git@vger.kernel.org,
	Christian Couder <christian.couder@gmail.com>
Subject: Re: [RFC][GSoC] Implement Generation Number v2
Date: Tue, 24 Mar 2020 10:24:52 +0100
Message-ID: <86eetiqf4r.fsf@gmail.com>
In-Reply-To: <13995fbd-d645-56aa-b647-e9a51d00554e@gmail.com> (Derrick Stolee's message of "Mon, 23 Mar 2020 11:54:07 -0400")

Derrick Stolee <stolee@gmail.com> writes:
> On 3/23/2020 9:43 AM, Jakub Narebski wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>>> Jakub Narębski <jnareb@gmail.com> writes:
>> [...]
>> Proposed solutions are:
>>  - metadata / versioning chunk,
>>  - flag file: `.git/info/generation-number-v2`,
>>  - new chunk for commit data: "CDA2".
>> 
>> I would like to propose yet another solution: putting generation number
>> v2 data in a separate chunk (and possibly keeping generation number v1
>> in CDAT commit data chunk).  In this case we could even use ordinary
>> corrected commit date as generation number v2 (storing offsets as 32-bit
>> unsigned values), instead of backward-compatible corrected commit date
>> with monotonic offsets.
>
> I agree that if we are creating a new (optional) chunk, then that gets
> around our versioning issues and could store just the offsets to get
> the "corrected commit date" option instead of the backwards-compatible one.
> By including yet another version number at the beginning of that chunk,
> we could present a way to update this "second reachability index chunk"
> with things like your interval mechanism with very little cost.

All this may be just a transitory phase, waiting until Git versions that
fail hard on commit-graph format version change die out... then we will
be able to use version number as intended (though it has the
disadvantage of turning off commit-graph wholly for older Git versions).

>> Each solution has its advantages and disadvantages.
>> 
>> 
>> With the flag file, the problem is (as Junio noticed) that if file gets
>> accidentally deleted, new Git would think incorrectly that commit-graph
>> uses generation number v1... which while suboptimal should not be bad
>> thanks to backward compatibility.  But I think the flag file should have
>> some kind of checksum as its contents (perhaps simply a copy of
>> commit-graph file checksum, or one checksum per file in chain with
>> incremental commit-graph), so that it old Git rewrites commit-graph file
>> leaving flag file present, new Git would notice this.
>
> I'm not a fan of the flag file idea. Optional chunks are a good way forward.
> That _could_ mean the metadata chunk, whose length can grow in the future
> if/when we add more fixed-width metadata values.

So it looks like metadata / versioning chunk would be the best solution,
wouldn't it?
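
For comparison, the checksum idea for the flag file quoted above could be
sketched roughly as below.  This is only an illustration under stated
assumptions: the commit-graph file is modelled as "content followed by its
trailing checksum", and `write_graph`, `flag_for` and `uses_gen_v2` are
hypothetical names, not anything in Git.

```python
import hashlib
from typing import Optional

HASH_LEN = 20  # SHA-1 trailer size (assumption; SHA-256 repos would need 32)


def write_graph(body: bytes) -> bytes:
    """Stand-in for a commit-graph file: content followed by its checksum."""
    return body + hashlib.sha1(body).digest()


def flag_for(graph: bytes) -> bytes:
    """Hypothetical flag-file contents: a copy of the graph's trailing checksum."""
    return graph[-HASH_LEN:]


def uses_gen_v2(graph: bytes, flag: Optional[bytes]) -> bool:
    """Trust the flag only if it was written for exactly this graph file."""
    return flag is not None and flag == graph[-HASH_LEN:]


# New Git writes the graph and the flag; old Git later rewrites the graph
# but leaves the now-stale flag file in place.
graph = write_graph(b"written by new Git")
flag = flag_for(graph)
rewritten = write_graph(b"rewritten by old Git")
print(uses_gen_v2(graph, flag), uses_gen_v2(rewritten, flag), uses_gen_v2(graph, None))
```

A stale or missing flag simply makes the reader fall back to treating the
data as generation number v1.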

>> Metadata or versioning chunk cannot be deleted by mistake; if old Git
>> copies unknown chunks to new updated commit-graph file instead of
>> skipping them we would need to add some kind of checksum (similarly to
>> the case for flag file).  The problem to be solved is what to do if some
>> files in the chain of commit-graph files have v2 (and this chunk),
>> and some have v1 generation number (and do not have this chunk).
>
> The incremental commit-graph format is newer than our previous tests
> for generation number v2, which will be a big reason why that old code
> cannot be immediately adapted here.
>
> The simplest thing to do is usually right: if we try to write a
> generation number version that doesn't match the current commit-graph,
> then we need to flatten the entire chain into one layer and recompute
> the values from scratch. While it is _technically_ possible to mix
> the backwards-compatible corrected commit date with generation number
> v1, it requires taking the "lowest version" when doing comparisons and
> that may behave very strangely. Better to avoid that complication.

Right.
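
The rule quoted above (flatten the chain whenever the version to be written
does not match what is already there) could be sketched as follows.  The
function name and the idea of representing each layer by a plain version
number are assumptions made for illustration only.

```python
from typing import List


def plan_write(chain_versions: List[int], new_version: int) -> str:
    """Decide how to add data to a chain of commit-graph layers.

    chain_versions: generation number version of each existing layer,
                    base layer first (hypothetical representation).
    new_version:    version the writing Git wants to store.
    """
    # Appending is safe only if every existing layer already uses the
    # version we are about to write; this also covers an empty chain.
    if all(v == new_version for v in chain_versions):
        return "append"
    # Otherwise merge all layers into one and recompute from scratch,
    # so that no chain ever mixes v1 and v2 values.
    return "flatten"


print(plan_write([1, 1], 1), plan_write([1, 1], 2))
```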

>> About moving commit data with generation number v2 to "CDA2" chunk: if
>> "CDAT" chunk is missing then (I think) old Git would simply not use
>> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
>> chunk has zero length... I don't know what would happen then, possibly
>> also old Git would simply not use commit-graph data at all.
>
> CDAT is required as it contains more than just generation numbers. It
> has the commit date, parent int-ids, and root tree oid. The generation
> numbers _could_ be left as all zeroes, which is a special case for the
> format before generation numbers were introduced, but it would be better
> to have values there.

I think (but I might be wrong) that the most expensive part of
calculating generation numbers is actually walking the commits.  Both
generation number v1 (topological level) and generation number v2
(corrected committer date, with or without monotonic restriction on
offsets) can be computed at the same time, during the same walk, so
computing both would possibly cost negligibly more than computing a
single generation number.

But this should perhaps be benchmarked.
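
A minimal sketch of computing both values in one walk, assuming a small
in-memory DAG given as plain dictionaries (the function name and the data
layout are made up for illustration; this is not how Git stores commits):

```python
from graphlib import TopologicalSorter


def compute_generations(parents, dates):
    """Return (topological level, corrected commit date) for every commit.

    parents: dict mapping commit -> list of its parents
    dates:   dict mapping commit -> committer timestamp
    """
    level = {}
    corrected = {}
    # static_order() yields a commit only after all of its parents,
    # so both values of every parent are known when a commit is visited.
    for commit in TopologicalSorter(parents).static_order():
        ps = parents.get(commit, [])
        # v1: one more than the largest level among parents (roots get 1).
        level[commit] = 1 + max((level[p] for p in ps), default=0)
        # v2: commit date, bumped to exceed every parent's corrected date.
        floor = max((corrected[p] + 1 for p in ps), default=dates[commit])
        corrected[commit] = max(dates[commit], floor)
    return level, corrected


# B and C have skewed clocks: they are dated earlier than their ancestor A.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A", "C"]}
dates = {"A": 100, "B": 90, "C": 95, "D": 200}
level, corrected = compute_generations(parents, dates)
print(level)
print(corrected)
```

Here the extra work for the second index is a single `max()` per commit,
which is the kind of overhead a benchmark would need to confirm as
negligible on real repositories.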

>> Putting generation number v2 into separate chunk (which might be called
>> "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
>> size of the commit graph, and possibly also increasing memory
>> consumption (the latter depends on how it would be handled), but has the
>> advantage of being fully backward compatible.  Old Git would simply
>> use generation numbers v1 in "CDAT", new Git would use generation
>> numbers v2 in "OFFS" (combining commit creation date from "CDAT" and
>> offset from "OFFS"), and there should be no problems with updating
>> commit-graph file (either rewriting, or adding new commit-graph to the
>> chain).
>
> I share these concerns, but I am also concerned about the locality of
> the data within the file.
> As we parse commits, we need the parent and commit date information out
> of the CDAT chunk anyway, so it is not difficult to grab the nearby
> generation number. If we put that data further away in a separate chunk,
> then it can be more expensive to flip between the CDAT chunk and the
> GEN2 chunk.

Right, I forgot about this issue, that Git is lazily parsing
commit-graph data, so keeping all [possible] commit data close is a good
idea from the performance point of view.

So it looks like metadata / versioning chunk would be the best solution
to the backward-compatibility interoperability problem.
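
For reference, the separate-chunk variant discussed above (commit date from
"CDAT" plus a 32-bit unsigned offset from an "OFFS"-like chunk) could look
roughly like this.  The chunk layout, one big-endian 4-byte offset per
commit in the same order as the commit data, is an assumption made for the
sketch, as are the function names.

```python
import struct


def pack_offsets(dates, corrected):
    """Build a hypothetical offsets chunk: one big-endian uint32 per commit."""
    offsets = [c - d for d, c in zip(dates, corrected)]
    if any(o < 0 or o > 0xFFFFFFFF for o in offsets):
        raise ValueError("offset does not fit in 32 unsigned bits")
    return struct.pack(f">{len(offsets)}I", *offsets)


def corrected_date(dates, chunk, pos):
    """Recombine the commit date with the stored offset for commit number pos."""
    (offset,) = struct.unpack_from(">I", chunk, 4 * pos)
    return dates[pos] + offset


dates = [100, 90, 95, 200]        # commit dates, as they would come from CDAT
corrected = [100, 101, 102, 200]  # corrected commit dates to be stored
chunk = pack_offsets(dates, corrected)
print(len(chunk), corrected_date(dates, chunk, 1))
```

Note that every lookup touches two places (the date and the offset), which
is exactly the locality cost mentioned above.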

> In terms of your prototyping for performance checks, it may be good to
> have a number of "GEN<X>" chunks so you can compute one commit-graph
> that stores all of the candidate reachability indexes, then use one
> of the chunks based on a config value or environment variable. I think
> that would only be appropriate for testing if you are evaluating which
> to build, so if you are focusing entirely on backwards-compatible
> corrected commit date, this is not worth spending time on.

Right, but it looks like there is nobody taking new labelings for GSoC
2020.

Good idea for prototyping, true.

Best,
-- 
Jakub Narębski

Thread overview: 12+ messages
2020-03-22  9:35 [RFC][GSoC] Implement Generation Number v2 Abhishek Kumar
2020-03-22 20:05 ` Jakub Narebski
2020-03-23  4:25   ` Abhishek Kumar
2020-03-23  5:32     ` Junio C Hamano
2020-03-23 11:32       ` Abhishek Kumar
2020-03-23 13:43       ` Jakub Narebski
2020-03-23 15:54         ` Derrick Stolee
2020-03-24  9:24           ` Jakub Narebski [this message]
2020-03-23 16:04         ` Junio C Hamano
2020-03-24 15:44           ` Jakub Narebski
2020-03-24 21:13             ` Junio C Hamano
2020-03-26 10:15         ` [GSoC][Proposal v2] " Abhishek Kumar
