topological index field for commit objects

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* topological index field for commit objects
@ 2016-06-29 18:31 Marc Strapetz
  2016-06-29 18:59 ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Marc Strapetz @ 2016-06-29 18:31 UTC (permalink / raw)
  To: git

This is no RFE but rather recurring thoughts whenever I'm working with 
commit graphs: a topological index attribute for commit objects would be 
incredible useful. By "topological index" I mean a simple integer for 
which following condition holds true:

if commit C is part of the history of commit D,
   then C's topological index is smaller than D's index

This would allow topological sorting of commits (e.g. in queues) on the 
fly and quickly give a "no" answer on the question whether D is part of 
C's history.

-Marc

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 18:31 topological index field for commit objects Marc Strapetz
@ 2016-06-29 18:59 ` Junio C Hamano
  2016-06-29 20:20   ` Stefan Beller
  2016-06-29 21:00   ` Jakub Narębski
  0 siblings, 2 replies; 31+ messages in thread
From: Junio C Hamano @ 2016-06-29 18:59 UTC (permalink / raw)
  To: Marc Strapetz; +Cc: Git Mailing List

On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
<marc.strapetz@syntevo.com> wrote:
> This is no RFE but rather recurring thoughts whenever I'm working with
> commit graphs: a topological index attribute for commit objects would be
> incredible useful. By "topological index" I mean a simple integer for which
> following condition holds true:

Look for "generation numbers" in the list archive, perhaps?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 18:59 ` Junio C Hamano
@ 2016-06-29 20:20   ` Stefan Beller
  2016-06-29 20:39     ` Junio C Hamano
  2016-06-29 21:00   ` Jakub Narębski
  1 sibling, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2016-06-29 20:20 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King, Linus Torvalds; +Cc: Marc Strapetz, Git Mailing List

On Wed, Jun 29, 2016 at 11:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
> On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
> <marc.strapetz@syntevo.com> wrote:
>> This is no RFE but rather recurring thoughts whenever I'm working with
>> commit graphs: a topological index attribute for commit objects would be
>> incredible useful. By "topological index" I mean a simple integer for which
>> following condition holds true:
>
> Look for "generation numbers" in the list archive, perhaps?

Thanks for the pointer to the interesting discussions.

In http://www.spinics.net/lists/git/msg161363.html
Linus wrote in a discussion with Jeff:

> Right now, we do *have* a "generation number". It's just that it's
> very easy to corrupt even by mistake. It's called "committer date". We
> could improve on it.

Would it make sense to refuse creating commits that have a commit date
prior to its parents commit date (except when the user gives a
`--dammit-I-know-I-break-a-wildy-used-heuristic`)?

With the proposal to refuse an earlier committer date we would avoid
the corruption by mistake and only allow for corruption by actual
users intention. And then going forward we could rely more on the
committer date than we do today (i.e. some algorithms can go faster)?
For that we'd
 * need to document, what stuff actually breaks if the user overwrites
    the committer date to be earlier
 * and add a switch `--go-slow-because-dates-are-broken`

This seems to be the easiest fix in 2016 to me as that would require
to make just a very small change in commit creation, and we can postpone
the second part (relying even more on dates) until someone wants to fix it?

Thanks,
Stefan

> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:20   ` Stefan Beller
@ 2016-06-29 20:39     ` Junio C Hamano
  2016-06-29 20:54       ` Stefan Beller
                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Junio C Hamano @ 2016-06-29 20:39 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Jeff King, Linus Torvalds, Marc Strapetz, Git Mailing List

Stefan Beller <sbeller@google.com> writes:

> On Wed, Jun 29, 2016 at 11:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
>> <marc.strapetz@syntevo.com> wrote:
>>> This is no RFE but rather recurring thoughts whenever I'm working with
>>> commit graphs: a topological index attribute for commit objects would be
>>> incredible useful. By "topological index" I mean a simple integer for which
>>> following condition holds true:
>>
>> Look for "generation numbers" in the list archive, perhaps?
>
> Thanks for the pointer to the interesting discussions.
>
> In http://www.spinics.net/lists/git/msg161363.html
> Linus wrote in a discussion with Jeff:
>
>> Right now, we do *have* a "generation number". It's just that it's
>> very easy to corrupt even by mistake. It's called "committer date". We
>> could improve on it.
>
> Would it make sense to refuse creating commits that have a commit date
> prior to its parents commit date (except when the user gives a
> `--dammit-I-know-I-break-a-wildy-used-heuristic`)?

I think that has also been discussed in the past.  I do not think it
would help very much in practice, as projects already have up to 10
years (and the ones migrated from CVS, even more) worth of commits
they cannot rewrite that may record incorrect committer dates.
You'd need something like "you can trust committer dates that are
newer that this date" per project to switch between slow path and
fast path, with an updated fsck that knows how to compute that
number after you pulled from somebody who used that overriding
option.

If the use of generation number can somehow be limited narrowly, we
may be able to incrementally introduce it only for new commits, but
I haven't thought things through, so let me do so aloud here ;-)

Suppose we use it only for this purpose:

 * When we have two commits, C1 and C2, with generation numbers G1
   and G2, we can say "C1 cannot possibly be an ancestor of C2" if
   G1 > G2.  We cannot say anything else based on generation
   numbers (or lack thereof).

then I think we could just say "A newly created commit must record
generation number G that is larger than generation numbers of its
parent commits; ignore parents that lack generation number for the
purpose of this sentence".

I am not sure if that limited use is all that useful, though.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:39     ` Junio C Hamano
@ 2016-06-29 20:54       ` Stefan Beller
  2016-06-29 21:37         ` Stefan Beller
  2016-06-29 20:56       ` Jeff King
  2016-06-29 22:15       ` Marc Strapetz
  2 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2016-06-29 20:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Linus Torvalds, Marc Strapetz, Git Mailing List

On Wed, Jun 29, 2016 at 1:39 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> On Wed, Jun 29, 2016 at 11:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
>>> <marc.strapetz@syntevo.com> wrote:
>>>> This is no RFE but rather recurring thoughts whenever I'm working with
>>>> commit graphs: a topological index attribute for commit objects would be
>>>> incredible useful. By "topological index" I mean a simple integer for which
>>>> following condition holds true:
>>>
>>> Look for "generation numbers" in the list archive, perhaps?
>>
>> Thanks for the pointer to the interesting discussions.
>>
>> In http://www.spinics.net/lists/git/msg161363.html
>> Linus wrote in a discussion with Jeff:
>>
>>> Right now, we do *have* a "generation number". It's just that it's
>>> very easy to corrupt even by mistake. It's called "committer date". We
>>> could improve on it.
>>
>> Would it make sense to refuse creating commits that have a commit date
>> prior to its parents commit date (except when the user gives a
>> `--dammit-I-know-I-break-a-wildy-used-heuristic`)?
>
> I think that has also been discussed in the past.

I should have guessed that and tried to find it.

> I do not think it
> would help very much in practice, as projects already have up to 10
> years (and the ones migrated from CVS, even more) worth of commits
> they cannot rewrite that may record incorrect committer dates.

Chances are that the 10 years of history may be correct time wise as long
as people don't introduce a bad date malevolently.

> You'd need something like "you can trust committer dates that are
> newer that this date" per project

and git version as old versions of git can still be used later.

> to switch between slow path and
> fast path, with an updated fsck that knows how to compute that
> number after you pulled from somebody who used that overriding
> option.

Well you could have a project setting (`config.sortedDates`)
that is automatically computed once when cloning a project and
depending on that setting you can go the fast path.

Additionally when that setting is set, you'd enforce the correct
dates in committing and merging (read pulling) to carry it on to
be true.

>
> If the use of generation number can somehow be limited narrowly, we
> may be able to incrementally introduce it only for new commits, but
> I haven't thought things through, so let me do so aloud here ;-)

I think of it as "committer date == generation number", i.e. just a
special form of noting the generation number with gaps in between.
This loses the ability to know how many commits there are at maximum
between 2 given "numbers" though, which I think is minor.

>
> Suppose we use it only for this purpose:
>
>  * When we have two commits, C1 and C2, with generation numbers G1
>    and G2, we can say "C1 cannot possibly be an ancestor of C2" if
>    G1 > G2.  We cannot say anything else based on generation
>    numbers (or lack thereof).
>
> then I think we could just say "A newly created commit must record
> generation number G that is larger than generation numbers of its
> parent commits; ignore parents that lack generation number for the
> purpose of this sentence".
>
> I am not sure if that limited use is all that useful, though.

I did *not* propose to introduce the generation number, but
rather meant:
* we already have committer date
* it works pretty well
* only a tiny patch is required to tighten the heuristic to work even better
  (going forward) by avoiding accidents in the history that have
  a committer date earlier than their parents.
* we postpone drastic changes (i.e. introduction of generation
  numbers or change of algorithms) for now.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:39     ` Junio C Hamano
  2016-06-29 20:54       ` Stefan Beller
@ 2016-06-29 20:56       ` Jeff King
  2016-06-29 21:49         ` Jakub Narębski
  2016-06-29 22:15       ` Marc Strapetz
  2 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2016-06-29 20:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Stefan Beller, Linus Torvalds, Marc Strapetz, Git Mailing List

On Wed, Jun 29, 2016 at 01:39:17PM -0700, Junio C Hamano wrote:

> > Would it make sense to refuse creating commits that have a commit date
> > prior to its parents commit date (except when the user gives a
> > `--dammit-I-know-I-break-a-wildy-used-heuristic`)?
> 
> I think that has also been discussed in the past.  I do not think it
> would help very much in practice, as projects already have up to 10
> years (and the ones migrated from CVS, even more) worth of commits
> they cannot rewrite that may record incorrect committer dates.

Yep, it has been discussed and I agree it runs into a lot of corner
cases.

> If the use of generation number can somehow be limited narrowly, we
> may be able to incrementally introduce it only for new commits, but
> I haven't thought things through, so let me do so aloud here ;-)

I think the problem is that you really _do_ want generation numbers for
old commits. One of the most obvious cases is something like "tag
--contains HEAD", because it has to examine older tags.

So your history looks something like:

  A -- B -- ... Z
        \        \
	 v1.0     HEAD

Without generation numbers (or some proxy), you have to walk the history
between B..Z to find the answer. With generation numbers, it is
immediately obvious.

So this is the ideal case for generation numbers (the worst cases are
when the things you are looking for are in branchy, close history where
the generation numbers don't tell you much; but in such cases the
walking is usually not too bad).

So I think you really do want to be able to generate and store
generation numbers after the fact. That has an added bonus that you do
not have to worry about baking incorrect values into your objects; you
do the topological walk once, and you _know_ it is correct (at least as
correct as the parent links, but that is our source of truth).

I have patches that generate and store the numbers at pack time, similar
to the way we do the reachability bitmaps. They're not production ready,
but they could probably be made so without too much effort. You wouldn't
have ready-made generation numbers for commits since the last full
repack, but you can compute them incrementally based on what you do have
at a cost linear to the unpacked commits (this is the same for bitmaps).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 18:59 ` Junio C Hamano
  2016-06-29 20:20   ` Stefan Beller
@ 2016-06-29 21:00   ` Jakub Narębski
  1 sibling, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-06-29 21:00 UTC (permalink / raw)
  To: Junio C Hamano, Marc Strapetz; +Cc: Git Mailing List

W dniu 2016-06-29 o 20:59, Junio C Hamano pisze:
> On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
> <marc.strapetz@syntevo.com> wrote:
>
>> This is no RFE but rather recurring thoughts whenever I'm working with
>> commit graphs: a topological index attribute for commit objects would be
>> incredible useful. By "topological index" I mean a simple integer for which
>> following condition holds true:
>> 
>>  if commit C is part of the history of commit D,
>>    then C's topological index is smaller than D's index
>> 
>> This would allow topological sorting of commits (e.g. in queues) on the fly
>> and quickly give a "no" answer on the question whether D is part of
>> C's history. 
> 
> Look for "generation numbers" in the list archive, perhaps?

If I remember correctly the discussion, the problem by adding generation number
to the commit object format was twofold (at least).

First there was a problem of backward compatibility, namely what to do with
existing repositories, where commit objects do not have generation number.
Objects in Git are immutable (and I think replacements mechanism wasn't
invented yet - anyway too many replacements would slow down operations
I guess).

Second objection was of philosophical nature: generation numbers duplicate
(cache) information that is stored in the graph of revisions. Also, what
if they get out of sync?

That is, if I remember the summary of that discussion correctly.


Also, generation numbers (graph level) only help with topological sorting;
for finding of two commits are connected (two nodes are connected) people
play with different ideas, for example FELINE index:
  http://openproceedings.org/EDBT/2014/paper_166.pdf

Nowadays there is also [compressed] bitmap index (if enabled), though I am
not sure if it is yet used to speed-up reachability queries
  http://githubengineering.com/counting-objects/
  https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:54       ` Stefan Beller
@ 2016-06-29 21:37         ` Stefan Beller
  2016-06-29 21:43           ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2016-06-29 21:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Linus Torvalds, Marc Strapetz, Git Mailing List

On Wed, Jun 29, 2016 at 1:54 PM, Stefan Beller <sbeller@google.com> wrote:
> Chances are that the 10 years of history may be correct time wise as long
> as people don't introduce a bad date malevolently.
>

To answer my own speculation:
Even git.git violates the timing property, so there is no hope to find
many projects
to have "parents committer date < committer date"

$ git log --format="%H %ct %P" HEAD | {
> while read sha sha_date parents ; do
> for p in $parents ; do
> # echo "$id, $parents, $p"
> pd=$(git show -s --format="%ct" $p)
> if test $pd -gt $sha_date ; then
> echo $sha
> fi
> done
> done
> }
619a644d6daef56d70aeca85514e2d281eb483a5
776398709dee4050fc194fec45c5818ba9b01afe
ed19f367220a50e4e2a5c1a00b03c14eafcaba89

(out of curiosity these are:)
(2009-10-18 12:34:56, "checkout A...B" switches to the merge base
between A and B)
(2007-09-01 23:53:47, Keep last used delta base in the delta window)
(2006-03-03 23:29:56, Add a Documentation/git-tools.txt)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 21:37         ` Stefan Beller
@ 2016-06-29 21:43           ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2016-06-29 21:43 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Junio C Hamano, Linus Torvalds, Marc Strapetz, Git Mailing List

On Wed, Jun 29, 2016 at 02:37:17PM -0700, Stefan Beller wrote:

> On Wed, Jun 29, 2016 at 1:54 PM, Stefan Beller <sbeller@google.com> wrote:
> > Chances are that the 10 years of history may be correct time wise as long
> > as people don't introduce a bad date malevolently.
> >
> 
> To answer my own speculation:
> Even git.git violates the timing property, so there is no hope to find
> many projects
> to have "parents committer date < committer date"

It's expected and reasonable for there to be some skew. People's clocks
aren't all perfectly synced. The question is one of how _much_ skew
you're willing to tolerate (and any algorithms obviously need to allow
that much, too).

In some places we allow 24 hours of skew between any two commits. In
others we allow up to 5 skewed commits in a row.

That may not be enough in general, though. E.g.:

> 619a644d6daef56d70aeca85514e2d281eb483a5

This one is a 3-day skew.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:56       ` Jeff King
@ 2016-06-29 21:49         ` Jakub Narębski
  2016-06-29 22:00           ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Narębski @ 2016-06-29 21:49 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: Stefan Beller, Linus Torvalds, Marc Strapetz, Git Mailing List

W dniu 2016-06-29 o 22:56, Jeff King pisze:
> On Wed, Jun 29, 2016 at 01:39:17PM -0700, Junio C Hamano wrote:
> 
>>> Would it make sense to refuse creating commits that have a commit date
>>> prior to its parents commit date (except when the user gives a
>>> `--dammit-I-know-I-break-a-wildy-used-heuristic`)?
>>
>> I think that has also been discussed in the past.  I do not think it
>> would help very much in practice, as projects already have up to 10
>> years (and the ones migrated from CVS, even more) worth of commits
>> they cannot rewrite that may record incorrect committer dates.
> 
> Yep, it has been discussed and I agree it runs into a lot of corner
> cases.
> 
>> If the use of generation number can somehow be limited narrowly, we
>> may be able to incrementally introduce it only for new commits, but
>> I haven't thought things through, so let me do so aloud here ;-)
> 
> I think the problem is that you really _do_ want generation numbers for
> old commits. One of the most obvious cases is something like "tag
> --contains HEAD", because it has to examine older tags.
> 
> So your history looks something like:
> 
>   A -- B -- ... Z
>         \        \
> 	 v1.0     HEAD
> 
> Without generation numbers (or some proxy), you have to walk the history
> between B..Z to find the answer. With generation numbers, it is
> immediately obvious.
> 
> So this is the ideal case for generation numbers (the worst cases are
> when the things you are looking for are in branchy, close history where
> the generation numbers don't tell you much; but in such cases the
> walking is usually not too bad).

There are other approaches (special indices) that help reachability
queries beside "generation number".

> 
> So I think you really do want to be able to generate and store
> generation numbers after the fact. That has an added bonus that you do
> not have to worry about baking incorrect values into your objects; you
> do the topological walk once, and you _know_ it is correct (at least as
> correct as the parent links, but that is our source of truth).

By the way, what should happen if you add a replacement (in the git-replace
meaning) that creates a shortcut, therefore invalidating generation numbers,
at least in strict sense - committerdate as generation number would be still
good, I think?
 
> I have patches that generate and store the numbers at pack time, similar
> to the way we do the reachability bitmaps. They're not production ready,
> but they could probably be made so without too much effort. You wouldn't
> have ready-made generation numbers for commits since the last full
> repack, but you can compute them incrementally based on what you do have
> at a cost linear to the unpacked commits (this is the same for bitmaps).

Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
limited to object counting?

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 21:49         ` Jakub Narębski
@ 2016-06-29 22:00           ` Jeff King
  2016-06-29 22:11             ` Junio C Hamano
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Jeff King @ 2016-06-29 22:00 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:

> > So this is the ideal case for generation numbers (the worst cases are
> > when the things you are looking for are in branchy, close history where
> > the generation numbers don't tell you much; but in such cases the
> > walking is usually not too bad).
> 
> There are other approaches (special indices) that help reachability
> queries beside "generation number".

Yes, though generation numbers can help with more questions (e.g., "what
is the merge base").

> By the way, what should happen if you add a replacement (in the git-replace
> meaning) that creates a shortcut, therefore invalidating generation numbers,
> at least in strict sense - committerdate as generation number would be still
> good, I think?

This is one of the open questions. My older patches turned them off when
replacements and grafts are in effect.

> > I have patches that generate and store the numbers at pack time, similar
> > to the way we do the reachability bitmaps. They're not production ready,
> > but they could probably be made so without too much effort. You wouldn't
> > have ready-made generation numbers for commits since the last full
> > repack, but you can compute them incrementally based on what you do have
> > at a cost linear to the unpacked commits (this is the same for bitmaps).
> 
> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
> limited to object counting?

At GitHub we are using them for --contains analysis, along with mass
ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
plan is to send patches upstream, but they need some cleanup first.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 22:00           ` Jeff King
@ 2016-06-29 22:11             ` Junio C Hamano
  2016-06-29 22:30               ` Jeff King
  2016-06-30 10:30             ` Jakub Narębski
  2016-07-20  0:07             ` Jakub Narębski
  2 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2016-06-29 22:11 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narębski, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

Jeff King <peff@peff.net> writes:

> Yes, though generation numbers can help with more questions (e.g., "what
> is the merge base").

Hmm, how?  I have two commits, with generation number 38 and 47,
respectively.  What generation number does the commit that is the
merge base of these two commits?

I know we can say that 38 cannot possibly be a descendant of 47, but
can we say anything else that is useful?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 20:39     ` Junio C Hamano
  2016-06-29 20:54       ` Stefan Beller
  2016-06-29 20:56       ` Jeff King
@ 2016-06-29 22:15       ` Marc Strapetz
  2 siblings, 0 replies; 31+ messages in thread
From: Marc Strapetz @ 2016-06-29 22:15 UTC (permalink / raw)
  To: Junio C Hamano, Stefan Beller; +Cc: Jeff King, Linus Torvalds, Git Mailing List

On 29.06.2016 22:39, Junio C Hamano wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> On Wed, Jun 29, 2016 at 11:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
>>> <marc.strapetz@syntevo.com> wrote:
>>>> This is no RFE but rather recurring thoughts whenever I'm working with
>>>> commit graphs: a topological index attribute for commit objects would be
>>>> incredible useful. By "topological index" I mean a simple integer for which
>>>> following condition holds true:
>>>
>>> Look for "generation numbers" in the list archive, perhaps?
>>
>> Thanks for the pointer to the interesting discussions.
>>
>> In http://www.spinics.net/lists/git/msg161363.html
>> Linus wrote in a discussion with Jeff:
>>
>>> Right now, we do *have* a "generation number". It's just that it's
>>> very easy to corrupt even by mistake. It's called "committer date". We
>>> could improve on it.
>>
>> Would it make sense to refuse creating commits that have a commit date
>> prior to its parents commit date (except when the user gives a
>> `--dammit-I-know-I-break-a-wildy-used-heuristic`)?
>
> I think that has also been discussed in the past.  I do not think it
> would help very much in practice, as projects already have up to 10
> years (and the ones migrated from CVS, even more) worth of commits
> they cannot rewrite that may record incorrect committer dates.
> You'd need something like "you can trust committer dates that are
> newer that this date" per project to switch between slow path and
> fast path, with an updated fsck that knows how to compute that
> number after you pulled from somebody who used that overriding
> option.
>
> If the use of generation number can somehow be limited narrowly, we
> may be able to incrementally introduce it only for new commits, but
> I haven't thought things through, so let me do so aloud here ;-)
>
> Suppose we use it only for this purpose:
>
>  * When we have two commits, C1 and C2, with generation numbers G1
>    and G2, we can say "C1 cannot possibly be an ancestor of C2" if
>    G1 > G2.  We cannot say anything else based on generation
>    numbers (or lack thereof).
>
> then I think we could just say "A newly created commit must record
> generation number G that is larger than generation numbers of its
> parent commits; ignore parents that lack generation number for the
> purpose of this sentence".

 From algorithm perspective, for already existing repositories you would 
still have to switch from an optimized generation number code to the 
current commit-time based code. That could things make even more complex 
and it's possibly expensive to determine whether a repository has full 
generation number support or not.

On the other hand, for new repositories, you could immediately use 
generation number based algorithms. So it could be "A newly created 
commit must record generation number G that is larger than generation 
numbers of its parent commits if all parents commits have a generation 
number recorded; otherwise do not record a generation number". Something 
like "git filter-branch" might already be sufficient to convert 
repositories.

Git versions released in 2019 may start issuing warnings if HEAD has no 
generation number assigned and Git versions released in 2025 may 
completely refuse to work with such repositories.

In the interim period, a local cache as Jeff is proposing could serve as 
secondary source for generation numbers. This would allow to phase out 
current algorithms immediately.

-Marc




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 22:11             ` Junio C Hamano
@ 2016-06-29 22:30               ` Jeff King
  2016-07-05 11:43                 ` Johannes Schindelin
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2016-06-29 22:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jakub Narębski, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

On Wed, Jun 29, 2016 at 03:11:39PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Yes, though generation numbers can help with more questions (e.g., "what
> > is the merge base").
> 
> Hmm, how?  I have two commits, with generation number 38 and 47,
> respectively.  What generation number does the commit that is the
> merge base of these two commits?
> 
> I know we can say that 38 cannot possibly be a descendant of 47, but
> can we say anything else that is useful?

I don't think it can give you the answer immediately from those values,
but in general generation numbers help you bound actual traversals and
avoid walking down unproductive paths. So comparing 38 and 47 is not
nearly as useful as walking from 47 down to 37 and saying "I know that I
cannot possibly reach 38 by walking further".

I haven't thought hard specifically about merge bases computation, so
perhaps that is a case that isn't helped at all. It has been a while
since I've looked into this, but I recall there were some computations
that are hard to improve with bitmaps. I may also simply be
mis-remembering; in my patches generations were tightly tied to having a
packv4-style cache of commit properties that can be used rather than
inflating the commit object itself. That cache helps any time you walk
by speeding up the parse_commit() process. But it's a per-commit
speedup. It doesn't change the algorithmic complexity of what you're
doing.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 22:00           ` Jeff King
  2016-06-29 22:11             ` Junio C Hamano
@ 2016-06-30 10:30             ` Jakub Narębski
  2016-06-30 18:12               ` Linus Torvalds
  2016-07-01  6:54               ` Jeff King
  2016-07-20  0:07             ` Jakub Narębski
  2 siblings, 2 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-06-30 10:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

W dniu 2016-06-30 o 00:00, Jeff King pisze:
> On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:
> 
>>> So this is the ideal case for generation numbers (the worst cases are
>>> when the things you are looking for are in branchy, close history where
>>> the generation numbers don't tell you much; but in such cases the
>>> walking is usually not too bad).
>>
>> There are other approaches (special indices) that help reachability
>> queries beside "generation number".
> 
> Yes, though generation numbers can help with more questions (e.g., "what
> is the merge base").

Well, those special indices usually need generation number anyway. For
example FELINE indices^1 (from "Reachability Queries in Very Large Graphs",
http://openproceedings.org/EDBT/2014/paper_166.pdf, that I found when
trying to find more about compressed bitmap indices used by Git) also
utilize generation numbers to speed up reachability analysis. The idea
behind FELINE is to associate every vertex (commit) with a unique ordered
pair of natural integers (i.e. two more numbers, in addition to the level
aka generation number), so that partial order over N^2 is superset of
partial order of graph (DAG of revisions). It can answer in O(1) that
nodes are not connected, and cut search space if they are.

BTW. some references in this paper ("Related work" section) use PWAH
compression scheme - perhaps it would work with EWAH that Git uses?

1. FELINE = Fast rEfined onLINE search ... or fun with acronyms

>> By the way, what should happen if you add a replacement (in the git-replace
>> meaning) that creates a shortcut, therefore invalidating generation numbers,
>> at least in strict sense - committerdate as generation number would be still
>> good, I think?
> 
> This is one of the open questions. My older patches turned them off when
> replacements and grafts are in effect.

Well, if you store the cache of generation numbers in the packfile, or in
the index of the packfile, or in the bitmap file, or in separate bitmap-like
file, generating them on repack, then of course any grafts or replacements
invalidate them... though for low level commands (like object counting)
replacements are transparent -- or rather they are (and can be) treated as
any other ref for reachability analysis.

Well, if there are no grafts, you could still use them for doing
"git --no-replace-objects log ...", isn't it?

SIDENOTE: should grafts be deprecated and later removed, now that Git has
superior alternative in the form of git-replace, and a simple way to convert
grafts to replacements?

Anyway, if file with mapping from revisions to its generation numbers
were stored outside of packfiles, it could simply be updated / rewritten
when adding new replacement. You could use one cache for no-replace,
and one for replace. Though I do wonder if it would be possible to do
such rewrite fast... well, fast enough assuming that adding replacements
is rare.

Answering my own question:
>> committerdate as generation number would be still good, I think?
No, it wouldn't:

  pre replace:

    1 <-- 2 <-- 3 <-- 5 <--
           \
            \----- 4 <-- 6 <-- 7 <-- 8

  post replace

   1 <-- 2                              \-- 3' <-- 5
          \                              \
           \------ 4 <-- 6 <-- 7 <-- 8 <--\

Either the replacement commit would have generation number correct, but
its child wouldn't, or its child would have correct generation number but
the replacement wouldn't.

>>> I have patches that generate and store the numbers at pack time, similar
>>> to the way we do the reachability bitmaps.

Ah, so those cached generation numbers are generated and stored at pack
time. Where you store them: is it a separate file? Bitmap file? Packfile?

>>>                                            They're not production ready,
>>> but they could probably be made so without too much effort. You wouldn't
>>> have ready-made generation numbers for commits since the last full
>>> repack, but you can compute them incrementally based on what you do have
>>> at a cost linear to the unpacked commits (this is the same for bitmaps).
>>
>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>> limited to object counting?
> 
> At GitHub we are using them for --contains analysis, along with mass
> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> plan is to send patches upstream, but they need some cleanup first.

That would be nice to have, please.

Er, is mass ahead/behind something that can be plugged into Git
(e.g. for "git branch -v -v"), or is it something GitHub-specific?

P.S. Having Git ensure that committerdate (as an epoch) is greater
than committerdates of its parents at the commit creation time (with
providing warning about time skew, perhaps not doing it if skew is
too large) would be not costly. No need to add anything, and it would
help future queries to be faster, I think.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-30 10:30             ` Jakub Narębski
@ 2016-06-30 18:12               ` Linus Torvalds
  2016-06-30 23:39                 ` Jakub Narębski
                                   ` (2 more replies)
  2016-07-01  6:54               ` Jeff King
  1 sibling, 3 replies; 31+ messages in thread
From: Linus Torvalds @ 2016-06-30 18:12 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Marc Strapetz,
	Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1893 bytes --]

On Thu, Jun 30, 2016 at 3:30 AM, Jakub Narębski <jnareb@gmail.com> wrote:
>
> P.S. Having Git ensure that committerdate (as an epoch) is greater
> than committerdates of its parents at the commit creation time (with
> providing warning about time skew, perhaps not doing it if skew is
> too large) would be not costly. No need to add anything, and it would
> help future queries to be faster, I think.

So I think git should check the committer date (ie verify that the
commitetr date of a new commit is always equal to or more recent than
all the parents).

But we should probably allow some slop for when machines are out of
sync (think build farms etc automation that may do automated commits,
but the clocks on different machines may be off by a few seconds). We
already do things like that in some of the use of committer dates -
see for example CUTOFF_DATE_SLOP)

And we should probably allow the user to override the refusal to
create new commits with bogus dates - think importing repositories etc
where you may have reasons to just say "that's just how it was".

And it will take years for that to percolate out to all users, so it's
not like it will fix things immediately, but then some time from now,
we can consider commit dates to be more reliably generation numbers.

I do think that it's ok to cache generation numbers somewhere if there
is an algorithm that can make use of them, but every time this comes
up, it's just not been important enough to make a big deal and a new
incompatible object format for it. The committer date is preexisting
and has existing pseudo-generation-number usage, so..improving on the
quality of it sounds like a good idea.

The first step should probably be to make fsck warn about the existing
cases of "commit has older date than parents". Something like the
attached patch, perhaps?

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1262 bytes --]

 fsck.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fsck.c b/fsck.c
index 05315451c56f..722b65371d38 100644
--- a/fsck.c
+++ b/fsck.c
@@ -60,6 +60,7 @@
 	FUNC(NULL_SHA1, WARN) \
 	FUNC(ZERO_PADDED_FILEMODE, WARN) \
 	FUNC(NUL_IN_COMMIT, WARN) \
+	FUNC(DATE_ORDERING, WARN) \
 	/* infos (reported as warnings, but ignored by default) */ \
 	FUNC(BAD_TAG_NAME, INFO) \
 	FUNC(MISSING_TAGGER_ENTRY, INFO)
@@ -604,6 +605,17 @@ static int fsck_ident(const char **ident, struct object *obj, struct fsck_option
 	return 0;
 }

+static void fsck_commit_date(struct fsck_options *options, struct commit *commit)
+{
+	struct commit_list *p;
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		if (commit->date < parent->date)
+			report(options, &commit->object, FSCK_MSG_DATE_ORDERING, "Bad commit date ordering with parent");
+	}
+}
+
 static int fsck_commit_buffer(struct commit *commit, const char *buffer,
 	unsigned long size, struct fsck_options *options)
 {
@@ -650,6 +662,7 @@ static int fsck_commit_buffer(struct commit *commit, const char *buffer,
 				return err;
 		}
 	}
+	fsck_commit_date(options, commit);
 	author_count = 0;
 	while (skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-30 18:12               ` Linus Torvalds
@ 2016-06-30 23:39                 ` Jakub Narębski
  2016-06-30 23:59                 ` Mike Hommey
  2016-07-01  3:17                 ` Jeff King
  2 siblings, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-06-30 23:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Junio C Hamano, Stefan Beller, Marc Strapetz,
	Git Mailing List

W dniu 2016-06-30 o 20:12, Linus Torvalds pisze:
> On Thu, Jun 30, 2016 at 3:30 AM, Jakub Narębski <jnareb@gmail.com> wrote:
>>
>> P.S. Having Git ensure that committerdate (as an epoch) is greater
>> than committerdates of its parents at the commit creation time (with
>> providing warning about time skew, perhaps not doing it if skew is
>> too large) would be not costly. No need to add anything, and it would
>> help future queries to be faster, I think.
> 
> So I think git should check the committer date (ie verify that the
> committer date of a new commit is always equal to or more recent than
> all the parents).
> 
> But we should probably allow some slop for when machines are out of
> sync (think build farms etc automation that may do automated commits,
> but the clocks on different machines may be off by a few seconds). We
> already do things like that in some of the use of committer dates -
> see for example CUTOFF_DATE_SLOP).

The problem with this idea is that the clock skew might be in two
directions, but it does fix/help only one.  If committer's clock
lags behind true time, the automatic bump to have committerdate of
a new commit be greater than committerdates of all its parent is quite
helpful.

However, if some developer has his or her clock misconfigured so that
it gives time in the future, this feature would not help.  Commits from
such developer (from such machine) would screw it up for others.

That's why I think some limit is needed.  For example, if for some
reason commit was created with committer date 3 days in the future,
we would probably do not want for other developers to have to lie
about when they created their commit.

Or did you mean that if the date for new commit is in the past
of committerdates of its parents, its all right when it is within
slop?

> 
> And we should probably allow the user to override the refusal to
> create new commits with bogus dates - think importing repositories etc
> where you may have reasons to just say "that's just how it was".

Right. I was thinking about (ab)using --no-verify flag, or perhaps
prompting user if the terminal is interactive. But perhaps a separate
flag / environment variable would be better...

> And it will take years for that to percolate out to all users, so it's
> not like it will fix things immediately, but then some time from now,
> we can consider commit dates to be more reliably generation numbers.
[...]

That's the idea of the proposal.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-30 18:12               ` Linus Torvalds
  2016-06-30 23:39                 ` Jakub Narębski
@ 2016-06-30 23:59                 ` Mike Hommey
  2016-07-01  3:17                 ` Jeff King
  2 siblings, 0 replies; 31+ messages in thread
From: Mike Hommey @ 2016-06-30 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narębski, Jeff King, Junio C Hamano, Stefan Beller,
	Marc Strapetz, Git Mailing List

On Thu, Jun 30, 2016 at 11:12:52AM -0700, Linus Torvalds wrote:
> On Thu, Jun 30, 2016 at 3:30 AM, Jakub Narębski <jnareb@gmail.com> wrote:
> >
> > P.S. Having Git ensure that committerdate (as an epoch) is greater
> > than committerdates of its parents at the commit creation time (with
> > providing warning about time skew, perhaps not doing it if skew is
> > too large) would be not costly. No need to add anything, and it would
> > help future queries to be faster, I think.
> 
> So I think git should check the committer date (ie verify that the
> commitetr date of a new commit is always equal to or more recent than
> all the parents).
> 
> But we should probably allow some slop for when machines are out of
> sync (think build farms etc automation that may do automated commits,
> but the clocks on different machines may be off by a few seconds). We
> already do things like that in some of the use of committer dates -
> see for example CUTOFF_DATE_SLOP)
> 
> And we should probably allow the user to override the refusal to
> create new commits with bogus dates - think importing repositories etc
> where you may have reasons to just say "that's just how it was".
> 
> And it will take years for that to percolate out to all users, so it's
> not like it will fix things immediately, but then some time from now,
> we can consider commit dates to be more reliably generation numbers.

This is an assumption that might work for git repositories, but not for
foreign repositories imported in git (think e.g. mercurial repositories
accessed with a remote helper)

Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-30 18:12               ` Linus Torvalds
  2016-06-30 23:39                 ` Jakub Narębski
  2016-06-30 23:59                 ` Mike Hommey
@ 2016-07-01  3:17                 ` Jeff King
  2016-07-01  6:45                   ` Marc Strapetz
                                     ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Jeff King @ 2016-07-01  3:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narębski, Junio C Hamano, Stefan Beller, Marc Strapetz,
	Git Mailing List

On Thu, Jun 30, 2016 at 11:12:52AM -0700, Linus Torvalds wrote:

> I do think that it's ok to cache generation numbers somewhere if there
> is an algorithm that can make use of them, but every time this comes
> up, it's just not been important enough to make a big deal and a new
> incompatible object format for it. The committer date is preexisting
> and has existing pseudo-generation-number usage, so..improving on the
> quality of it sounds like a good idea.

If you are OK with a cache, I don't think one needs to change the object
format at all. It can be computed on the fly, and is purely a local
optimization.

> The first step should probably be to make fsck warn about the existing
> cases of "commit has older date than parents". Something like the
> attached patch, perhaps?

I have mixed feelings on this, because it forces the user to confront
the issue at a time that's potentially very far from when it actually
happened (and often when it is not their fault).

I expect most people don't run fsck at all under normal circumstances,
so it is only when they have a problem of some sort that they would see
this warning at all. It would kick in and prevent objects being
transferred when things like receive.fsckObjects are configured, but
it's not on by default. GitHub does enable it, so pushing there is often
the first notification people get about any kind of problem.

This is a general problem with all fsck-driven warnings (people may
fetch without realizing they're getting breakage, and such objects may
get years of history built on top). But I think it can be even worse
with timestamps, because the broken state may not even be recognizable
when you first fetch the troublesome object.

E.g., imagine somebody else has their clock set forward, and you fetch
from them. Their object by itself is not broken. It is only when you
want to commit on top of it, with the correct clock, that the broken
state is created (and then, we cannot know whether it is your clock or
the original committer's clock that is bad).

So I think it would be more productive to put a check like this in "git
commit" rather than (or perhaps in addition to) fsck. That prevents
us creating the broken relationship, but it does still mean the user may
have to to go back and tell the original committer that their clock was
broken.

You could also have the fsck check look not only for out-of-order
commits, but also commits in the future (from the perspective of the
receiver). That would reject such broken commits before they even hit
your repository (though again, it is unclear in such a case if the
commit is broken or the clock of the checker).

> +static void fsck_commit_date(struct fsck_options *options, struct commit *commit)
> +{
> +	struct commit_list *p;
> +
> +	for (p = commit->parents; p; p = p->next) {
> +		struct commit *parent = p->item;
> +		if (commit->date < parent->date)
> +			report(options, &commit->object, FSCK_MSG_DATE_ORDERING, "Bad commit date ordering with parent");
> +	}
> +}

I didn't test it, but I suspect this won't work reliably, as we do not
always have the parents parsed during an fsck check (e.g., during
"index-pack --strict").

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-01  3:17                 ` Jeff King
@ 2016-07-01  6:45                   ` Marc Strapetz
  2016-07-01  9:48                   ` Jakub Narębski
  2016-07-01 16:08                   ` Junio C Hamano
  2 siblings, 0 replies; 31+ messages in thread
From: Marc Strapetz @ 2016-07-01  6:45 UTC (permalink / raw)
  To: Jeff King, Linus Torvalds
  Cc: Jakub Narębski, Junio C Hamano, Stefan Beller,
	Git Mailing List

On 01.07.2016 05:17, Jeff King wrote:
> On Thu, Jun 30, 2016 at 11:12:52AM -0700, Linus Torvalds wrote:
>
>> I do think that it's ok to cache generation numbers somewhere if there
>> is an algorithm that can make use of them, but every time this comes
>> up, it's just not been important enough to make a big deal and a new
>> incompatible object format for it. The committer date is preexisting
>> and has existing pseudo-generation-number usage, so..improving on the
>> quality of it sounds like a good idea.
>
> If you are OK with a cache, I don't think one needs to change the object
> format at all. It can be computed on the fly, and is purely a local
> optimization.

What I like about the local cache is that it can be fixed easily: if we 
had generation numbers already, there would certainly be a few 
repositories now which would violate the parent - child condition 
(however that has happened) and then you have to deal with this scenario 
anyway (or tell the user to rewrite his entire history). It's a lot 
easier to throw away and rebuild the cache.

Also, the local cache can be improved over time, starting with 
generation numbers now and one day supporting a FELINE index or whatever 
will come.

-Marc

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-30 10:30             ` Jakub Narębski
  2016-06-30 18:12               ` Linus Torvalds
@ 2016-07-01  6:54               ` Jeff King
  2016-07-01  9:59                 ` Jakub Narębski
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff King @ 2016-07-01  6:54 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Junio C Hamano, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

On Thu, Jun 30, 2016 at 12:30:31PM +0200, Jakub Narębski wrote:

> > This is one of the open questions. My older patches turned them off when
> > replacements and grafts are in effect.
> 
> Well, if you store the cache of generation numbers in the packfile, or in
> the index of the packfile, or in the bitmap file, or in separate bitmap-like
> file, generating them on repack, then of course any grafts or replacements
> invalidate them... though for low level commands (like object counting)
> replacements are transparent -- or rather they are (and can be) treated as
> any other ref for reachability analysis.
> 
> Well, if there are no grafts, you could still use them for doing
> "git --no-replace-objects log ...", isn't it?

Yes, replace refs don't invalidate the concept of a cache. It just
means that you invalidate the invariants of the cache for a specific
view, so you need a cache which matches that view.

It has been several years, but I remember at one point having patches
that summarized the graft/replace state as a single hash, and only used
the cache if it matched that state. So you could actually keep a cache
for some set of replace-refs that you have, as well as a cache for the
case that you've turned them off, etc.

I don't think that level of complexity is really worth it, though.

> >>> I have patches that generate and store the numbers at pack time, similar
> >>> to the way we do the reachability bitmaps.
> 
> Ah, so those cached generation numbers are generated and stored at pack
> time. Where you store them: is it a separate file? Bitmap file? Packfile?

There were a few iterations of the concept over the years, but the
pack-time one uses a separate file with the same name prefix as a pack
(similar to the way bitmaps are stored). The big advantage there is that
we can piggy-back on the pack .idx to avoid having to write each sha1
again (20 bytes per commit, whereas the actual data we're caching is
only 4 bytes).

> > At GitHub we are using them for --contains analysis, along with mass
> > ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> > plan is to send patches upstream, but they need some cleanup first.
> 
> That would be nice to have, please.
> 
> Er, is mass ahead/behind something that can be plugged into Git
> (e.g. for "git branch -v -v"), or is it something GitHub-specific?

We have a custom command, "git ahead-behind", where you can specify
arbitrary pairs of commits on stdin. But it's all backed by a function
which, yes, could be plugged into "branch -v -v". It caches any bitmaps
it needs, so if you are doing 100 ahead/behind comparisons against
"master", for example, it only has to find the bitmap for "master" once
(remember that we sometimes have to traverse to complete a bitmap when
a branch has been updated since the last repack).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-01  3:17                 ` Jeff King
  2016-07-01  6:45                   ` Marc Strapetz
@ 2016-07-01  9:48                   ` Jakub Narębski
  2016-07-01 16:08                   ` Junio C Hamano
  2 siblings, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-07-01  9:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Stefan Beller, Marc Strapetz, Git Mailing List

W dniu 2016-07-01 o 05:17, Jeff King pisze:
> On Thu, Jun 30, 2016 at 11:12:52AM -0700, Linus Torvalds wrote:
> 
>> I do think that it's ok to cache generation numbers somewhere if there
>> is an algorithm that can make use of them, but every time this comes
>> up, it's just not been important enough to make a big deal and a new
>> incompatible object format for it. The committer date is preexisting
>> and has existing pseudo-generation-number usage, so..improving on the
>> quality of it sounds like a good idea.
> 
> If you are OK with a cache, I don't think one needs to change the object
> format at all. It can be computed on the fly, and is purely a local
> optimization.

Note that if cache is generated during (re)packing, it could be optimized
for reading. If we want a cache that follow replacements, it would need
to be adjusted after each new or modified replacement; assuming that one
doesn't modify replacements by hand (then Git would need to detect changes,
and modify cache if needed, perhaps on query, e.g. "git log").

>> The first step should probably be to make fsck warn about the existing
>> cases of "commit has older date than parents". Something like the
>> attached patch, perhaps?
> 
> I have mixed feelings on this, because it forces the user to confront
> the issue at a time that's potentially very far from when it actually
> happened (and often when it is not their fault).

It would be nice to have as an optional check, like e.g. unreachable
objects, or use warning instead of error, like e.g. dangling objects
warning. This would be the default; it can always be changed using
the `fsck.<msg-id>` and `receive.fsck.<msg-id>` configuration variables.

This way we could use this mode to adjust clock-skew heuristics that
Git uses to speed up reachability queries (including time slop), be
it automatically or via a configuration variable.

[...]
> E.g., imagine somebody else has their clock set forward, and you fetch
> from them. Their object by itself is not broken. It is only when you
> want to commit on top of it, with the correct clock, that the broken
> state is created (and then, we cannot know whether it is your clock or
> the original committer's clock that is bad).
> 
> So I think it would be more productive to put a check like this in "git
> commit" rather than (or perhaps in addition to) fsck. That prevents
> us creating the broken relationship, but it does still mean the user may
> have to to go back and tell the original committer that their clock was
> broken.

Well, check at "git commit" time (perhaps with automatic fixing to used
committerdate for the commit being created, within specified conditions)
cannot distinguish between the two cases:

 * committerdate in parent commit(s) is correct,
   but the clock is in the past

 * committerdate in parent commit(s) is incorrect, and in the future,
   while the clock is correct
 
> You could also have the fsck check look not only for out-of-order
> commits, but also commits in the future (from the perspective of the
> receiver). That would reject such broken commits before they even hit
> your repository (though again, it is unclear in such a case if the
> commit is broken or the clock of the checker).

Well, one heuristics is that if commits in the future, and/or commits
with committerdate out of order have all the same committer, then
probably commits are broken. Another "heuristics" would be to assume
that if you invoked date-checking functionality, you have ensured that
your clock is correct.

-- 
Jakub Narębski



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-01  6:54               ` Jeff King
@ 2016-07-01  9:59                 ` Jakub Narębski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-07-01  9:59 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

W dniu 2016-07-01 o 08:54, Jeff King pisze:
> On Thu, Jun 30, 2016 at 12:30:31PM +0200, Jakub Narębski wrote:
> 
>>> This is one of the open questions. My older patches turned them off when
>>> replacements and grafts are in effect.
>>
>> Well, if you store the cache of generation numbers in the packfile, or in
>> the index of the packfile, or in the bitmap file, or in separate bitmap-like
>> file, generating them on repack, then of course any grafts or replacements
>> invalidate them... though for low level commands (like object counting)
>> replacements are transparent -- or rather they are (and can be) treated as
>> any other ref for reachability analysis.
>>
>> Well, if there are no grafts, you could still use them for doing
>> "git --no-replace-objects log ...", isn't it?
> 
> Yes, replace refs don't invalidate the concept of a cache. It just
> means that you invalidate the invariants of the cache for a specific
> view, so you need a cache which matches that view.
> 
> It has been several years, but I remember at one point having patches
> that summarized the graft/replace state as a single hash, and only used
> the cache if it matched that state. So you could actually keep a cache
> for some set of replace-refs that you have, as well as a cache for the
> case that you've turned them off, etc.
> 
> I don't think that level of complexity is really worth it, though.

Well, you could always update the reachability-helpers cache when running
`git replace` command, and when fetching into 'refs/replace' namespace...

...but this wouldn't take into account the fact that you can change
replace refs "by hand", and that grafts file^{1} is only editable by hand.
So at query time Git would need to check (e.g. via hash of graft file,
hash of packed-refs refs/replace namespace, concatenated) that said
cache is still valid for replace-respecting view. And perhaps update
said cache.

Though if we limit ourself to the replacements mechanism, we could
have a configuration variable saying "I will manipulate replacements
only using git-replace, and I want faster reachability", isn't it?


1.) Can we deprecate and remove grafts mechanism now that we have superior
solution and migration mechanism? 
 
>>>>> I have patches that generate and store the numbers at pack time, similar
>>>>> to the way we do the reachability bitmaps.
>>
>> Ah, so those cached generation numbers are generated and stored at pack
>> time. Where you store them: is it a separate file? Bitmap file? Packfile?
> 
> There were a few iterations of the concept over the years, but the
> pack-time one uses a separate file with the same name prefix as a pack
> (similar to the way bitmaps are stored). The big advantage there is that
> we can piggy-back on the pack .idx to avoid having to write each sha1
> again (20 bytes per commit, whereas the actual data we're caching is
> only 4 bytes).

Does it use any lightweight compression mechanism, or is it not needed?
How does the format of this file looks like?
 
>>> At GitHub we are using them for --contains analysis, along with mass
>>> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
>>> plan is to send patches upstream, but they need some cleanup first.
>>
>> That would be nice to have, please.
>>
>> Er, is mass ahead/behind something that can be plugged into Git
>> (e.g. for "git branch -v -v"), or is it something GitHub-specific?
> 
> We have a custom command, "git ahead-behind", where you can specify
> arbitrary pairs of commits on stdin. But it's all backed by a function
> which, yes, could be plugged into "branch -v -v". It caches any bitmaps
> it needs, so if you are doing 100 ahead/behind comparisons against
> "master", for example, it only has to find the bitmap for "master" once
> (remember that we sometimes have to traverse to complete a bitmap when
> a branch has been updated since the last repack).

That would be nice to have (perhaps invoked only if number of branches
is high enough; that excludes using it for ahead-behind information that
`git checkout` prints).

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-01  3:17                 ` Jeff King
  2016-07-01  6:45                   ` Marc Strapetz
  2016-07-01  9:48                   ` Jakub Narębski
@ 2016-07-01 16:08                   ` Junio C Hamano
  2 siblings, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2016-07-01 16:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Linus Torvalds, Jakub Narębski, Stefan Beller, Marc Strapetz,
	Git Mailing List

Jeff King <peff@peff.net> writes:

> So I think it would be more productive to put a check like this in "git
> commit" rather than (or perhaps in addition to) fsck. That prevents
> us creating the broken relationship, but it does still mean the user may
> have to to go back and tell the original committer that their clock was
> broken.
>
> You could also have the fsck check look not only for out-of-order
> commits, but also commits in the future (from the perspective of the
> receiver). That would reject such broken commits before they even hit
> your repository (though again, it is unclear in such a case if the
> commit is broken or the clock of the checker).

I agree 100% with the above two paragraphs.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 22:30               ` Jeff King
@ 2016-07-05 11:43                 ` Johannes Schindelin
  2016-07-05 12:59                   ` Jakub Narębski
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Schindelin @ 2016-07-05 11:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jakub Narębski, Stefan Beller,
	Linus Torvalds, Marc Strapetz, Git Mailing List

Hi Peff,

On Wed, 29 Jun 2016, Jeff King wrote:

> I haven't thought hard specifically about merge bases computation, so
> perhaps that is a case that isn't helped at all.

I guess it is not helped by generation numbers.

But then, we often ask: "is commit A an ancestor of commit B" e.g. to
check whether we can fast-forward. The way we do it now is to compute the
merge base (singular: if there are more than one, we stop at the first one
we found) and then look whether commit A is identical to the merge base.

If we had generation numbers available, then we would have to change those
computations in order to benefit from them when determining ancestry.

But then, reachability would accelerate that even more than generation
numbers.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-05 11:43                 ` Johannes Schindelin
@ 2016-07-05 12:59                   ` Jakub Narębski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2016-07-05 12:59 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King
  Cc: Junio C Hamano, Stefan Beller, Linus Torvalds, Marc Strapetz,
	Git Mailing List

W dniu 2016-07-05 o 13:43, Johannes Schindelin pisze:
> On Wed, 29 Jun 2016, Jeff King wrote:
> 
>> I haven't thought hard specifically about merge bases computation, so
>> perhaps that is a case that isn't helped at all.
> 
> I guess it is not helped by generation numbers.
> 
> But then, we often ask: "is commit A an ancestor of commit B" e.g. to
> check whether we can fast-forward. The way we do it now is to compute the
> merge base (singular: if there are more than one, we stop at the first one
> we found) and then look whether commit A is identical to the merge base.

I wonder if this query can be answered faster than finding the merge base
(the common ancestor) with Git core, and if it would be worth it to expose
this functionality to shell...

> If we had generation numbers available, then we would have to change those
> computations in order to benefit from them when determining ancestry.

Generation numbers (node level / topological rank) can help with such
query.  First, if level of A is greater than level of B, then A cannot
be an ancestor of B.  Second, when following from B we can prune path
if we get to node with level lower than A.  This is so called "level
filter" in literature.

FELINE indices cut search space even more... though I don't know if
they would help with finding common ancestors. Perhaps some other
technique would be better (taking into account Git use of EWAH bitmaps
for reachability of objects).

> 
> But then, reachability would accelerate that even more than generation
> numbers.

I wonder if Git uses bitmap indices here, if possible -- they are generated
sparsely.  They can help both in reachability queries (is A in reachability
of B, or in reachability of one of ancestors of B?) and in finding merge
bases (intersection of reachabilities of A and B, or their ancestors...
or something like that, I think, probably more complicated).

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-06-29 22:00           ` Jeff King
  2016-06-29 22:11             ` Junio C Hamano
  2016-06-30 10:30             ` Jakub Narębski
@ 2016-07-20  0:07             ` Jakub Narębski
  2016-07-20 13:02               ` Jeff King
  2 siblings, 1 reply; 31+ messages in thread
From: Jakub Narębski @ 2016-07-20  0:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

W dniu 2016-06-30 o 00:00, Jeff King pisze:
> On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:

>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>> limited to object counting?
>
> At GitHub we are using them for --contains analysis, along with mass
> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> plan is to send patches upstream, but they need some cleanup first.

Ping. have you got time to clean up those patches?

-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-20  0:07             ` Jakub Narębski
@ 2016-07-20 13:02               ` Jeff King
  2017-02-04 13:43                 ` Jakub Narębski
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2016-07-20 13:02 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git Mailing List

On Wed, Jul 20, 2016 at 02:07:38AM +0200, Jakub Narębski wrote:

> W dniu 2016-06-30 o 00:00, Jeff King pisze:
> > On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:
> 
> >> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
> >> limited to object counting?
> >
> > At GitHub we are using them for --contains analysis, along with mass
> > ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> > plan is to send patches upstream, but they need some cleanup first.
> 
> Ping. have you got time to clean up those patches?

No, I haven't. Don't hold your breath; it's something I hope to work on
in the next 6 months, not the next 6 weeks.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2016-07-20 13:02               ` Jeff King
@ 2017-02-04 13:43                 ` Jakub Narębski
  2017-02-17  9:26                   ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Narębski @ 2017-02-04 13:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On 20 July 2016 at 15:02, Jeff King <peff@peff.net> wrote:
> On Wed, Jul 20, 2016 at 02:07:38AM +0200, Jakub Narębski wrote:
>> W dniu 2016-06-30 o 00:00, Jeff King pisze:
>>> On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:
>>
>>>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>>>> limited to object counting?
>>>
>>> At GitHub we are using them for --contains analysis, along with mass
>>> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
>>> plan is to send patches upstream, but they need some cleanup first.
>>
>> Ping. have you got time to clean up those patches?
>
> No, I haven't. Don't hold your breath; it's something I hope to work on
> in the next 6 months, not the next 6 weeks.

Ping, Was there any progress on this front? It is now almost 6 months
later...

Could those patches be made available in a "dirty" form?

TIA,
-- 
Jakub Narębski


-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2017-02-04 13:43                 ` Jakub Narębski
@ 2017-02-17  9:26                   ` Jeff King
  2017-02-17  9:28                     ` Jakub Narębski
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2017-02-17  9:26 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git Mailing List

On Sat, Feb 04, 2017 at 02:43:01PM +0100, Jakub Narębski wrote:

> >>>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
> >>>> limited to object counting?
> >>>
> >>> At GitHub we are using them for --contains analysis, along with mass
> >>> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> >>> plan is to send patches upstream, but they need some cleanup first.
> >>
> >> Ping. have you got time to clean up those patches?
> >
> > No, I haven't. Don't hold your breath; it's something I hope to work on
> > in the next 6 months, not the next 6 weeks.
> 
> Ping, Was there any progress on this front? It is now almost 6 months
> later...
> 
> Could those patches be made available in a "dirty" form?

I just pushed them up to the "jk/ahead-behind" branch of
https://github.com/peff/git.

They were originally done on top of v1.9.1, but I forward-ported them to
the current "master" just now. The result compiles, but I haven't really
tested it extensively. Caveat applier.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: topological index field for commit objects
  2017-02-17  9:26                   ` Jeff King
@ 2017-02-17  9:28                     ` Jakub Narębski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Narębski @ 2017-02-17  9:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On 17 February 2017 at 10:26, Jeff King <peff@peff.net> wrote:
> On Sat, Feb 04, 2017 at 02:43:01PM +0100, Jakub Narębski wrote:
>
>> >>>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>> >>>> limited to object counting?
>> >>>
>> >>> At GitHub we are using them for --contains analysis, along with mass
>> >>> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
>> >>> plan is to send patches upstream, but they need some cleanup first.
>> >>
>> >> Ping. have you got time to clean up those patches?
>> >
>> > No, I haven't. Don't hold your breath; it's something I hope to work on
>> > in the next 6 months, not the next 6 weeks.
>>
>> Ping, Was there any progress on this front? It is now almost 6 months
>> later...
>>
>> Could those patches be made available in a "dirty" form?
>
> I just pushed them up to the "jk/ahead-behind" branch of
> https://github.com/peff/git.
>
> They were originally done on top of v1.9.1, but I forward-ported them to
> the current "master" just now. The result compiles, but I haven't really
> tested it extensively. Caveat applier.

Thanks a lot! I'll check this out.

-- 
Jakub Narebski

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-02-17  9:29 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-29 18:31 topological index field for commit objects Marc Strapetz
2016-06-29 18:59 ` Junio C Hamano
2016-06-29 20:20   ` Stefan Beller
2016-06-29 20:39     ` Junio C Hamano
2016-06-29 20:54       ` Stefan Beller
2016-06-29 21:37         ` Stefan Beller
2016-06-29 21:43           ` Jeff King
2016-06-29 20:56       ` Jeff King
2016-06-29 21:49         ` Jakub Narębski
2016-06-29 22:00           ` Jeff King
2016-06-29 22:11             ` Junio C Hamano
2016-06-29 22:30               ` Jeff King
2016-07-05 11:43                 ` Johannes Schindelin
2016-07-05 12:59                   ` Jakub Narębski
2016-06-30 10:30             ` Jakub Narębski
2016-06-30 18:12               ` Linus Torvalds
2016-06-30 23:39                 ` Jakub Narębski
2016-06-30 23:59                 ` Mike Hommey
2016-07-01  3:17                 ` Jeff King
2016-07-01  6:45                   ` Marc Strapetz
2016-07-01  9:48                   ` Jakub Narębski
2016-07-01 16:08                   ` Junio C Hamano
2016-07-01  6:54               ` Jeff King
2016-07-01  9:59                 ` Jakub Narębski
2016-07-20  0:07             ` Jakub Narębski
2016-07-20 13:02               ` Jeff King
2017-02-04 13:43                 ` Jakub Narębski
2017-02-17  9:26                   ` Jeff King
2017-02-17  9:28                     ` Jakub Narębski
2016-06-29 22:15       ` Marc Strapetz
2016-06-29 21:00   ` Jakub Narębski

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).