git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Inconsistencies in commit-graph technical docs.
@ 2018-06-28  5:11 Grant Welch
  2018-06-28 12:49 ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Welch @ 2018-06-28  5:11 UTC (permalink / raw)
  To: git

I recently read the "Supercharging the Git Commit Graph" blog by
Derrick Stolee. I found the article interesting and wanted to verify
the performance numbers for myself. Then that led me to want to know
more about the implementation, so I read the technical design notes in
commit-graph.txt, and then I jumped into the format documentation in
commit-graph-format.txt.

Along the way, I noticed a few issues. They might just be errors in
the documentation, but I figured it was worth documenting my entire
process just to be sure.

"Supercharging the Git Commit Graph", by Derrick Stolee:
  https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/

# TL;DR

I found a few discrepencies between the documentation in
commit-graph-format.txt and the results that I observed for myself.

1. The "Commit Data" chunk signature is documented to be 'CGET', but
it should be 'CDAT'.

commit-graph.c:18
  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */

2. The "no parent" value is documented to be 0xffffffff, but is
actually 0x70000000.

commit-graph.c:34
  #define GRAPH_PARENT_NONE 0x70000000

3. The "generation" field isn't set on any of the commits. (I don't
know what to make of this.)

# Documented Process

Just to follow along, here are the details about my local repo. I
don't have any local changes; I just track upstream.

repo:    git/git
remotes: git://git.kernel.org/pub/scm/git/git.git (origin)
branch:  master
commit:  ed843436dd4924c10669820cc73daf50f0b4dabd

os:      MacOS 10.11.6 (El Capitan)
git:     2.18.0

Just to be clear, I am following the details as described in:
  /Documentation/technical/commit-graph-format.txt

# Create the commit-graph File

As a setup step, I have to prepare the commit-graph file as described
in the blog post.

  $ git show-ref -s | git commit-graph write --stdin-commits

# Verify SHA1 Checksum

Starting at the end first because like most (all?) of git's binary
formats, there is a SHA1 checksum that verifies the integrity of the
file. Everything checks out fine.

  $ < .git/objects/info/commit-graph tail -c 20 | xxd -p
  35507183b1e9546f9e4844f0796d630daebc85d8

  $ < .git/objects/info/commit-graph wc -c
  3037156

  $ < .git/objects/info/commit-graph head -c $((3037156-20)) | sha1sum
  35507183b1e9546f9e4844f0796d630daebc85d8  -

# Header

The header looks good:
* signature:   CGPH
* version:     1
* hash:        1 (SHA1)
* "chunks":    4
* "reserved":  (ignored)

  $ < .git/objects/info/commit-graph xxd -l8
  00000000: 4347 5048 0101 0400                      CGPH....

# Chunk Lookup, Problem: CGET/CDAT

I have all of the documented "chunks". but it appears, that the
documentation is wrong for the "Commit Data" chunk. The document
signature is 'CGET', but I'm getting 'CDAT'.

* OIDF, Object ID Fanout:   0x44
* OIDL, Object ID Lookup:   0x444
* CDAT, Commit Data:        0x108f58
* EDGE, Large Edge List:    0x2e577c
*     , terminating label:  0x2e57d0

  $ < .git/objects/info/commit-graph xxd -s8 -l$(((4+1)*12)) -c12 -g4
  00000008: 4f494446 00000000 00000044  OIDF.......D
  00000014: 4f49444c 00000000 00000444  OIDL.......D
  00000020: 43444154 00000000 00108f58  CDAT.......X
  0000002c: 45444745 00000000 002e567c  EDGE......V|
  00000038: 00000000 00000000 002e57d0  ..........W.

# OID Fanout

Not much to tell here, but since it's important, I'll just show last
line so know how many commits are recorded in the file. Everything
checks out that I have 54209 commits in my local repo.

  $ < .git/objects/info/commit-graph xxd -s0x44 -l$((256*4)) -c4 -g4
  ...
  00000440: 0000d3c1

  $ echo $((0x0000d3c1))
  54209

  $ git rev-list --all | wc -l
    54209

# OID Lookup

Again, not much to see here; just an alphanumeric, sequential list off
all the commit ids.

  $ < .git/objects/info/commit-graph xxd -s0x444 -l$((54209*20)) -c20 -g20
  ...
  00108f44: fffe694d607ea683b5d08ee99a46d9b06cb74006  ..iM`~.......F..l.@.

  $ git cat-file -p fffe694d607ea683b5d08ee99a46d9b06cb74006
  tree 1241274386e9d5dabaf370477ff19ba0eb36c3c2
  parent 84dee6bbc9eb6dd8e613414ed0271d8290549e89
  author Eric Wong <normalperson@yhbt.net> 1168151155 -0800
  committer Junio C Hamano <junkio@cox.net> 1168152528 -0800
  ...

# Commit Data

This is where I start to see inconsistencies with the documented format.

  $ < .git/objects/info/commit-graph xxd -s0x108f58
-l$((54209*(20+16))) -c36 -g4
  ...
  002e5658: 12412743 86e9d5da baf37047 7ff19ba0 eb36c3c2 00006dea
70000000 00000000 45a097d0

From the above 'git cat-file' command, we can confirm the tree is correct.

tree:        1241274386e9d5dabaf370477ff19ba0eb36c3c2
parent1:     00006dea
parent2:     70000000
generation:  00000000
timestamp:   45a097d0

As does the timestamp.

  $ date --date @$((0x45a097d0)) +%s
  1168152528

Parent #1 looks good as well.

  < .git/objects/info/commit-graph xxd -s$((0x444+(0x00006dea*20)))
-l20 -c20 -g20
  00089a8c: 84dee6bbc9eb6dd8e613414ed0271d8290549e89

## Problem: Parent #2

Parent #2 (70000000), however appears to be a mystery value. The value
is way too high to actually be in the OID Lookup table. And, the only
documented magic numbers are 0xffffffff (no parent) and
0x80000000|0x??????? (octopus merge; see: Last Edge List).

## Problem: Generation

Also, none of these objects have their 'generation' field set.

  $ < .git/objects/info/commit-graph xxd -s0x108f58
-l$((54209*(20+16))) -c36 -g4 | grep 00000000 | wc -l
    54209

## Check Octopus Merges

The final "chunk", Large Edge List, is for handling the extra data
from octopus merge commits. Since I can't conceive of a fancy xxd
command to that could list them, I have to go the long way round to
find an example of one.

  $ git rev-list --all --min-parents=3 | wc -l
      35

  $ git rev-list --all --min-parents=3
  25bf951381a4880c43a3d1c65e6dce651e61148f
  ...

  $ < .git/objects/info/commit-graph xxd -s0x444 -l$((54209*20)) -c20
-g20 | grep 25bf951381a4880c43a3d1c65e6dce651e61148f
  0002706c: 25bf951381a4880c43a3d1c65e6dce651e61148f  %.......C...^m.e.a..

  $ echo $(((0x0002706c-0x444)/20))
  7938

  $ < .git/objects/info/commit-graph xxd -s$((0x444+(7938*20))) -l20 -c20 -g20
  0002706c: 25bf951381a4880c43a3d1c65e6dce651e61148f  %.......C...^m.e.a..

  $ < .git/objects/info/commit-graph xxd -s$((0x108f58+(7938*36))) -l36 -c36 -g4
  0014eba0: ce9488df 264bc29a 37b47932 0fad03c9 3a470db7 000004c7
80000011 00000000 594dbf30

  $ git dump 25bf951381a4880c43a3d1c65e6dce651e61148f
  tree ce9488df264bc29a37b479320fad03c93a470db7
  parent 05ec6e13aaf33b6a647e1321203a770e697eea9a
  parent a84f3e59ebde9e891275ef8325c432db6bdf950c
  parent dc8441fdb4598f54865a5c783d1f86c1e0bcb6dc
  author Junio C Hamano <gitster@pobox.com> 1498083644 -0700
  committer Junio C Hamano <gitster@pobox.com> 1498267440 -0700

  $ < .git/objects/info/commit-graph xxd -s$((0x444+(0x000004c7*20)))
-l20 -c20 -g20
  000063d0: 05ec6e13aaf33b6a647e1321203a770e697eea9a

In the above commands, I was able to find a merge commit with more
than 2 parents using git-rev-list. From there, I was able to find that
commit in the "OID Lookup" table and determine it's position. With
that position, I was able to find the "Commit Data" associated with
it. Finally, the git-cat-file confirms that the tree is correct and
that the first parent commit is correct (after performing another OID
lookup).

Now it's just a matter of reviewing the last chunk to learn more about 80000011.

## Naive Solution for Finding Octopus Merges

In a repo with very few octopus merges, I would expect this to work
100% of the time. However, I doubt that it will work for a repo like
linux/linux that have a lot of commits that are merged through octopus
merges.

This assumes that '80000' won't occur in the tree OID, parent1,
generation, nor the timestamp. Seems like a fair bet, but qualifies as
naive.

I tested using '8000', but it created too many false positives. For
git/git, '800000' works just as well as '80000', but would more likely
to under report elsewhere.

  $ < .git/objects/info/commit-graph xxd -s0x108f58
-l$((54209*(20+16))) -c36 -g4 | grep 80000 | wc -l
      35

  $ < .git/objects/info/commit-graph xxd -s0x108f58
-l$((54209*(20+16))) -c36 -g4 | grep 80000
  ...
  0014eba0: ce9488df 264bc29a 37b47932 0fad03c9 3a470db7 000004c7
80000011 00000000 594dbf30
  ...

# Last Edge List

Finally, I can get a list of all the commits that have been a
parent2-N on an octopus merge.

  $ < .git/objects/info/commit-graph xxd -s0x2e567c
-l$((0x2e57d0-0x2e567c)) -c4 -g4 | wc -l
      85

And, I should be able to confirm the other parents from the example
above. In the following command '0x11' comes from 80000011, '4' is the
number of "bytes" that are used to for OID Lookup, and '2' for the
number of N-1 parents that were on the octopus merge.

  $ < .git/objects/info/commit-graph xxd -s$((0x2e567c+0x11*4))
-l$((4*2)) -c4 -g4
  002e56c0: 00008ac3  ....
  002e56c4: 8000b603  ....

Which works out perfectly.

  $ < .git/objects/info/commit-graph xxd -s$((0x444+(0x00008ac0*20)))
-l20 -c20 -g20
  000adb44: a849d36cf2877a1890371851710382f463290978  .I.l..z..7.Qq...c).x

  $ < .git/objects/info/commit-graph xxd -s$((0x444+(0x0000b603*20)))
-l20 -c20 -g20
  000e3c80: dc8441fdb4598f54865a5c783d1f86c1e0bcb6dc  ..A..Y.T.Z\x=.......

# terminating label

With the 'terminating' chunk, we can see the SHA1 checksum which I
already covered at the beginning.

  $ < .git/objects/info/commit-graph xxd -s0x2e57d0 -g20 -c20
  002e57d0: 35507183b1e9546f9e4844f0796d630daebc85d


--
-grant welch
-gwelch925@gmail.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Inconsistencies in commit-graph technical docs.
  2018-06-28  5:11 Inconsistencies in commit-graph technical docs Grant Welch
@ 2018-06-28 12:49 ` Derrick Stolee
  2018-06-28 12:52   ` [PATCH] commit-graph: fix documentation inconsistencies Derrick Stolee
  2018-06-28 16:54   ` Inconsistencies in commit-graph technical docs Grant Welch
  0 siblings, 2 replies; 4+ messages in thread
From: Derrick Stolee @ 2018-06-28 12:49 UTC (permalink / raw)
  To: Grant Welch, git

On 6/28/2018 1:11 AM, Grant Welch wrote:
> I recently read the "Supercharging the Git Commit Graph blog by
> Derrick Stolee. I found the article interesting and wanted to verify
> the performance numbers for myself. Then that led me to want to know
> more about the implementation, so I read the technical design notes in
> commit-graph.txt, and then I jumped into the format documentation in
> commit-graph-format.txt.
>
> Along the way, I noticed a few issues. They might just be errors in
> the documentation, but I figured it was worth documenting my entire
> process just to be sure.
>
> "Supercharging the Git Commit Graph", by Derrick Stolee:
>    https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/
>
> # TL;DR
>
> I found a few discrepencies between the documentation in
> commit-graph-format.txt and the results that I observed for myself.
>
> 1. The "Commit Data" chunk signature is documented to be 'CGET', but
> it should be 'CDAT'.
>
> commit-graph.c:18
>    #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */

Thanks for catching this! Thankfully, this is an easy fix, as we only 
need to update the documentation.

> 2. The "no parent" value is documented to be 0xffffffff, but is
> actually 0x70000000.
>
> commit-graph.c:34
>    #define GRAPH_PARENT_NONE 0x70000000

This is a more important mistake, as it affects the data that was 
written in the commit-graph file.

Part of the problem is that leading hex digit of 0x7 which in binary is 
0b0111. We already designed a limit of at most 2^{31}-1 (~2.1 billion) 
commits in the commit-graph because of the way we track octopus edges, 
but this mistake has cost us more: we cannot store more than ~1.8 
billion commits.

I'm sorry for this mixup, mostly because it is aesthetically unpleasant. 
Those extra 300 million commits mean less to me than having a clean file 
format.

> 3. The "generation" field isn't set on any of the commits. (I don't
> know what to make of this.)

This is a difference between 2.18 and current 'master', which merged 
ds/generation-numbers. Commit-graphs written with Git 2.18 have all 
generation numbers listed as GENERATION_NUMBER_ZERO (0), which lets 
future versions know that the generation number is not computed yet, so 
the next commit-graph write will compute the correct generation number.

I'll send a patch soon fixing these doc issues.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] commit-graph: fix documentation inconsistencies
  2018-06-28 12:49 ` Derrick Stolee
@ 2018-06-28 12:52   ` Derrick Stolee
  2018-06-28 16:54   ` Inconsistencies in commit-graph technical docs Grant Welch
  1 sibling, 0 replies; 4+ messages in thread
From: Derrick Stolee @ 2018-06-28 12:52 UTC (permalink / raw)
  To: git@vger.kernel.org
  Cc: gitster@pobox.com, gwelch925@gmail.com, Derrick Stolee

The commit-graph feature shipped in Git 2.18 has some inconsistencies in
the constants used by the implementation and specified by the format
document.

The commit data chunk uses the key "CDAT" in the file format, but was
previously documented to say "CGET".

The commit data chunk stores commit parents using two 32-bit fields that
typically store the integer position of the parent in the list of commit
ids within the commit-graph file. When a parent does not exist, we had
documented the value 0xffffffff, but implemented the value 0x70000000.
This swap is easy to correct in the documentation, but unfortunately
reduces the number of commits that we can store in the commit-graph.
Update that estimate, too.

Reported-by: Grant Welch <gwelch925@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index ad6af8105c..91c2fbd86a 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -18,9 +18,9 @@ metadata, including:
   the graph file.
 
 These positional references are stored as unsigned 32-bit integers
-corresponding to the array position withing the list of commit OIDs. We
-use the most-significant bit for special purposes, so we can store at most
-(1 << 31) - 1 (around 2 billion) commits.
+corresponding to the array position withing the list of commit OIDs. Due
+to some special constants we use to track parents, we can store at most
+(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits.
 
 == Commit graph files have the following format:
 
@@ -70,10 +70,10 @@ CHUNK DATA:
   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
       The OIDs for all commits in the graph, sorted in ascending order.
 
-  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)
     * The first H bytes are for the OID of the root tree.
     * The next 8 bytes are for the positions of the first two parents
-      of the ith commit. Stores value 0xffffffff if no parent in that
+      of the ith commit. Stores value 0x7000000 if no parent in that
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Large Edge List chunk.

base-commit: ed843436dd4924c10669820cc73daf50f0b4dabd
-- 
2.18.0.24.g1b579a2ee9


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Inconsistencies in commit-graph technical docs.
  2018-06-28 12:49 ` Derrick Stolee
  2018-06-28 12:52   ` [PATCH] commit-graph: fix documentation inconsistencies Derrick Stolee
@ 2018-06-28 16:54   ` Grant Welch
  1 sibling, 0 replies; 4+ messages in thread
From: Grant Welch @ 2018-06-28 16:54 UTC (permalink / raw)
  To: stolee; +Cc: git

Thanks for the quick response and for the patch.

I started writing such a long process document because I _thought_
that I found a major issue with the Large Edge List. But, in the end,
it was user error. It turned out that I was looking at index '11' when
I should have been looking at index '0x11' ('17'). Whoops!
On Thu, Jun 28, 2018 at 5:49 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/28/2018 1:11 AM, Grant Welch wrote:
> > I recently read the "Supercharging the Git Commit Graph blog by
> > Derrick Stolee. I found the article interesting and wanted to verify
> > the performance numbers for myself. Then that led me to want to know
> > more about the implementation, so I read the technical design notes in
> > commit-graph.txt, and then I jumped into the format documentation in
> > commit-graph-format.txt.
> >
> > Along the way, I noticed a few issues. They might just be errors in
> > the documentation, but I figured it was worth documenting my entire
> > process just to be sure.
> >
> > "Supercharging the Git Commit Graph", by Derrick Stolee:
> >    https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/
> >
> > # TL;DR
> >
> > I found a few discrepencies between the documentation in
> > commit-graph-format.txt and the results that I observed for myself.
> >
> > 1. The "Commit Data" chunk signature is documented to be 'CGET', but
> > it should be 'CDAT'.
> >
> > commit-graph.c:18
> >    #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>
> Thanks for catching this! Thankfully, this is an easy fix, as we only
> need to update the documentation.
>
> > 2. The "no parent" value is documented to be 0xffffffff, but is
> > actually 0x70000000.
> >
> > commit-graph.c:34
> >    #define GRAPH_PARENT_NONE 0x70000000
>
> This is a more important mistake, as it affects the data that was
> written in the commit-graph file.
>
> Part of the problem is that leading hex digit of 0x7 which in binary is
> 0b0111. We already designed a limit of at most 2^{31}-1 (~2.1 billion)
> commits in the commit-graph because of the way we track octopus edges,
> but this mistake has cost us more: we cannot store more than ~1.8
> billion commits.
>
> I'm sorry for this mixup, mostly because it is aesthetically unpleasant.
> Those extra 300 million commits mean less to me than having a clean file
> format.
>
> > 3. The "generation" field isn't set on any of the commits. (I don't
> > know what to make of this.)
>
> This is a difference between 2.18 and current 'master', which merged
> ds/generation-numbers. Commit-graphs written with Git 2.18 have all
> generation numbers listed as GENERATION_NUMBER_ZERO (0), which lets
> future versions know that the generation number is not computed yet, so
> the next commit-graph write will compute the correct generation number.
>
> I'll send a patch soon fixing these doc issues.
>
> Thanks,
> -Stolee



--
-grant welch
-gwelch925@gmail.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-06-28 16:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-28  5:11 Inconsistencies in commit-graph technical docs Grant Welch
2018-06-28 12:49 ` Derrick Stolee
2018-06-28 12:52   ` [PATCH] commit-graph: fix documentation inconsistencies Derrick Stolee
2018-06-28 16:54   ` Inconsistencies in commit-graph technical docs Grant Welch

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).