git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Taylor Blau <me@ttaylorr.com>, Derrick Stolee <stolee@gmail.com>,
	Jeff King <peff@peff.net>,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v4 1/2] commit-graph: ignore duplicates when merging layers
Date: Fri, 09 Oct 2020 20:53:51 +0000	[thread overview]
Message-ID: <9279aea3ef7aa98996e182c0d1bf41cc179b8146.1602276832.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.747.v4.git.1602276832.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c                | 16 +++++++++++++---
 t/t5324-split-commit-graph.sh | 12 ++++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..0280dcb2ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2008,7 +2008,7 @@ static int commit_compare(const void *_a, const void *_b)
 
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
-	uint32_t i;
+	uint32_t i, dedup_i = 0;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			ctx->commits.list[dedup_i] = ctx->commits.list[i];
+			dedup_i++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	ctx->commits.nr = dedup_i;
+
 	stop_progress(&ctx->progress);
 }
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c334ee9155..a314ce0368 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -440,4 +440,16 @@ test_expect_success '--split=replace with partial Bloom data' '
 	verify_chain_files_exist $graphdir
 '
 
+test_expect_success 'prevent regression for duplicate commits across layers' '
+	git init dup &&
+	git -C dup config core.commitGraph false &&
+	git -C dup commit --allow-empty -m one &&
+	git -C dup commit-graph write --split=no-merge --reachable &&
+	git -C dup commit --allow-empty -m two &&
+	git -C dup commit-graph write --split=no-merge --reachable &&
+	git -C dup commit --allow-empty -m three &&
+	git -C dup commit-graph write --split --reachable &&
+	git -C dup commit-graph verify
+'
+
 test_done
-- 
gitgitgadget


  reply	other threads:[~2020-10-09 20:54 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-08 13:56 [PATCH] commit-graph: ignore duplicates when merging layers Derrick Stolee via GitGitGadget
2020-10-08 14:15 ` Taylor Blau
2020-10-08 14:29   ` Derrick Stolee
2020-10-08 14:59 ` [PATCH v2] " Derrick Stolee via GitGitGadget
2020-10-08 15:04   ` [PATCH v3] " Derrick Stolee via GitGitGadget
2020-10-08 15:53     ` Jeff King
2020-10-08 16:26       ` Derrick Stolee
2020-10-08 16:42         ` Taylor Blau
2020-10-08 16:43         ` Jeff King
2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
2020-10-09 20:53       ` Derrick Stolee via GitGitGadget [this message]
2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
2020-10-09 21:12         ` Junio C Hamano
2020-10-09 21:17         ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9279aea3ef7aa98996e182c0d1bf41cc179b8146.1602276832.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).