rev-list default order sometimes very slow (size and metadata dependent)

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* rev-list default order sometimes very slow (size and metadata dependent)
@ 2023-11-11 16:17 Kevin Lyles
  2023-11-13 20:39 ` Jeff King
  0 siblings, 1 reply; 2+ messages in thread
From: Kevin Lyles @ 2023-11-11 16:17 UTC (permalink / raw
  To: git@vger.kernel.org; +Cc: allred.sean@gmail.com


[-- Attachment #1.1: Type: text/plain, Size: 5028 bytes --]

Hello,

 

As part of my company's migration from SVN to git, we discovered a

performance issue in rev-list with large repositories.  The issue appears to

be metadata-dependent; we were able to work around it (completely avoiding

any performance penalty) by changing the date of certain commits.

 

The general structure of our repository is I think fairly normal (if large

-- we have >5.5 million commits total).  We have a handful of trunk

branches, and ~10k total refs.  To reduce the ref count (we hit other

performance issues when we had significantly more refs), we remove refs as

we're done with them.  Any code that doesn't make it into a trunk is

preserved in an archive branch.  The archive branch has no content, and

consists entirely of octopus merges with 50-500 parents.

 

If the archive branch is created with author/commit dates older than the

rest of the repository, we're able to run:

  $ git rev-list --count --all

in ~9-10 seconds on a mirror clone with a commit-graph.  However, if the

archive branch is instead created with author/commit dates newer than the

rest of the repository, it takes 4-5 minutes.

 

Using any order other than the default or --reverse removes the disparity.

All orders except --author-date-order bring things much closer to the ~9-10

seconds we see with the workaround, and --author-date-order is still under a

minute (though not by much).

 

System info from git bugreport:

  [System Info]

  git version:

  git version 2.42.0.windows.2

  cpu: x86_64

  built from commit: 2f819d1670fff9a1818f63b6722e9959405378e3

  sizeof-long: 4

  sizeof-size_t: 8

  shell-path: /bin/sh

  feature: fsmonitor--daemon

  uname: Windows 10.0 19044

  compiler info: gnuc: 13.2

  libc info: no libc information available

  $SHELL (typically, interactive shell): C:\Program Files\Git\usr\bin\bash.exe

 

  (no enabled hooks)

 

Note that we first realized this was an issue on our GitLab instance, which

runs on Linux, so this is not a Windows-specific bug.

 

I created a bash script to create very similar repositories that are/are not

affected by the issue; it follows.  The issue starts to become visible at 1

million commits (the default), where the difference is ~2x.  5 million

commits is roughly equivalent performance-wise to what we saw in our

repository, with a difference of ~33x.  Note that with 5 million commits,

each repository is ~1.2 GB and takes 7-8 minutes to create on an i9-9900

with NVMe storage.

 

Once you create a fast and a slow repo with the script, try the following

commands in each one:

  # Shows the performance difference

  $ time git rev-list --count –all

  # Shows very similar performance across both repos

  $ time git rev-list --count --all --topo-order

 

Thank you,

Kevin Lyles

klyles@epic.com <mailto:klyles@epic.com> 

----------------------------------------

 

#!/bin/bash

 

usage="Usage: $0 <destination folder> <--fast|--slow> [Number of commits (default: 1000000)]"

destinationFolder=${1:?$usage}

 

oldTimestamp=315554400 # 1980-01-01 midnight

newTimestamp=1672552800 # 2023-01-01 midnight

if [ "$2" == "--fast" ]

then

    archiveTimestamp=$oldTimestamp

elif [ "$2" == "--slow" ]

then

    archiveTimestamp=$newTimestamp

else

    echo "$usage" >&2

    exit 1

fi

 

numberOfCommits=${3:-1000000}

if ! [[ "$numberOfCommits" =~ ^[0-9]+$ ]]

then

    echo "$usage" >&2

    exit 1

fi

 

increment=$(( (newTimestamp - oldTimestamp) / (numberOfCommits + 2) ))

 

timestamp=$oldTimestamp

 

rm -rf "$destinationFolder"

git init "$destinationFolder"

 

echo "Fast-importing repo, please wait..."

{

    echo "feature done"

    echo "reset refs/heads/main"

    echo ""

 

    for count in $(seq "$numberOfCommits")

    do

        timestamp=$(( timestamp + increment ))

        echo "commit refs/heads/main"

        echo "mark :$count"

        echo "committer Test Test <test@test.com <mailto:test@test.com> > $timestamp -0500"

        echo "data <<|"

        echo "Main branch commit #$count"

        echo "|"

        echo ""

    done

 

    parentMark=0

    echo "reset refs/archive"

    for count in $(seq $(( numberOfCommits / 1000 )))

    do

        echo "commit refs/archive"

        echo "committer Test Test <test@test.com <mailto:test@test.com> > $archiveTimestamp -0500"

        echo "data <<|"

        echo "Archive branch commit #$count"

        echo "|"

        for parentCount in {1..50}

        do

            parentMark=$(( (parentMark + 99991) % numberOfCommits + 1 ))

            echo "merge :$parentMark"

        done

        echo ""

    done

 

    echo "done"

} | git -C "$destinationFolder" fast-import

 

git -C "$destinationFolder" commit-graph write


[-- Attachment #1.2: Type: text/html, Size: 12590 bytes --]

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 6034 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: rev-list default order sometimes very slow (size and metadata dependent)
  2023-11-11 16:17 rev-list default order sometimes very slow (size and metadata dependent) Kevin Lyles
@ 2023-11-13 20:39 ` Jeff King
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff King @ 2023-11-13 20:39 UTC (permalink / raw
  To: Kevin Lyles; +Cc: git@vger.kernel.org, allred.sean@gmail.com

On Sat, Nov 11, 2023 at 04:17:17PM +0000, Kevin Lyles wrote:

> If the archive branch is created with author/commit dates older than the
> rest of the repository, we're able to run:
> 
>   $ git rev-list --count --all
> 
> in ~9-10 seconds on a mirror clone with a commit-graph.  However, if the
> archive branch is instead created with author/commit dates newer than the
> rest of the repository, it takes 4-5 minutes.

Thanks for providing a reproduction script. I had trouble following the
text explanation, but I can easily reproduce the issue using your
script.

Running the slow case under perf, it looks like we are spending most of
the time in commit_list_insert_by_date(). Which makes sense. The queue
of commits to visit during the traversal is kept as an ordered linked
list that uses linear search for the insertion. So the worst case to
insert N commits is quadratic.

In practice, it tends not to be a big problem because we visit commits
in roughly reverse-timestamp order as we traverse. So it's more like
O(N*width), where "width" is the average number of simultaneous lines of
history. And your archive graph is...very wide.

Playing with the commit timestamps helps your case because it just
happens to reorder the queue in such a way that we put off looking at
those huge archive merges until much later in the traversal, when our
remaining "N" is much smaller.

The right solution is obviously to switch to a better data structure.
And some of the infrastructure is in place from the last time we dealt
with a similar case:

  https://lore.kernel.org/git/HE1PR0702MB3788FCDAB764252D9CBB42E5B0560@HE1PR0702MB3788.eurprd07.prod.outlook.com/

There we started using a priority queue instead of the linked list.
That patch stopped shorting of replacing the main revs->commits
traversal list, though, as it's referenced in a ton of places (some of
which really care about its list-like nature). Here's a hacky patch that
tries to lazily convert the list to a queue. It makes your case much
faster, but it also causes a few failures in the test suite (related to
commit ordering), so obviously it's violating some assumptions
somewhere. I'm not sure if it's a fruitful direction or not.

And as you noticed, using --topo-order does not suffer from the same
problem. It follows a completely different path, which...you guessed it,
uses a priority queue. I'm not sure what the current state of things is,
but one of the promises of commit-graphs is that with stored generation
numbers, we could compute topo-order incrementally and efficiently. So
another possible route is that we'd simply enable it by default. But I'm
not sure if there's more work that needs to be done to get there, or if
there are other downsides.

diff --git a/revision.c b/revision.c
index 00d5c29bfc..62d33dbfee 100644
--- a/revision.c
+++ b/revision.c
@@ -3119,6 +3119,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
+	clear_prio_queue(&revs->commitq);
 }

 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
@@ -4161,6 +4162,36 @@ static void track_linear(struct rev_info *revs, struct commit *commit)

 static struct commit *get_revision_1(struct rev_info *revs)
 {
+	/*
+	 * This is kind of hacky. We'll dynamically move commits from the
+	 * linear list over to our priority queue, which has better worst-case
+	 * behavior. Doing it here serves two purposes:
+	 *
+	 *   - we'll let all of the setup code still operate on the "commits"
+	 *     list, and then just move it over to the queue lazily. This
+	 *     probably could be done in prepare_revision_walk() or similar,
+	 *     but by doing it here we know we're catching all code paths.
+	 *
+	 *   - we'll likewise catch any code which tries to add new commits to
+	 *     the list _during_ the traversal. The main one would be
+	 *     process_parents() below, which we've taught to use the queue
+	 *     directly. But this will catch any other random bits of code,
+	 *     including callers of the revision API, which add to the
+	 *     traversal as they go.
+	 *
+	 * A cleaner solution would probably be to get rid of the "commits"
+	 * list entirely, but that's a much bigger task.
+	 */
+	if (revs->commits && !revs->topo_order) {
+		struct commit_list *p;
+		if (!revs->unsorted_input)
+			revs->commitq.compare = compare_commits_by_commit_date;
+		for (p = revs->commits; p; p = p->next)
+			prio_queue_put(&revs->commitq, p->item);
+		free_commit_list(revs->commits);
+		revs->commits = NULL;
+	}
+
 	while (1) {
 		struct commit *commit;

@@ -4169,7 +4200,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
 		else if (revs->topo_walk_info)
 			commit = next_topo_commit(revs);
 		else
-			commit = pop_commit(&revs->commits);
+			commit = prio_queue_get(&revs->commitq);

 		if (!commit)
 			return NULL;
@@ -4191,7 +4222,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
 				try_to_simplify_commit(revs, commit);
 			else if (revs->topo_walk_info)
 				expand_topo_walk(revs, commit);
-			else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
+			else if (process_parents(revs, commit, NULL, &revs->commitq) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
 						oid_to_hex(&commit->object.oid));
diff --git a/revision.h b/revision.h
index 94c43138bc..4301b825a0 100644
--- a/revision.h
+++ b/revision.h
@@ -12,6 +12,7 @@
 #include "ident.h"
 #include "list-objects-filter-options.h"
 #include "strvec.h"
+#include "prio-queue.h"

 /**
  * The revision walking API offers functions to build a list of revisions
@@ -124,6 +125,12 @@ struct rev_info {
 	struct object_array pending;
 	struct repository *repo;

+	/*
+	 * We'll use this queue during the actual traversal, rather than
+	 * adding and popping from the "commits" list.
+	 */
+	struct prio_queue commitq;
+
 	/* Parents of shown commits */
 	struct object_array boundary_commits;

^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-11-13 20:39 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-11 16:17 rev-list default order sometimes very slow (size and metadata dependent) Kevin Lyles
2023-11-13 20:39 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).