* rev-list default order sometimes very slow (size and metadata dependent)
@ 2023-11-11 16:17 Kevin Lyles
2023-11-13 20:39 ` Jeff King
0 siblings, 1 reply; 2+ messages in thread
From: Kevin Lyles @ 2023-11-11 16:17 UTC (permalink / raw
To: git@vger.kernel.org; +Cc: allred.sean@gmail.com
[-- Attachment #1.1: Type: text/plain, Size: 5028 bytes --]
Hello,
As part of my company's migration from SVN to git, we discovered a
performance issue in rev-list with large repositories. The issue appears to
be metadata-dependent; we were able to work around it (completely avoiding
any performance penalty) by changing the date of certain commits.
The general structure of our repository is I think fairly normal (if large
-- we have >5.5 million commits total). We have a handful of trunk
branches, and ~10k total refs. To reduce the ref count (we hit other
performance issues when we had significantly more refs), we remove refs as
we're done with them. Any code that doesn't make it into a trunk is
preserved in an archive branch. The archive branch has no content, and
consists entirely of octopus merges with 50-500 parents.
If the archive branch is created with author/commit dates older than the
rest of the repository, we're able to run:
$ git rev-list --count --all
in ~9-10 seconds on a mirror clone with a commit-graph. However, if the
archive branch is instead created with author/commit dates newer than the
rest of the repository, it takes 4-5 minutes.
Using any order other than the default or --reverse removes the disparity.
All orders except --author-date-order bring things much closer to the ~9-10
seconds we see with the workaround, and --author-date-order is still under a
minute (though not by much).
System info from git bugreport:
[System Info]
git version:
git version 2.42.0.windows.2
cpu: x86_64
built from commit: 2f819d1670fff9a1818f63b6722e9959405378e3
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
uname: Windows 10.0 19044
compiler info: gnuc: 13.2
libc info: no libc information available
$SHELL (typically, interactive shell): C:\Program Files\Git\usr\bin\bash.exe
(no enabled hooks)
Note that we first realized this was an issue on our GitLab instance, which
runs on Linux, so this is not a Windows-specific bug.
I created a bash script to create very similar repositories that are/are not
affected by the issue; it follows. The issue starts to become visible at 1
million commits (the default), where the difference is ~2x. 5 million
commits is roughly equivalent performance-wise to what we saw in our
repository, with a difference of ~33x. Note that with 5 million commits,
each repository is ~1.2 GB and takes 7-8 minutes to create on an i9-9900
with NVMe storage.
Once you create a fast and a slow repo with the script, try the following
commands in each one:
# Shows the performance difference
$ time git rev-list --count –all
# Shows very similar performance across both repos
$ time git rev-list --count --all --topo-order
Thank you,
Kevin Lyles
klyles@epic.com <mailto:klyles@epic.com>
----------------------------------------
#!/bin/bash
usage="Usage: $0 <destination folder> <--fast|--slow> [Number of commits (default: 1000000)]"
destinationFolder=${1:?$usage}
oldTimestamp=315554400 # 1980-01-01 midnight
newTimestamp=1672552800 # 2023-01-01 midnight
if [ "$2" == "--fast" ]
then
archiveTimestamp=$oldTimestamp
elif [ "$2" == "--slow" ]
then
archiveTimestamp=$newTimestamp
else
echo "$usage" >&2
exit 1
fi
numberOfCommits=${3:-1000000}
if ! [[ "$numberOfCommits" =~ ^[0-9]+$ ]]
then
echo "$usage" >&2
exit 1
fi
increment=$(( (newTimestamp - oldTimestamp) / (numberOfCommits + 2) ))
timestamp=$oldTimestamp
rm -rf "$destinationFolder"
git init "$destinationFolder"
echo "Fast-importing repo, please wait..."
{
echo "feature done"
echo "reset refs/heads/main"
echo ""
for count in $(seq "$numberOfCommits")
do
timestamp=$(( timestamp + increment ))
echo "commit refs/heads/main"
echo "mark :$count"
echo "committer Test Test <test@test.com <mailto:test@test.com> > $timestamp -0500"
echo "data <<|"
echo "Main branch commit #$count"
echo "|"
echo ""
done
parentMark=0
echo "reset refs/archive"
for count in $(seq $(( numberOfCommits / 1000 )))
do
echo "commit refs/archive"
echo "committer Test Test <test@test.com <mailto:test@test.com> > $archiveTimestamp -0500"
echo "data <<|"
echo "Archive branch commit #$count"
echo "|"
for parentCount in {1..50}
do
parentMark=$(( (parentMark + 99991) % numberOfCommits + 1 ))
echo "merge :$parentMark"
done
echo ""
done
echo "done"
} | git -C "$destinationFolder" fast-import
git -C "$destinationFolder" commit-graph write
[-- Attachment #1.2: Type: text/html, Size: 12590 bytes --]
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 6034 bytes --]
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: rev-list default order sometimes very slow (size and metadata dependent)
2023-11-11 16:17 rev-list default order sometimes very slow (size and metadata dependent) Kevin Lyles
@ 2023-11-13 20:39 ` Jeff King
0 siblings, 0 replies; 2+ messages in thread
From: Jeff King @ 2023-11-13 20:39 UTC (permalink / raw
To: Kevin Lyles; +Cc: git@vger.kernel.org, allred.sean@gmail.com
On Sat, Nov 11, 2023 at 04:17:17PM +0000, Kevin Lyles wrote:
> If the archive branch is created with author/commit dates older than the
> rest of the repository, we're able to run:
>
> $ git rev-list --count --all
>
> in ~9-10 seconds on a mirror clone with a commit-graph. However, if the
> archive branch is instead created with author/commit dates newer than the
> rest of the repository, it takes 4-5 minutes.
Thanks for providing a reproduction script. I had trouble following the
text explanation, but I can easily reproduce the issue using your
script.
Running the slow case under perf, it looks like we are spending most of
the time in commit_list_insert_by_date(). Which makes sense. The queue
of commits to visit during the traversal is kept as an ordered linked
list that uses linear search for the insertion. So the worst case to
insert N commits is quadratic.
In practice, it tends not to be a big problem because we visit commits
in roughly reverse-timestamp order as we traverse. So it's more like
O(N*width), where "width" is the average number of simultaneous lines of
history. And your archive graph is...very wide.
Playing with the commit timestamps helps your case because it just
happens to reorder the queue in such a way that we put off looking at
those huge archive merges until much later in the traversal, when our
remaining "N" is much smaller.
The right solution is obviously to switch to a better data structure.
And some of the infrastructure is in place from the last time we dealt
with a similar case:
https://lore.kernel.org/git/HE1PR0702MB3788FCDAB764252D9CBB42E5B0560@HE1PR0702MB3788.eurprd07.prod.outlook.com/
There we started using a priority queue instead of the linked list.
That patch stopped shorting of replacing the main revs->commits
traversal list, though, as it's referenced in a ton of places (some of
which really care about its list-like nature). Here's a hacky patch that
tries to lazily convert the list to a queue. It makes your case much
faster, but it also causes a few failures in the test suite (related to
commit ordering), so obviously it's violating some assumptions
somewhere. I'm not sure if it's a fruitful direction or not.
And as you noticed, using --topo-order does not suffer from the same
problem. It follows a completely different path, which...you guessed it,
uses a priority queue. I'm not sure what the current state of things is,
but one of the promises of commit-graphs is that with stored generation
numbers, we could compute topo-order incrementally and efficiently. So
another possible route is that we'd simply enable it by default. But I'm
not sure if there's more work that needs to be done to get there, or if
there are other downsides.
diff --git a/revision.c b/revision.c
index 00d5c29bfc..62d33dbfee 100644
--- a/revision.c
+++ b/revision.c
@@ -3119,6 +3119,7 @@ void release_revisions(struct rev_info *revs)
clear_decoration(&revs->treesame, free);
line_log_free(revs);
oidset_clear(&revs->missing_commits);
+ clear_prio_queue(&revs->commitq);
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
@@ -4161,6 +4162,36 @@ static void track_linear(struct rev_info *revs, struct commit *commit)
static struct commit *get_revision_1(struct rev_info *revs)
{
+ /*
+ * This is kind of hacky. We'll dynamically move commits from the
+ * linear list over to our priority queue, which has better worst-case
+ * behavior. Doing it here serves two purposes:
+ *
+ * - we'll let all of the setup code still operate on the "commits"
+ * list, and then just move it over to the queue lazily. This
+ * probably could be done in prepare_revision_walk() or similar,
+ * but by doing it here we know we're catching all code paths.
+ *
+ * - we'll likewise catch any code which tries to add new commits to
+ * the list _during_ the traversal. The main one would be
+ * process_parents() below, which we've taught to use the queue
+ * directly. But this will catch any other random bits of code,
+ * including callers of the revision API, which add to the
+ * traversal as they go.
+ *
+ * A cleaner solution would probably be to get rid of the "commits"
+ * list entirely, but that's a much bigger task.
+ */
+ if (revs->commits && !revs->topo_order) {
+ struct commit_list *p;
+ if (!revs->unsorted_input)
+ revs->commitq.compare = compare_commits_by_commit_date;
+ for (p = revs->commits; p; p = p->next)
+ prio_queue_put(&revs->commitq, p->item);
+ free_commit_list(revs->commits);
+ revs->commits = NULL;
+ }
+
while (1) {
struct commit *commit;
@@ -4169,7 +4200,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
else if (revs->topo_walk_info)
commit = next_topo_commit(revs);
else
- commit = pop_commit(&revs->commits);
+ commit = prio_queue_get(&revs->commitq);
if (!commit)
return NULL;
@@ -4191,7 +4222,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
try_to_simplify_commit(revs, commit);
else if (revs->topo_walk_info)
expand_topo_walk(revs, commit);
- else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
+ else if (process_parents(revs, commit, NULL, &revs->commitq) < 0) {
if (!revs->ignore_missing_links)
die("Failed to traverse parents of commit %s",
oid_to_hex(&commit->object.oid));
diff --git a/revision.h b/revision.h
index 94c43138bc..4301b825a0 100644
--- a/revision.h
+++ b/revision.h
@@ -12,6 +12,7 @@
#include "ident.h"
#include "list-objects-filter-options.h"
#include "strvec.h"
+#include "prio-queue.h"
/**
* The revision walking API offers functions to build a list of revisions
@@ -124,6 +125,12 @@ struct rev_info {
struct object_array pending;
struct repository *repo;
+ /*
+ * We'll use this queue during the actual traversal, rather than
+ * adding and popping from the "commits" list.
+ */
+ struct prio_queue commitq;
+
/* Parents of shown commits */
struct object_array boundary_commits;
^ permalink raw reply related [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-11-13 20:39 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-11 16:17 rev-list default order sometimes very slow (size and metadata dependent) Kevin Lyles
2023-11-13 20:39 ` Jeff King
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).