git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Pathological performance with git remote rename and many tracking refs
@ 2022-04-13 22:32 brian m. carlson
  2022-04-14  7:12 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2022-04-13 22:32 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1721 bytes --]

In my day-to-day work, I have the occasion to use GitHub Codespaces on a
repository with about 20,000 refs on the server.  The environment is set
up to pre-clone the repository, but I use a different default remote
name than "origin" ("def", to be particular), and thus, one of the things
I do when I set up that environment is to run "git remote rename origin
def".

This process takes 35 minutes, which is extremely pathological.  I
believe what's happening is that all of the refs are packed, and
renaming the ref causes a loose ref to be created and the old ref to be
deleted (necessitating a rewrite of the packed-refs file).  This is
essentially O(N^2) in the order of refs.

We recently added a --progress option, but I think this performance is
bad enough that that's not going to suffice here, and we should try to
do better.

I found that using "git for-each-ref" and "git update-ref --stdin" in a
pipeline to create and delete the refs as a single transaction takes a
little over 2 seconds.  This is greater than a 99.9% improvement and is
much more along the line of what I'd expect.

I thought about porting this code to use a ref transaction, but I
realized that we don't rename reflogs in that situation, which might be
a problem for some people.  In my case, since it's a freshly cloned repo
and the reflogs aren't interesting, I don't care.

I think a possible way forward may be to either teach ref transactions
about ref renames, or simply to add a --no-reflogs option, which omits
the reflogs in case the user doesn't care.  I'm interested to hear ideas
from others, though, about the best way forward.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Pathological performance with git remote rename and many tracking refs
  2022-04-13 22:32 Pathological performance with git remote rename and many tracking refs brian m. carlson
@ 2022-04-14  7:12 ` Ævar Arnfjörð Bjarmason
  2022-04-15  1:08   ` brian m. carlson
  0 siblings, 1 reply; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-14  7:12 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git


On Wed, Apr 13 2022, brian m. carlson wrote:

> [[PGP Signed Part:Undecided]]
> In my day-to-day work, I have the occasion to use GitHub Codespaces on a
> repository with about 20,000 refs on the server.  The environment is set
> up to pre-clone the repository, but I use a different default remote
> name than "origin" ("def", to be particular), and thus, one of the things
> I do when I set up that environment is to run "git remote rename origin
> def".

Aside from how we'd do renames with transactions, do you know about
clone.defaultRemoteName and --origin?

> This process takes 35 minutes, which is extremely pathological.  I
> believe what's happening is that all of the refs are packed, and
> renaming the ref causes a loose ref to be created and the old ref to be
> deleted (necessitating a rewrite of the packed-refs file).  This is
> essentially O(N^2) in the order of refs.
>
> We recently added a --progress option, but I think this performance is
> bad enough that that's not going to suffice here, and we should try to
> do better.
>
> I found that using "git for-each-ref" and "git update-ref --stdin" in a
> pipeline to create and delete the refs as a single transaction takes a
> little over 2 seconds.  This is greater than a 99.9% improvement and is
> much more along the line of what I'd expect.
>
> I thought about porting this code to use a ref transaction, but I
> realized that we don't rename reflogs in that situation, which might be
> a problem for some people.  In my case, since it's a freshly cloned repo
> and the reflogs aren't interesting, I don't care.

There was a (small) thread as a follow-up to that "rename --progress"
patch at the time, did you spot/read that?:
https://lore.kernel.org/git/220302.865yow6u8a.gmgdl@evledraar.gmail.com/

There's doubtless other previous discussions, I just haven't
found/remember them.

I have (briefly) tried hacking on this myself in the past, as anyone
who'll poke at that will no doubt find "branch rename" and "branch copy"
non-ref-transaction way of doing this are basically other callers with
the same problem.

Before I go any further I think it's good to know how far down this
particular rabbit hole you already are...

> I think a possible way forward may be to either teach ref transactions
> about ref renames, or simply to add a --no-reflogs option, which omits
> the reflogs in case the user doesn't care.  I'm interested to hear ideas
> from others, though, about the best way forward.

More generally, probably:

 1. Teach transactions about N operations on the same refname, which
    they'll currently die on, renames are one case.

 2. Be able to "hook in" to them, updating reflogs is one special-case,
    but we have the same inherent issue with updating config in lockstep
    with transactions.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Pathological performance with git remote rename and many tracking refs
  2022-04-14  7:12 ` Ævar Arnfjörð Bjarmason
@ 2022-04-15  1:08   ` brian m. carlson
  2022-04-15 12:26     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2022-04-15  1:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1472 bytes --]

On 2022-04-14 at 07:12:20, Ævar Arnfjörð Bjarmason wrote:
> Aside from how we'd do renames with transactions, do you know about
> clone.defaultRemoteName and --origin?

Yes, I do know about that.  However, in my case, the repository is
cloned before I even get a chance to touch it, so these options have no
effect.  My dotfiles aren't even on the machine at that point.

> There was a (small) thread as a follow-up to that "rename --progress"
> patch at the time, did you spot/read that?:
> https://lore.kernel.org/git/220302.865yow6u8a.gmgdl@evledraar.gmail.com/

Yeah, I remember reading it at the time.

> More generally, probably:
> 
>  1. Teach transactions about N operations on the same refname, which
>     they'll currently die on, renames are one case.
> 
>  2. Be able to "hook in" to them, updating reflogs is one special-case,
>     but we have the same inherent issue with updating config in lockstep
>     with transactions.

Yeah, that's one of the reasons I was thinking of implementing
--no-reflogs: because in that case, the operation really is a
create/delete operation and it doesn't require any additional logic in
the transaction.

My goal here is specifically not to rearchitect ref transactions and to
implement a relatively simple solution.  Your response is indicating to
me that updating reflogs is going to be the former and not the latter.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Pathological performance with git remote rename and many tracking refs
  2022-04-15  1:08   ` brian m. carlson
@ 2022-04-15 12:26     ` Ævar Arnfjörð Bjarmason
  2022-04-15 17:25       ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-15 12:26 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git


On Fri, Apr 15 2022, brian m. carlson wrote:

> [[PGP Signed Part:Undecided]]
> On 2022-04-14 at 07:12:20, Ævar Arnfjörð Bjarmason wrote:
>> Aside from how we'd do renames with transactions, do you know about
>> clone.defaultRemoteName and --origin?
>
> Yes, I do know about that.  However, in my case, the repository is
> cloned before I even get a chance to touch it, so these options have no
> effect.  My dotfiles aren't even on the machine at that point.

*nod*

>> There was a (small) thread as a follow-up to that "rename --progress"
>> patch at the time, did you spot/read that?:
>> https://lore.kernel.org/git/220302.865yow6u8a.gmgdl@evledraar.gmail.com/
>
> Yeah, I remember reading it at the time.
>
>> More generally, probably:
>> 
>>  1. Teach transactions about N operations on the same refname, which
>>     they'll currently die on, renames are one case.
>> 
>>  2. Be able to "hook in" to them, updating reflogs is one special-case,
>>     but we have the same inherent issue with updating config in lockstep
>>     with transactions.
>
> Yeah, that's one of the reasons I was thinking of implementing
> --no-reflogs: because in that case, the operation really is a
> create/delete operation and it doesn't require any additional logic in
> the transaction.
>
> My goal here is specifically not to rearchitect ref transactions and to
> implement a relatively simple solution.  Your response is indicating to
> me that updating reflogs is going to be the former and not the latter.

Yes, I think it would be nicer to solve the general problem, because
we'd otherwise end up with no-reflogs and have-reflogs paths through
these functions.

I'd also think you'd be stuck trying to instrument transactions in some
new way anyway, because wouldn't this thing be racy if reflogs for one
of the remotes appeared after you started you'd need to abort or deal
with them?

Or was the plan to:

 1. Create lock files for them all
 2. No reflogs?
 3. Rename..

Maybe that was what we did in any case, I'm rather rusty on the exact
procedure at the moment. I.e. did we do the mass-lock first?

Anyway.

I actually wonder if this wouldn't be that hard, because if we have any
reflogs we could simply "replay" them by turning (again, I'm a bit rusty
on the exact lock dance);

    lock(refs/remotes/origin/master);
    lock(refs/remotes/def/master);
    create(refs/remotes/def/master, refs/remotes/origin/master^{});
    delete(refs/remotes/origin/master);
    unlock(refs/remotes/origin/master);
    unlock(refs/remotes/def/master);

Into instead doing:

    lock(refs/remotes/origin/master);
    lock(refs/remotes/def/master);
    for from, to, msg ref_update(refs/remotes/origin/master):
    update(refs/remotes/def/master, from, to, msg);
    delete(refs/remotes/origin/master);
    unlock(refs/remotes/origin/master);
    unlock(refs/remotes/def/master);

I.e. simply replay each individual item in the reflog, basically using
the code in your builtin/stash.c series.

I didn't think of that before, and just thought of some
extra-transactional way to copy the reflog over as-is.

I think that multi-copy approach would run into
ref_update_reject_duplicates(), i.e. you can't have multiple updates
now, but that's a sanity assertion to make sure the transaction is
internally consistent.

In this case the N update() calls are semnatically the same as a
create() call in terms of aborting the transaction, i.e. we'd "fold"
them away, i.e. turn the sanity check into pretending N such updates
were the same as a single create(), so we'd delete on rollback.

Maybe that would work? Hrm...

Anyway, depending on the answer to if your --no-reflogs plan would have
any lock issues etc, perhaps an alternate plan would be:

    git remote rename --unsafe

Where we'd just "mv" the 1-2 relevant refs/remotes/X and
log/refs/remotes/X folders, followed by updating refs...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Pathological performance with git remote rename and many tracking refs
  2022-04-15 12:26     ` Ævar Arnfjörð Bjarmason
@ 2022-04-15 17:25       ` Junio C Hamano
  2022-04-16 11:23         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2022-04-15 17:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: brian m. carlson, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I actually wonder if this wouldn't be that hard, because if we have any
> reflogs we could simply "replay" them by turning (again, I'm a bit rusty
> on the exact lock dance);
>
>     lock(refs/remotes/origin/master);
>     lock(refs/remotes/def/master);
>     create(refs/remotes/def/master, refs/remotes/origin/master^{});

Why deref?  Trying to be prepared for seeing a symbolic ref, or
something?

>     delete(refs/remotes/origin/master);
>     unlock(refs/remotes/origin/master);
>     unlock(refs/remotes/def/master);
>
> Into instead doing:
>
>     lock(refs/remotes/origin/master);
>     lock(refs/remotes/def/master);
>     for from, to, msg ref_update(refs/remotes/origin/master):
>         update(refs/remotes/def/master, from, to, msg);
>     delete(refs/remotes/origin/master);
>     unlock(refs/remotes/origin/master);
>     unlock(refs/remotes/def/master);

Would this (or the above, for that matter) work well when renaming 'foo'
to 'foo/bar' (or the other way around), I wonder?

Anyway.

Is the reasoning behind "replay each and every reflog entry" because
you are trying to avoid depending on the implementation detail (i.e.
"R=.git/logs/refs/remotes; mv $R/origin/master $R/def/master" works
only with the files backend)?  Unless we are willing to add a new
ref backend method to help this natively, it would be a workable but
an ultra-slow way to do so, as it would involve open, write, fsync,
and close for each reflog entry.

    ... goes and looks ...

Hmph, I am utterly confused.  refs/files-backend.c eventually
dispatches to files_copy_or_rename_ref() when rename_ref method in
ref_storage_be is used on the files backend, and the function
already renames the reflog file without having to copy entry by
entry.  We cannot just open a transaction, run many rename_refs and
close it?


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Pathological performance with git remote rename and many tracking refs
  2022-04-15 17:25       ` Junio C Hamano
@ 2022-04-16 11:23         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-16 11:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, git


On Fri, Apr 15 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> I actually wonder if this wouldn't be that hard, because if we have any
>> reflogs we could simply "replay" them by turning (again, I'm a bit rusty
>> on the exact lock dance);
>>
>>     lock(refs/remotes/origin/master);
>>     lock(refs/remotes/def/master);
>>     create(refs/remotes/def/master, refs/remotes/origin/master^{});
>
> Why deref?  Trying to be prepared for seeing a symbolic ref, or
> something?

I just meant "$(git rev-parse $NAME)", sorry about being unclear. It
would work for branches, but yeah, for annotated tag objects it would be
the wrong thing.

>>     delete(refs/remotes/origin/master);
>>     unlock(refs/remotes/origin/master);
>>     unlock(refs/remotes/def/master);
>>
>> Into instead doing:
>>
>>     lock(refs/remotes/origin/master);
>>     lock(refs/remotes/def/master);
>>     for from, to, msg ref_update(refs/remotes/origin/master):
>>         update(refs/remotes/def/master, from, to, msg);
>>     delete(refs/remotes/origin/master);
>>     unlock(refs/remotes/origin/master);
>>     unlock(refs/remotes/def/master);
>
> Would this (or the above, for that matter) work well when renaming 'foo'
> to 'foo/bar' (or the other way around), I wonder?

Hrm, probably not. Also for that we'll have tricky collisions with the
2nd level (master) v.s. 1st (origin/) won't we?

> Is the reasoning behind "replay each and every reflog entry" because
> you are trying to avoid depending on the implementation detail (i.e.
> "R=.git/logs/refs/remotes; mv $R/origin/master $R/def/master" works
> only with the files backend)?

It's a suggested workaround around the ref backend not understanding how
to "plug in" someting like a "mv", i.e. we can represent this sort of
thing as a sequence of normal ref operations.

> Unless we are willing to add a new
> ref backend method to help this natively, it would be a workable but
> an ultra-slow way to do so, as it would involve open, write, fsync,
> and close for each reflog entry.
>
>     ... goes and looks ...

It's far from optimal, but I think the main slowdown is due to the
(re)writing of the packed refs file, whereas the churn of doing a bunch
of locks, append-only writes etc. should be small by comparison, I
haven't benchmarked it though...

Also for my own reflogs (especially of remotes) some have very busy
logs, and some just 1-2 entries...

> Hmph, I am utterly confused.  refs/files-backend.c eventually
> dispatches to files_copy_or_rename_ref() when rename_ref method in
> ref_storage_be is used on the files backend, and the function
> already renames the reflog file without having to copy entry by
> entry.  We cannot just open a transaction, run many rename_refs and
> close it?

The whole problem (such as it is) is that that isn't part of the
transaction mechanism.

I.e. all of the logic to make this failure tolerant is missing, you're
renaming N refs and their logs, and we fail in the middle of it, does it
roll back to a consistent state? (it doesn't).

Of course this code could be made to handle such failures, address
consistency issues etc., but doing so would essentially be a
re-implementation of transactions.

So I'm wondering (perhaps unproductively) if having some way to
piggy-back the mechanism into the transaction mechanism in an easy way
would be a good way to do it...

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-16 11:37 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-13 22:32 Pathological performance with git remote rename and many tracking refs brian m. carlson
2022-04-14  7:12 ` Ævar Arnfjörð Bjarmason
2022-04-15  1:08   ` brian m. carlson
2022-04-15 12:26     ` Ævar Arnfjörð Bjarmason
2022-04-15 17:25       ` Junio C Hamano
2022-04-16 11:23         ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).