git@vger.kernel.org mailing list mirror (one of many)
* 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
@ 2017-04-27 16:36 Robert Stryker
  2017-04-27 20:09 ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Robert Stryker @ 2017-04-27 16:36 UTC
  To: git

Hi all:

The following script attempts to merge 4 git repos into one,
maintaining tag and branch content (but not SHAs). Each original repo
basically gets its own subfolder in the new one. Original repos are
first rewritten to have their history think they always belonged in
the target subfolder.

The problem: the script takes 30 minutes in one environment with git
2.7.4, and generates a repo of about 30 MB. When run by a coworker
using git 2.9.3, it takes 22+ hours and generates a 10 GB repo.

Clearly something here is very wrong. Either there's a pretty horrible
regression or my idea is a pretty bad one ;)

General process for the script:
  - check out 4 repos
  - rewrite their history so they always thought they were in a subfolder
  - copy these 4 rewritten folders to a temporary location
  - get a list of branches and tags for each of the 4 repos
  - initialize a new repo with a readme.md
  - for each unique tag
       - check the 4 rewritten / backed up repos for the tag
       - for each of the 4 rewritten repos:
            - if the tag exists in that repo, merge it into the new
              repo in a test branch
            - git pull --no-edit ../intermediate/oneRewrittenRepo    (SLOW PART)
       - save the tag
  - for each unique branch (same logic)

So... yeah... 30mb + 30 minutes -> 11gb + 22 hours somewhere between
these two versions of git?

According to coworker:

during each pass of the Tags' loop it's sitting for a long time on:

 git pull --no-edit ../intermediate/webtools.common

which runs in its turn

git fetch --update-head-ok ../intermediate/webtools.common

which in its turn runs

git-upload-pack ../intermediate/webtools.common


Any ideas here are much appreciated =/

The Script in question is here:
   https://gist.github.com/robstryker/4854fc86ab3714a5e1af353b98cbc768


* Re: 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
  2017-04-27 16:36 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3 Robert Stryker
@ 2017-04-27 20:09 ` Jeff King
  2017-04-27 20:42   ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2017-04-27 20:09 UTC
  To: Robert Stryker; +Cc: git

On Thu, Apr 27, 2017 at 12:36:54PM -0400, Robert Stryker wrote:

> The problem: the script takes 30 minutes in one environment with git
> 2.7.4, and generates a repo of about 30 MB. When run by a coworker
> using git 2.9.3, it takes 22+ hours and generates a 10 GB repo.
> 
> Clearly something here is very wrong. Either there's a pretty horrible
> regression or my idea is a pretty bad one ;)

The large size makes me think that you're getting an auto-gc in the
middle that is exploding the unreachable objects into loose storage.
This can happen when objects are ready to be pruned, but Git holds on to
them for a grace period (2 weeks by default) as a precaution against
simultaneous use.

Try doing:

  git config gc.auto 0

in the repositories before the slow step. Or alternatively, try:

  git config gc.pruneExpire now

which will continue to do the auto-gc, but throw away unreachable
objects immediately.

Or alternatively, we're failing to run gc at all and just getting tons
of loose objects that need to be packed. What does running "git gc --auto" say
if you run it in the slow repository? Does it improve the disk space
problem?
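
A minimal sketch of the checks and settings above, assuming a throwaway
repository created with mktemp (in practice this would be whichever
repository the slow pull writes into). The two config lines are
alternatives; both appear only so that their syntax is visible.

```shell
# Throwaway repository for illustration; in practice, cd into the
# repository that the slow "git pull" step writes into.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Diagnosis: let gc decide whether housekeeping is needed, then look at
# the loose ("count", "size") and packed ("in-pack", "size-pack") totals.
# Sizes are reported in KiB.
git gc --auto
git count-objects -v

# Alternative 1: turn off auto-gc for this repository only.
git config gc.auto 0
gc_auto=$(git config gc.auto)

# Alternative 2: keep auto-gc, but prune unreachable objects at once.
# (Once gc.auto is 0, auto-gc no longer runs, so pick one of the two.)
git config gc.pruneExpire now
prune_expire=$(git config gc.pruneExpire)

echo "gc.auto=$gc_auto gc.pruneExpire=$prune_expire"
```

Comparing the "count" and "size" lines before and after the slow step
should show whether loose objects account for the extra disk usage.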

Even if one of those helps, I'd still like to know why the gc behavior
changed between the two versions. The best way to do that is via
git-bisect.

You should be able to do:

  # make sure you can compile git from source
  git clone git://git.kernel.org/pub/scm/git/git.git
  cd git
  make

  git bisect start
  git bisect good v2.7.4
  git bisect bad v2.9.3

  # for each commit bisect dumps you at, run your test. The bin-wrappers
  # part is important, because it sets up the environment to run
  # sub-programs from the built version. And as pull is a shell script,
  # the problem is likely in a sub-program.
  /path/to/git/bin-wrappers/git pull ...

  # And then mark whether it was fast or slow. You obviously don't need
  # to run the program to completion; just enough to decide if it's fast
  # or slow (which might be better done by observing disk space rather
  # than timing).
  git bisect good ;# or "bad" if it was slow

It's going to be tedious even if it takes 30 minutes per iteration. It
might be worth trying to adjust the test case for smaller repos. :)
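
If the good/bad decision can be reduced to a size check, the manual loop
above could be handed to "git bisect run", which treats exit status 0 as
good, 125 as "skip this commit", and any other status from 1 to 127 as
bad. The helper below is only a sketch under assumptions: the function
names, the result directory and the KiB limit are hypothetical and do
not come from the original script.

```shell
# Hypothetical helper functions for use with "git bisect run".

verdict_for_size() {
    # $1 = size of the result repository in KiB, $2 = limit in KiB
    if [ "$1" -le "$2" ]; then
        return 0    # small repository: behaves like the good version
    fi
    return 1        # bloated repository: behaves like the bad version
}

bisect_step() {
    # $1 = directory the shortened test writes its merged repository into
    # $2 = size limit in KiB (an assumed threshold, tune it for your data)
    make -s >/dev/null 2>&1 || return 125   # cannot build this commit: skip
    # Run a shortened version of the merge script here, invoking
    # bin-wrappers/git from this source tree instead of the installed git.
    verdict_for_size "$(du -sk "$1" | cut -f1)" "$2"
}
```

Saved to a file and sourced, this could be driven with something like
"git bisect run sh -c '. /path/to/helper.sh && bisect_step /path/to/result 200000'"
from inside the git source checkout.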

It may also be worth trying the test with the latest tip of "master".
v2.9.3 is several versions behind, and it's possible that something may
have been fixed since then (nothing comes immediately to mind, but it's
worth a shot).

-Peff


* Re: 30min Script in git 2.7.4 takes 22+ hrs in git 2.9.3
  2017-04-27 20:09 ` Jeff King
@ 2017-04-27 20:42   ` Jeff King
  0 siblings, 0 replies; 3+ messages in thread
From: Jeff King @ 2017-04-27 20:42 UTC
  To: Robert Stryker; +Cc: git

On Thu, Apr 27, 2017 at 04:09:56PM -0400, Jeff King wrote:

> On Thu, Apr 27, 2017 at 12:36:54PM -0400, Robert Stryker wrote:
> 
> > The problem: the script takes 30 minutes in one environment with git
> > 2.7.4, and generates a repo of about 30 MB. When run by a coworker
> > using git 2.9.3, it takes 22+ hours and generates a 10 GB repo.
> > 
> > Clearly something here is very wrong. Either there's a pretty horrible
> > regression or my idea is a pretty bad one ;)
> 
> The large size makes me think that you're getting an auto-gc in the
> middle that is exploding the unreachable objects into loose storage.
> This can happen when objects are ready to be pruned, but Git holds on to
> them for a grace period (2 weeks by default) as a precaution against
> simultaneous use.
> 
> Try doing:
> 
>   git config gc.auto 0
> 
> in the repositories before the slow step. Or alternatively, try:
> 
>   git config gc.pruneExpire now
> 
> which will continue to do the auto-gc, but throw away unreachable
> objects immediately.
> 
> Or alternatively, we're failing to run gc at all and just getting tons
> of loose objects that need to be packed. What does running "git gc --auto" say
> if you run it in the slow repository? Does it improve the disk space
> problem?

Fiddling with your script a bit, I have a suspect. Between your two
versions of git, we started disallowing merge of unrelated histories by
default[1]. Which is exactly what your script is doing:

  echo "Merge in the four rewritten projects, with generic commit messages"
  git pull --no-edit webtools.common.fproj
  git pull --no-edit webtools.common
  git pull --no-edit webtools.common.tests
  git pull --no-edit webtools.common.snippets

If you run under "set -e", or just put "|| exit 1" after those, you'll
see that they fail with v2.9.3 and newer.
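
A small reproduction of that failure, assuming Git 2.9 or newer and two
throwaway repositories named "target" and "source" (hypothetical names,
not taken from the script). "--no-rebase" is added only so that recent
Git versions, which otherwise stop to ask how divergent branches should
be reconciled, go straight to the merge attempt.

```shell
# Scratch area; every name below is made up for the demonstration.
work=$(mktemp -d)
cd "$work"

make_repo() {
    # Create a repository with a single commit and a local identity.
    git init -q "$1"
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "$1" > "$1/$1.txt"
    git -C "$1" add .
    git -C "$1" commit -q -m "initial commit in $1"
}

make_repo target
make_repo source

cd target
# The two histories share no commit, so the merge is refused.
if pull_output=$(git pull --no-edit --no-rebase ../source 2>&1); then
    pull_status=0
else
    pull_status=$?
fi

echo "exit status: $pull_status"
echo "$pull_output"
```

Without a check such as "|| exit 1", a script simply carries on after
the non-zero exit status and the failed merge goes unnoticed.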

So what I think is happening is that we never create that shared
history, and then your per-tag work is building further on a nonsense
fake history. That has two implications:

  - as the divergent history in the shared repo gets bigger and bigger,
    the fetches have to do more and more work to try to find a common
    ancestor (but of course they'll never find one, because the two
    histories aren't related)

  - the divergent history racks up tons of unreachable objects, which
    auto-gc won't pack. After a while of the script running, you can see
    that auto-gc fails with "There are too many unreachable loose
    objects" after the pack. Due to the way background gc works these
    days, that blocks further auto-gc from running until the situation
    is resolved. And you just rack up tons of loose objects, which
    explains the disk usage.

Try adding "--allow-unrelated-histories" to your git-pull invocation.
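
As a sketch of the effect of that flag, assuming Git 2.9 or newer and
two throwaway repositories with hypothetical names ("target" and
"source"); "--no-rebase" is again only there to keep recent Git versions
from asking how to reconcile divergent branches.

```shell
# Scratch area; every name below is made up for the demonstration.
work=$(mktemp -d)
cd "$work"

make_repo() {
    # Create a repository with a single commit and a local identity.
    git init -q "$1"
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "$1" > "$1/$1.txt"
    git -C "$1" add .
    git -C "$1" commit -q -m "initial commit in $1"
}

make_repo target
make_repo source

cd target
# With the flag, the two unrelated root commits are joined by a merge.
git pull -q --no-edit --no-rebase --allow-unrelated-histories ../source

commit_count=$(git rev-list --count HEAD)
echo "commits reachable from HEAD: $commit_count"
ls
```

Applied to the script, the quoted lines would take the form
"git pull --no-edit --allow-unrelated-histories webtools.common.fproj",
and likewise for the other three repositories.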

-Peff

[1] See e379fdf34 (merge: refuse to create too cool a merge by default, 2016-03-18)
