git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Richard Hipp <drh@sqlite.org>,
	git@vger.kernel.org, Mike Hommey <mh@glandium.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: git-fast-import yields huge packfile
Date: Thu, 21 Mar 2019 15:23:15 +0100
Message-ID: <87o964cnn0.fsf@evledraar.gmail.com>
In-Reply-To: <nycvar.QRO.7.76.6.1903211503030.41@tvgsbejvaqbjf.bet>


On Thu, Mar 21 2019, Johannes Schindelin wrote:

> Hi Richard,
>
> On Sat, 16 Mar 2019, Richard Hipp wrote:
>
>> I'm trying to transform a repository from another VCS into a Git
>> repository using "git fast-import".  It appears to work, but the
>> resulting Git repository is huge relative to the original - 18 times
>> larger. Most of the space seems to be taken up by a single large
>> packfile.  That packfile is about 967 MB which is about 1/4th the
>> total uncompressed size of all 41785 distinct Blobs in the original
>> repository.  The source VCS is able to compress this down to 52 MB by
>> comparison.
>
> I feel your pain, as I had the same problem back in the day. My use case
> was mirroring an upstream Mercurial repository to a Git repository. This
> use case went away, so I do not do that anymore (and there are more, less
> happy reasons why I would no longer work on that git-remote-hg project,
> but that's off topic). As one of the last rem(a)inders, Git for Windows
> carries this patch:
>
> https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87
>
> Essentially, it *always* runs `git gc --auto` after running `fast-import`.
>
> Which is a lot more high-level advice than the rather low-level `git
> repack` hint given elsewhere in this thread.
>
> Now, I wonder whether we should integrate this into `fast-import` proper
> (with a knob to turn it off), maybe even offer to run `git gc --auto`
> every <N> imported commits?

My reading of Linus's and Mike Hommey's emails, taken together, is that
this just happened to work for you because the blob import order you
used was such that you didn't get any on-the-fly deltas.

But as Linus notes, for this to work in the general case you need to
pass "-f" to "git repack", which hands "--no-reuse-delta" down to
pack-objects. So a plain "git gc" in that Git for Windows patch won't do
the right thing *unless* you didn't end up with any deltas at all (or
close enough for it not to matter).

So in the general case you need to run "git gc --aggressive" after a
"fast-import". I'll add some docs about this in my re-roll of my
concurrent gc doc series:
https://public-inbox.org/git/20190318161502.7979-1-avarab@gmail.com/
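To make that concrete, here is a minimal sketch of the import-then-gc
sequence. The function name and its two arguments (a target directory
and a file holding the fast-import stream) are placeholders of mine, not
anything from this thread, and it assumes a fresh non-bare repository:

```shell
#!/bin/sh
# Sketch only: import a stream into a new repository, then have gc
# recompute deltas instead of reusing whatever fast-import produced.
# "import_and_repack", "repo" and "stream" are hypothetical names.
import_and_repack () {
    repo=$1
    stream=$2

    git init --quiet "$repo" &&
    git -C "$repo" fast-import --quiet <"$stream" &&
    # --aggressive makes gc run "git repack" with -f, which in turn
    # passes --no-reuse-delta to pack-objects.
    git -C "$repo" gc --aggressive --quiet
}
```

It would be called as e.g. "import_and_repack /path/to/new-repo
/path/to/stream"; expect the gc step to take a while on a large import.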

I wonder if we should just leave it at that. The fast-import command is
plumbing, and e.g. someone who runs several of those now and does a
"git gc --aggressive" afterwards would have their use case broken by
this: their "gc" would abort if the "--aggressive" we spawned after the
first fast-import invocation was still running.

I was thinking of introducing some sub-mode for --aggressive that
doesn't tweak the window size, but just passes down "-f". It would more
generally cover these cases, and eat less CPU than the increased window
size (although "--no-reuse-delta" by itself is very expensive).
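Until something like that exists, the closest equivalent I can think of
is calling "git repack" directly. This is only a sketch: the function
name is made up, and I'm assuming that a full repack with -f and the
default window/depth settings is a fair stand-in for the sub-mode
described above:

```shell
#!/bin/sh
# Sketch only: recompute all deltas without raising the window or depth.
# "repack_no_reuse" is a hypothetical helper name.
repack_no_reuse () {
    # -a: put everything reachable into a single pack
    # -d: delete the packs made redundant by the new one
    # -f: pass --no-reuse-delta to pack-objects
    git -C "$1" repack -a -d -f --quiet
}
```

Unlike "gc --aggressive" this leaves the window and depth at whatever
pack.window and pack.depth say, so it should be cheaper, but it still
has to recompute every delta.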


>> Maybe I'm doing something wrong with the fast-import stream that is
>> defeating Git's attempts at delta compression....
>>
>> Are there any utility programs available for analyzing packfiles so
>> that I try to figure out where the inefficiencies are cropping up, so
>> that I can try to address them?
>>
>> Anybody have any suggestions on what I should be looking for?
>>
>> If anyone would care to see this oversized packfile and perhaps offer
>> suggestions on how I can make it more space-efficient, it can be
>> cloned from https://github.com/drhsqlite/fossil-mirror.git - at least
>> for now - surely I will delete that repo and regenerate it once I
>> figure out this problem.
>>
>> --
>> D. Richard Hipp
>> drh@sqlite.org
>>
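
On the question of analyzing packfiles: "git verify-pack -v" lists every
object in a pack, and deltified objects get two extra columns (delta
depth and base object). A rough sketch for counting how many objects
were stored as deltas follows; the function name is made up, and it
assumes the repository already has at least one pack:

```shell
#!/bin/sh
# Sketch only: count deltified vs. fully stored objects in all packs.
# "pack_delta_stats" is a hypothetical helper name.
pack_delta_stats () {
    (
        cd "$1" &&
        git verify-pack -v "$(git rev-parse --git-dir)"/objects/pack/pack-*.idx
    ) |
    # Object lines have 5 fields when stored whole and 7 when stored as
    # a delta; the trailing summary lines have other field counts.
    awk 'NF == 7 { deltas++ }
         NF == 5 { full++ }
         END { printf "deltas=%d full=%d\n", deltas + 0, full + 0 }'
}
```

A large pack where almost nothing is a delta would be consistent with
the explanation given elsewhere in this thread.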

Thread overview:
2019-03-16 20:31 git-fast-import yields huge packfile Richard Hipp
2019-03-16 21:04 ` Linus Torvalds
2019-03-16 22:12   ` Mike Hommey
2019-03-16 23:22   ` Richard Hipp
2019-03-21 14:09 ` Johannes Schindelin
2019-03-21 14:23   ` Ævar Arnfjörð Bjarmason [this message]
