git@vger.kernel.org mailing list mirror (one of many)
* git-fast-import yields huge packfile
@ 2019-03-16 20:31 Richard Hipp
  2019-03-16 21:04 ` Linus Torvalds
  2019-03-21 14:09 ` Johannes Schindelin
  0 siblings, 2 replies; 6+ messages in thread
From: Richard Hipp @ 2019-03-16 20:31 UTC
  To: git

I'm trying to transform a repository from another VCS into a Git
repository using "git fast-import".  It appears to work, but the
resulting Git repository is huge relative to the original - 18 times
larger. Most of the space seems to be taken up by a single large
packfile.  That packfile is about 967 MB which is about 1/4th the
total uncompressed size of all 41785 distinct Blobs in the original
repository.  The source VCS is able to compress this down to 52 MB by
comparison.

Maybe I'm doing something wrong with the fast-import stream that is
defeating Git's attempts at delta compression....

Are there any utility programs available for analyzing packfiles, so
that I can figure out where the inefficiencies are cropping up and
try to address them?

Anybody have any suggestions on what I should be looking for?

If anyone would care to see this oversized packfile and perhaps offer
suggestions on how I can make it more space-efficient, it can be
cloned from https://github.com/drhsqlite/fossil-mirror.git - at least
for now - surely I will delete that repo and regenerate it once I
figure out this problem.

-- 
D. Richard Hipp
drh@sqlite.org


* Re: git-fast-import yields huge packfile
  2019-03-16 20:31 git-fast-import yields huge packfile Richard Hipp
@ 2019-03-16 21:04 ` Linus Torvalds
  2019-03-16 22:12   ` Mike Hommey
  2019-03-16 23:22   ` Richard Hipp
  2019-03-21 14:09 ` Johannes Schindelin
  1 sibling, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2019-03-16 21:04 UTC
  To: Richard Hipp; +Cc: Git List Mailing

On Sat, Mar 16, 2019 at 1:31 PM Richard Hipp <drh@sqlite.org> wrote:
>
> Maybe I'm doing something wrong with the fast-import stream that is
> defeating Git's attempts at delta compression....

fast-import doesn't do fancy delta compression because that would
defeat the "fast" part of fast-import.

Just do a git repack after the import to do the proper repacking. I
get a 41 MB packfile when I try that on your repo.

So a simple

   git repack -adf

should fix things up for you (the "-f" to make sure it doesn't try to
re-use things from the silly bad pack). Alternatively, use "git gc
--aggressive", which will do that forced repack too.

                  Linus
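
A minimal sketch of the import-then-repack sequence described above,
assuming a POSIX shell with git and seq available. The synthetic stream,
the branch name refs/heads/imported and the committer identity are made
up for illustration; a real conversion would pipe its own exporter into
fast-import instead. It prints the size of the pack directory before
and after the forced repack.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)                 # hypothetical scratch location
git init -q "$repo"
cd "$repo"

# Emit a small synthetic stream: 120 commits, each replacing a.txt with a
# slightly longer version. (Kept above 100 objects, because very small
# imports may be unpacked into loose objects instead of left as a pack.)
make_stream() {
    i=1
    while [ "$i" -le 120 ]; do
        printf 'commit refs/heads/imported\n'
        printf 'committer Example <example@example.invalid> %d +0000\n' "$((1500000000 + i))"
        printf 'data <<EOM\nrevision %d\nEOM\n' "$i"
        printf 'M 100644 inline a.txt\ndata <<EOF\n'
        seq 1 "$((300 + i))"
        printf 'EOF\n\n'
        i=$((i + 1))
    done
}

# Size of the pack directory in KiB.
pack_kib() { du -sk .git/objects/pack | cut -f1; }

make_stream | git fast-import --quiet
before=$(pack_kib)

# -a: put everything into one pack, -d: drop the old pack,
# -f: recompute deltas instead of reusing those already in the pack.
git repack -a -d -f -q
after=$(pack_kib)

echo "pack directory: ${before} KiB after import, ${after} KiB after repack"
```

On a toy stream like this the two numbers may be close, since every
revision directly follows its predecessor; the gap reported in this
thread came from a real history with many interleaved files.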


* Re: git-fast-import yields huge packfile
  2019-03-16 21:04 ` Linus Torvalds
@ 2019-03-16 22:12   ` Mike Hommey
  2019-03-16 23:22   ` Richard Hipp
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Hommey @ 2019-03-16 22:12 UTC
  To: Linus Torvalds; +Cc: Richard Hipp, Git List Mailing

On Sat, Mar 16, 2019 at 02:04:33PM -0700, Linus Torvalds wrote:
> On Sat, Mar 16, 2019 at 1:31 PM Richard Hipp <drh@sqlite.org> wrote:
> >
> > Maybe I'm doing something wrong with the fast-import stream that is
> > defeating Git's attempts at delta compression....
> 
> fast-import doesn't do fancy delta compression because that would
> defeat the "fast" part of fast-import.

fast-import however does try to do delta compression of blobs against the
last blob that was imported, so if you put your blobs in an order where
they can be delta-ed, you can win without a git repack.

For one-shot conversions, you can just rely on git repack.

Mike
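
One way to see the effect of blob ordering, and to inspect a pack in
general, is "git verify-pack -v", which lists every object and, for
deltas, the chain depth and base object. The sketch below assumes a
POSIX shell with git, seq and awk; the synthetic blobs are invented for
illustration. It feeds fast-import 200 blobs in an order where each one
is a near-copy of the one before it, then counts how many were stored
as deltas.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)                 # hypothetical scratch location
git init -q "$repo"
cd "$repo"

# 200 blobs, each one line longer than the previous, emitted back to back.
# (More than 100 objects, because very small imports may be unpacked into
# loose objects instead of being kept as a pack.)
i=1
while [ "$i" -le 200 ]; do
    printf 'blob\ndata <<EOF\n'
    seq 1 "$((200 + i))"
    printf 'EOF\n'
    i=$((i + 1))
done | git fast-import --quiet

idx=$(ls .git/objects/pack/pack-*.idx)
listing=$(git verify-pack -v "$idx")

# Object lines have 5 fields; delta lines carry 2 extra (depth, base).
blobs=$(printf '%s\n' "$listing" | awk '$2 == "blob" { n++ } END { print n + 0 }')
deltas=$(printf '%s\n' "$listing" | awk '$2 == "blob" && NF == 7 { n++ } END { print n + 0 }')

echo "blobs: $blobs, stored as deltas: $deltas"
```

Running the same listing on an oversized pack and sorting by the
size-in-pack column (field 4) should point at the objects that were
stored whole.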


* Re: git-fast-import yields huge packfile
  2019-03-16 21:04 ` Linus Torvalds
  2019-03-16 22:12   ` Mike Hommey
@ 2019-03-16 23:22   ` Richard Hipp
  1 sibling, 0 replies; 6+ messages in thread
From: Richard Hipp @ 2019-03-16 23:22 UTC
  To: Linus Torvalds; +Cc: Git List Mailing

On 3/16/19, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>    git repack -adf
>

Thanks for the tip!

-- 
D. Richard Hipp
drh@sqlite.org


* Re: git-fast-import yields huge packfile
  2019-03-16 20:31 git-fast-import yields huge packfile Richard Hipp
  2019-03-16 21:04 ` Linus Torvalds
@ 2019-03-21 14:09 ` Johannes Schindelin
  2019-03-21 14:23   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2019-03-21 14:09 UTC
  To: Richard Hipp; +Cc: git

Hi Richard,

On Sat, 16 Mar 2019, Richard Hipp wrote:

> I'm trying to transform a repository from another VCS into a Git
> repository using "git fast-import".  It appears to work, but the
> resulting Git repository is huge relative to the original - 18 times
> larger. Most of the space seems to be taken up by a single large
> packfile.  That packfile is about 967 MB which is about 1/4th the
> total uncompressed size of all 41785 distinct Blobs in the original
> repository.  The source VCS is able to compress this down to 52 MB by
> comparison.

I feel your pain, as I had the same problem back in the day. My use case
was mirroring an upstream Mercurial repository to a Git repository. This
use case went away, so I do not do that anymore (and there are more, less
happy reasons why I would no longer work on that git-remote-hg project,
but that's off topic). As one of the last rem(a)inders, Git for Windows
carries this patch:

https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87

Essentially, it *always* runs `git gc --auto` after running `fast-import`.

Which is a lot more high-level advice than the rather low-level `git
repack` hint given elsewhere in this thread.

Now, I wonder whether we should integrate this into `fast-import` proper
(with a knob to turn it off), maybe even offer to run `git gc --auto`
every <N> imported commits?

Ciao,
Johannes
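
The "always run gc --auto after fast-import" idea can be approximated
outside of Git with a small wrapper. This is only a sketch under stated
assumptions, not the Git for Windows patch itself: the function name
import_then_gc, the branch name and the one-commit stream are
hypothetical. Note that "git gc --auto" only does work once its
loose-object or pack-count thresholds are exceeded, so on a tiny import
like this one it is expected to find nothing to do.

```shell
#!/bin/sh
set -eu

# Hypothetical wrapper: import the stream on stdin into repository $1,
# then let gc decide whether housekeeping is due.
import_then_gc() {
    git -C "$1" fast-import --quiet &&
    # gc.autoDetach=false keeps gc in the foreground so the caller
    # knows it has finished when the function returns.
    git -C "$1" -c gc.autoDetach=false gc --auto --quiet
}

repo=$(mktemp -d)                 # hypothetical scratch location
git init -q "$repo"

printf '%s\n' \
    'commit refs/heads/imported' \
    'committer Example <example@example.invalid> 1500000000 +0000' \
    'data <<EOM' \
    'initial import' \
    'EOM' \
    'M 100644 inline hello.txt' \
    'data <<EOF' \
    'hello' \
    'EOF' |
import_then_gc "$repo"

git -C "$repo" log --oneline refs/heads/imported
```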



* Re: git-fast-import yields huge packfile
  2019-03-21 14:09 ` Johannes Schindelin
@ 2019-03-21 14:23   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-03-21 14:23 UTC
  To: Johannes Schindelin; +Cc: Richard Hipp, git, Mike Hommey, Linus Torvalds


On Thu, Mar 21 2019, Johannes Schindelin wrote:

> Hi Richard,
>
> On Sat, 16 Mar 2019, Richard Hipp wrote:
>
>> I'm trying to transform a repository from another VCS into a Git
>> repository using "git fast-import".  It appears to work, but the
>> resulting Git repository is huge relative to the original - 18 times
>> larger. Most of the space seems to be taken up by a single large
>> packfile.  That packfile is about 967 MB which is about 1/4th the
>> total uncompressed size of all 41785 distinct Blobs in the original
>> repository.  The source VCS is able to compress this down to 52 MB by
>> comparison.
>
> I feel your pain, as I had the same problem back in the day. My use case
> was mirroring an upstream Mercurial repository to a Git repository. This
> use case went away, so I do not do that anymore (and there are more, less
> happy reasons why I would no longer work on that git-remote-hg project,
> but that's off topic). As one of the last rem(a)inders, Git for Windows
> carries this patch:
>
> https://github.com/git-for-windows/git/commit/b91911ff8d3e2cf279b4708be89de2e3bc8e9e87
>
> Essentially, it *always* runs `git gc --auto` after running `fast-import`.
>
> Which is a lot more high-level advice than the rather low-level `git
> repack` hint given elsewhere in this thread.
>
> Now, I wonder whether we should integrate this into `fast-import` proper
> (with a knob to turn it off), maybe even offer to run `git gc --auto`
> every <N> imported commits?

My reading of the combination of Linus's & Mike Hommey's E-Mails is that
this just happened to work for you because the blob import order you
used was such that you didn't get any on-the-fly deltas.

But as Linus notes you need to pass "-f" aka. "--no-reuse-delta" down to
pack-objects for this to work in the general case, so a plain "git gc"
in that GFW patch won't do the right thing *unless* you didn't end up
with any deltas at all (or close enough for it not to matter).

So in the general case you need to run "git gc --aggressive" after a
"fast-import". I'll add some docs about this in my re-roll of my
concurrent gc doc series:
https://public-inbox.org/git/20190318161502.7979-1-avarab@gmail.com/

I wonder if we should just leave it at that. The fast-import command is
plumbing, and e.g. someone running N number of those now and doing a
"git gc --aggressive" afterwards would have their use broken by this,
their "gc" would abort if the "--aggressive" we spawned after the 1st
fast-import invocation was still running.

I was thinking of introducing some sub-mode for --aggressive that
doesn't tweak the window size, but just passes down "-f". It would more
generally cover these cases, and eat less CPU than the increased window
size (although "--no-reuse-delta" by itself is very expensive).
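
The difference between the two modes discussed above can be
approximated today with plain "git repack". The sketch below assumes a
POSIX shell with git and seq; the synthetic history and branch name are
invented for illustration, and the --window=250 --depth=50 values are
what "git gc --aggressive" uses by default as far as I know (see
gc.aggressiveWindow and gc.aggressiveDepth), not something stated in
this thread. It repacks once with "-f" alone and once with "-f" plus
the larger window, and prints the resulting pack sizes.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)                 # hypothetical scratch location
git init -q "$repo"
cd "$repo"

# Synthetic history: 120 commits, each rewriting one growing file.
i=1
while [ "$i" -le 120 ]; do
    printf 'commit refs/heads/imported\n'
    printf 'committer Example <example@example.invalid> %d +0000\n' "$((1500000000 + i))"
    printf 'data <<EOM\nrevision %d\nEOM\n' "$i"
    printf 'M 100644 inline notes.txt\ndata <<EOF\n'
    seq 1 "$((300 + i))"
    printf 'EOF\n\n'
    i=$((i + 1))
done | git fast-import --quiet

# Total bytes in all pack files.
pack_bytes() { cat .git/objects/pack/*.pack | wc -c | tr -d ' '; }

# Forced delta recomputation, default search window and depth.
git repack -a -d -f -q
forced_only=$(pack_bytes)

# Forced recomputation with the wider, more CPU-hungry search window.
git repack -a -d -f -q --window=250 --depth=50
wide_window=$(pack_bytes)

echo "-f only: ${forced_only} bytes, -f with window=250: ${wide_window} bytes"
```

Timing the two repack calls on a real import would show how much of
the cost comes from the window size rather than from "-f" itself.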

