git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <junkio@cox.net>
To: "Troy Telford" <ttelford@linuxnetworx.com>
Cc: git@vger.kernel.org
Subject: Re: git clone dies (large git repository)
Date: Sat, 19 Aug 2006 13:46:30 -0700
Message-ID: <7vfyfs313t.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <op.teh30gmyies9li@rygel.lnxi.com> (Troy Telford's message of "Fri, 18 Aug 2006 16:42:06 -0600")

"Troy Telford" <ttelford@linuxnetworx.com> writes:

> I originally had everything as loose objects.  I then ran 'git-repack
> -d' on occasion, so I had a combination of a large pack file, smaller
> pack files, and loose objects.  Finally, I tried 'git repack -a -d'
> and consolidated it all into a single 4GB pack file.  It didn't seem
> to make much difference in the output.
>
> Am I bumping some sort of limitation within git, or have I uncovered a bug?

The former.  Unfortunately this comes from an old design
decision.

Fortunately this design decision is not something irreversible
(see Chapter 1 of Documentation/ManagementStyle in the kernel
repository ;-).

The packfile is a dual-use format.  When used for network
transfer, we only send the .pack file and have the recipient
reconstruct the corresponding .idx file.  When used locally, we
need both .pack and .idx file; .pack contains the meat of the
data, and .idx allows us random access to the objects stored in
the corresponding .pack file.

What is interesting is that the .pack format does not have (as
far as I know) an inherent size limitation.  However, the .idx
file has hardcoded 32-bit offsets into the .pack -- hence, in
practice, you cannot use a .pack that is over 4GB locally.
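
To make the limit concrete, here is a minimal sketch in Python
(not git's actual code) of what storing an offset in a 4-byte
network-order field implies.  The names encode_offset and
MAX_IDX_OFFSET are made up for this illustration.

```python
import struct

# Largest value a 4-byte unsigned field can hold (hypothetical name).
MAX_IDX_OFFSET = 2**32 - 1


def encode_offset(offset: int) -> bytes:
    """Encode a pack offset as 4 bytes in network (big-endian) order.

    This mirrors the idea of writing an htonl'ed 32-bit value; it is
    an illustration, not the real .idx writer.
    """
    if not 0 <= offset <= MAX_IDX_OFFSET:
        raise ValueError(f"offset {offset} does not fit in 32 bits")
    return struct.pack(">I", offset)


if __name__ == "__main__":
    print(encode_offset(2**32 - 1).hex())  # last representable offset
    try:
        encode_offset(2**32)  # first byte past 4GB
    except ValueError as err:
        print("cannot index:", err)
```

Any object that starts at or beyond byte 2^32 of the .pack has
no representable entry in such an index.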

One crude workaround that would work _today_ for your situation
without changing file formats would be to use git-fetch into an
empty repository (and do ref cloning by hand) instead of using
git-clone.  git-fetch gets .pack data over the wire and explodes
the objects contained in the stream into individual objects,
whereas git-clone gets .pack data, stores it as a .pack and
tries to create the corresponding .idx, which in your case would
bust the 32-bit limit and fail.

This is from a private note I sent to Linus on Jun 26 2005 when
pack & idx pairs were initially introduced.

 - Design decision.  As before, you have assumption that nothing
   is longer than 2^32 bytes.  I am not unhappy with that
   restriction with individual objects (even their uncompressed
   size limited below 4GB or even 2GB is fine --- after all we
   are talking about a source control system).  I am however
   wondering if we would regret it later to have a packed file
   also limited to 4GB by having object_entry.offset "unsigned
   long" (and fwrite htonl'ed 4 bytes).  I personally do not
   have problem with this, but I can easily see HPA frowning on
   us.  He didn't like it when I said "in GIT world, file sizes
   and offsets are of type 'unsigned long'" some time ago.

I do not have a copy of a response from Linus to this point, but
if I recall things correctly, since then, the plan always has
been (1) to limit the size of individual packfiles to fit within
the idx limit and/or (2) extend the idx format to be able to
express offsets over 2^32.  The latter is possible because the
idx file is a local matter, used only for local accesses and
does not get sent over the wire.
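
As an illustration of option (2), here is one hypothetical way
an index could express larger offsets: keep the 4-byte entries,
but reserve the top bit to mean "look in a side table of 8-byte
offsets".  This layout and the names below are assumptions made
up for the sketch, not something this message specifies.

```python
import struct

LARGE_FLAG = 0x80000000  # top bit set: entry is an index into the side table


def encode_offsets(offsets):
    """Return (small_table, large_table) as bytes for a list of offsets."""
    small = bytearray()
    large = bytearray()
    n_large = 0
    for off in offsets:
        if off < LARGE_FLAG:
            # Fits in 31 bits: store the offset directly.
            small += struct.pack(">I", off)
        else:
            # Too big: store a flagged index, put the real offset aside.
            small += struct.pack(">I", LARGE_FLAG | n_large)
            large += struct.pack(">Q", off)
            n_large += 1
    return bytes(small), bytes(large)


def decode_offset(small, large, i):
    """Look up the i-th offset from the two tables."""
    (value,) = struct.unpack_from(">I", small, 4 * i)
    if value & LARGE_FLAG:
        (off,) = struct.unpack_from(">Q", large, 8 * (value & 0x7FFFFFFF))
        return off
    return value
```

Small packs pay nothing extra under such a scheme; only entries
at 2GB or beyond take an additional 8 bytes each.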

However, even if we revise the .idx file format, we have another
practical problem to solve.  Currently we assume that we can mmap
one packfile as a whole and do a random access into it.  This
needs to be changed so that we (perhaps optionally, only when
dealing with a huge packfile) mmap part of a .pack at a time.
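
A rough sketch of the partial-mmap idea, in Python rather than
git's C: map only a window around the bytes we need, aligned to
the platform's allocation granularity.  The function name
read_at and the 1MB default window are assumptions made for
this example.

```python
import mmap
import os


def read_at(path, offset, length, window_size=1 << 20):
    """Read `length` bytes at `offset` by mapping only part of the file."""
    file_size = os.path.getsize(path)
    if offset < 0 or length < 0 or offset + length > file_size:
        raise ValueError("requested range is outside the file")
    if length == 0:
        return b""

    # The mapping must start on an allocation-granularity boundary.
    gran = mmap.ALLOCATIONGRANULARITY
    start = (offset // gran) * gran

    # Map enough to cover the request, but never past the end of the file.
    needed = offset - start + length
    map_len = min(max(window_size, needed), file_size - start)

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), map_len,
                       access=mmap.ACCESS_READ, offset=start) as window:
            rel = offset - start
            return window[rel:rel + length]
```

A real implementation would presumably keep a few windows open
and reuse them instead of remapping on every access.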

I recall more recently (as opposed to the heated discussion
immediately after packfile was introduced June last year) we had
another discussion about people not being able to mmap huge
packfiles, and partial mmapping was one of the things that were
discussed there.
