From: Junio C Hamano <junkio@cox.net>
To: git@vger.kernel.org
Cc: Kees-Jan Dijkzeul <k.j.dijkzeul@gmail.com>,
Linus Torvalds <torvalds@osdl.org>
Subject: Re: Cygwin can't handle huge packfiles?
Date: Fri, 07 Apr 2006 01:15:47 -0700
Message-ID: <7vhd55ls24.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: Pine.LNX.4.64.0604030734440.3781@g5.osdl.org
Linus Torvalds <torvalds@osdl.org> writes:
> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>>
>> That said, I think git _does_ have problems with large pack-files. We have
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> the packfile data structure does. The index has 32-bit offsets into
> individual pack-files.
>
> That's not hugely fundamental,...
Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.
* pack-*.pack file has the following format:

  - The header appears at the beginning and consists of the following:

    4-byte signature
    4-byte version number (network byte order)
    4-byte number of objects contained in the pack (network byte order)

    Observation: we cannot have more than 4G versions ;-) and
    more than 4G objects in a pack.

  - The header is followed by a number of object entries, each of
    which looks like this:

    (undeltified representation)
    n-byte type and length (3-bit type, (n-1)*7+4-bit length)
    compressed data

    (deltified representation)
    n-byte type and length (3-bit type, (n-1)*7+4-bit length)
    20-byte base object name
    compressed delta data

    Observation: length of each object is encoded in a variable
    length format and is not constrained to 32-bit or anything.

  - The trailer records 20-byte SHA1 checksum of all of the above.
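
As a rough illustration of the 12-byte header described above, here is a minimal Python sketch. The function name parse_pack_header is made up for this example, and the signature value "PACK" is not spelled out in the description above; it is filled in here as an assumption about what git writes.

```python
import struct

# Assumption: the 4-byte signature at the start of a pack is b"PACK".
PACK_SIGNATURE = b"PACK"


def parse_pack_header(data):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    if len(data) < 12:
        raise ValueError("pack header needs 12 bytes")
    # ">4sII": 4 raw bytes, then two unsigned 32-bit integers in
    # network (big-endian) byte order.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != PACK_SIGNATURE:
        raise ValueError("not a packfile")
    return version, count


# Hand-built header for illustration: version 2, three objects.
header = b"PACK" + struct.pack(">II", 2, 3)
version, count = parse_pack_header(header)
```

Because the object count is an unsigned 32-bit field, the largest value this header can carry is 2**32 - 1, which is the "4G objects" limit noted in the observation.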
* pack-*.idx file has the following format:

  - The header consists of 256 4-byte network byte order
    integers.  N-th entry of this table records the number of
    objects in the corresponding pack, the first byte of whose
    object name is less than or equal to N.

    Observation: we would need to extend this to an array of
    8-byte integers to go beyond 4G objects per pack, but it is
    not strictly necessary.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

    20-byte object name.

    Observation: we would definitely need to extend this to
    8-byte integer plus 20-byte object name to handle a packfile
    that is larger than 4GB.

  - The file is concluded with a trailer:

    A copy of the 20-byte SHA1 checksum at the end of
    corresponding packfile.

    20-byte SHA1-checksum of all of the above.
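
To make the idx layout above concrete, here is a small Python sketch that builds an idx image in memory and looks an object up through the fan-out table. Both build_idx and lookup_offset are hypothetical helpers written for this illustration, not git's code, and the sketch assumes the cumulative reading of the fan-out table (fanout[N] counts names whose first byte is at most N).

```python
import hashlib
import struct

FANOUT_SIZE = 256 * 4   # 256 four-byte counters
ENTRY_SIZE = 24         # 4-byte offset + 20-byte object name


def build_idx(entries, pack_checksum=b"\0" * 20):
    """Build idx bytes from (name, offset) pairs; illustration helper only."""
    entries = sorted(entries)            # entries are sorted by object name
    fanout = [0] * 256
    for name, _ in entries:
        fanout[name[0]] += 1
    for i in range(1, 256):              # turn per-byte counts into running totals
        fanout[i] += fanout[i - 1]
    body = struct.pack(">256I", *fanout)
    for name, offset in entries:
        body += struct.pack(">I", offset) + name
    body += pack_checksum                # copy of the packfile checksum
    return body + hashlib.sha1(body).digest()


def lookup_offset(idx, name):
    """Return the pack offset recorded for a 20-byte name, or None."""
    fanout = struct.unpack(">256I", idx[:FANOUT_SIZE])
    first = name[0]
    lo = fanout[first - 1] if first > 0 else 0   # first entry with this leading byte
    hi = fanout[first]                           # one past the last such entry
    while lo < hi:                               # binary search inside that slice
        mid = (lo + hi) // 2
        start = FANOUT_SIZE + mid * ENTRY_SIZE
        entry_name = idx[start + 4:start + ENTRY_SIZE]
        if entry_name == name:
            return struct.unpack(">I", idx[start:start + 4])[0]
        if entry_name < name:
            lo = mid + 1
        else:
            hi = mid
    return None


# Three made-up object names: two starting with 0x00, one with 0x01.
a = b"\x00" + b"\x11" * 19
b = b"\x00" + b"\x22" * 19
c = b"\x01" + b"\x33" * 19
idx = build_idx([(c, 900), (a, 12), (b, 345)])
```

The 4-byte offset field packed with ">I" is where the 4GB packfile limit comes from: struct.pack raises an error for any offset of 2**32 or more.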
This is not fundamental, in that pack idx file is something we
can regenerate from a packfile.  The push/fetch transfer over
git native protocols does not even transfer pack idx file;
instead, the recipient uses git-index-pack to generate pack idx.

git-index-pack would need to be updated to write the necessary
fields as 8-byte integers, without breaking existing packfiles.

The code to read idx file currently has sanity check logic to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header matches the number
of objects recorded in the pack).  So we could reliably
distinguish between the current 24-byte version and 28-byte
"beyond 4GB" version, and support both formats at the same time.

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of pack idx file when the packfile offset can be
expressed in 32 bits.

Having said that, I have to warn that this is not for the weak
of heart.  The necessary changes would be somewhat involved.
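
The size-based check for telling the two entry widths apart could look roughly like the Python sketch below. It assumes the proposed 28-byte layout keeps the same 256 x 4-byte fan-out header and 40-byte trailer as today's format; idx_entry_size is a hypothetical name, and none of this is existing git code.

```python
import struct

FANOUT_SIZE = 256 * 4
TRAILER_SIZE = 20 + 20   # packfile checksum + idx checksum


def idx_entry_size(idx):
    """Return 24 or 28, whichever entry width matches the idx file size."""
    if len(idx) < FANOUT_SIZE + TRAILER_SIZE:
        raise ValueError("idx file too short")
    # The last fan-out slot holds the total number of objects.
    nobjects = struct.unpack(">I", idx[FANOUT_SIZE - 4:FANOUT_SIZE])[0]
    payload = len(idx) - FANOUT_SIZE - TRAILER_SIZE
    # 24 is the current layout; 28 is the hypothetical "beyond 4GB" layout.
    for entry_size in (24, 28):
        if payload == nobjects * entry_size:
            return entry_size
    raise ValueError("idx size matches neither layout")
```

One corner case: an idx for a pack with zero objects has the same size under both layouts, so this sketch simply reports 24 for it.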
----------------------------------------------------------------
Pack idx file

        idx
            +--------------------------------+
            | fanout[0] = 2                  |-.
            +--------------------------------+ |
            | fanout[1]                      | |
            +--------------------------------+ |
            | fanout[2]                      | |
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
            | fanout[255]                    | |
            +--------------------------------+ |
main        | offset                         | |
index       | object name 00XXXXXXXXXXXXXXXX | |
table       +--------------------------------+ |
            | offset                         | |
            | object name 00XXXXXXXXXXXXXXXX | |
            +--------------------------------+ |
          .-| offset                         |<+
          | | object name 01XXXXXXXXXXXXXXXX |
          | +--------------------------------+
          | | offset                         |
          | | object name 01XXXXXXXXXXXXXXXX |
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          | | offset                         |
          | | object name FFXXXXXXXXXXXXXXXX |
          | +--------------------------------+
trailer   | | packfile checksum              |
          | +--------------------------------+
          | | idxfile checksum               |
          | +--------------------------------+
          .-------.
                  |
Pack file entry: <+
     packed object header:
        1-byte type (bit 4-6)
               size0 (bit 0-3)
               size continuation flag (bit 7; set when more
               size bytes follow)
        n-byte sizeN (as long as MSB is set, each 7-bit)
               size0..sizeN form 4+7+7+..+7 bit integer, size0
               is the least significant part, and sizeN is the
               most significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
               is the size before compression).
        If it is DELTA, then
          20-byte base object name SHA1 (the size above is the
               size of the delta data that follows).
          delta data, deflated.
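
The packed object header above can be decoded with a few lines of Python. This is a sketch under the layout just described (type in bits 4-6, size0 in bits 0-3 as the low-order bits, bit 7 meaning another size byte follows); decode_object_header and encode_object_header are names invented for this example.

```python
def decode_object_header(buf, pos=0):
    """Return (type, size, next_pos) for the packed object header at pos."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x7     # bits 4-6
    size = byte & 0x0F               # size0: bits 0-3, the low-order bits
    shift = 4
    while byte & 0x80:               # bit 7 set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def encode_object_header(obj_type, size):
    """Inverse of decode_object_header, used here only to build test input."""
    first = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(first | 0x80)     # more size bytes follow this one
        first = size & 0x7F
        size >>= 7
    out.append(first)                # last byte has bit 7 clear
    return bytes(out)


# Type 3 with size 300: 300 = 18 * 16 + 12, so size0 is 12 and size1 is 18.
encoded = encode_object_header(3, 300)
decoded = decode_object_header(encoded)
```

Since each extra byte adds seven more size bits, the same loop handles sizes beyond 32 bits without any change, which is the point of the earlier observation about the variable length format.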