Re: Confused over packfile and index design


From: Jeff King <peff@peff.net>
To: "Steven E. Harris" <seh@panix.com>
Cc: git@vger.kernel.org
Subject: Re: Confused over packfile and index design
Date: Fri, 8 Apr 2011 20:20:48 -0400
Message-ID: <20110409002047.GB7445@sigill.intra.peff.net>
In-Reply-To: <m2d3kw70su.fsf@Spindle.sehlabs.com>

On Fri, Apr 08, 2011 at 07:58:41PM -0400, Steven E. Harris wrote:

> ,----
> | Importantly, packfile indexes are /not/ necessary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receive-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
> 
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

Yes. For example, when we do a "git fetch", we get _just_ the packfile
and create our own local index.
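
To make "create our own local index" concrete, here is a minimal sketch of the core idea: an index is essentially a sorted mapping from object name to pack offset, and the object name can be recomputed from the inflated content. The function names and the offsets in the usage note are made up for illustration. A real .idx file also carries a fan-out table, CRCs and checksums, and delta objects must be resolved against their base before they can be hashed.

```python
import hashlib


def object_id(obj_type: str, data: bytes) -> str:
    """SHA-1 object name: hash of "<type> <size>\\0" followed by the inflated content."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def build_index(scanned):
    """Turn the results of a linear scan into index entries.

    `scanned` is an iterable of (offset, obj_type, inflated_data) tuples
    (a hypothetical shape chosen for this sketch). The result is a list of
    (object id, offset) pairs sorted by id, which is what makes lookup fast.
    """
    return sorted((object_id(t, d), off) for off, t, d in scanned)
```

For example, `build_index([(12, "blob", b"hello\n"), (30, "blob", b"")])` yields two entries sorted by object name, each pointing back at the offset where the scan found the object.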

> Later, in the section on the packfile format, we find this:
> 
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
> 
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object. If
> that's true, then how does one figure out where one object ends and the
> next one begins /without the index/?

The actual object data (whether it is the object itself or a delta) is
all zlib-encoded, and a zlib stream is self-terminating: it marks its own
final block and ends with a checksum, I believe. The pack-format
documentation is a bit vague, but a quick read of unpack_raw_entry and
unpack_entry_data in builtin/index-pack.c seems to confirm that this is
how it works.

Take that response with a grain of salt, though. That is just from my
quick read of the code, so I could be wrong.
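
A small experiment shows why no stored compressed length is needed. This is a sketch using Python's zlib module, not Git's own code, and `inflate_one` is a hypothetical helper name. The decompressor stops at the end of the stream on its own and reports the leftover bytes, which tells us where the next entry would start.

```python
import zlib


def inflate_one(buf: bytes, start: int):
    """Inflate the single zlib stream that begins at buf[start].

    Returns (inflated_data, end_offset), where end_offset is the position
    of the first byte after the stream.
    """
    d = zlib.decompressobj()
    data = d.decompress(buf[start:])
    if not d.eof:
        raise ValueError("truncated zlib stream")
    # Anything the decompressor did not consume belongs to whatever follows.
    end = len(buf) - len(d.unused_data)
    return data, end


# Two streams written back to back, with no lengths stored anywhere.
first = zlib.compress(b"first object " * 10)
second = zlib.compress(b"second object")
blob = first + second

data1, end1 = inflate_one(blob, 0)
data2, end2 = inflate_one(blob, end1)
```

Here `end1` equals `len(first)` even though that number was never written into `blob`. The cost is the one the Git Book hints at: you have to inflate each object to learn where it ends.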

> Recall that the first paragraph quoted above says that the index can be
> built from the packfile, as opposed to it being essential to reading the
> packfile. Is one of these paragraphs incorrect?

No, if I'm correct, it is just that there is an extra header that
neither mentions. :)

> The Git documentation on the pack format² mentions that the packed
> object headers represent the lengths as variable-sized integers
> 
> ,----
> | n-byte type and length (3-bit type, (n-1)*7+4-bit length)
> `----
> 
> but it doesn't say whether that's the number of (deflated) payload bytes
> or the inflated object size, as the Git Book asserts.

That should be the inflated object size.
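
For reference, the "n-byte type and length" header can be decoded as in the sketch below. It follows the layout quoted above (3-bit type and 4 size bits in the first byte, 7 more size bits in each continuation byte, least significant bits first); `parse_entry_header` is a name made up for this example.

```python
def parse_entry_header(buf: bytes, pos: int = 0):
    """Decode a pack entry header starting at buf[pos].

    Returns (obj_type, inflated_size, next_pos). The high bit of each byte
    says whether another size byte follows.
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3-bit type
    size = byte & 0x0F              # low 4 bits of the size
    shift = 4
    while byte & 0x80:              # continuation bit set: read 7 more bits
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos
```

For instance, the two bytes 0xBC 0x12 decode to type 3 (a blob) with an inflated size of 300. Nothing in this header says how many deflated bytes follow.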

> I imagine that if the format is meant to record the size of the deflated
> payload, then it would be challenging to compress the data straight into
> the packfile, because one wouldn't know the final size until it was
> written, which means that one wouldn't know how many bytes will be
> necessary to write its length in the header, which means one wouldn't
> know where to start writing the deflated payload.

I believe zlib handles streaming it out for us. I'm not too familiar
with zlib's format, but I assume it outputs in chunks with occasional
headers. So finding the end of the stream means reading through the
whole stream and skipping past each chunk.
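
The writing side can be sketched the same way. Because the header records the inflated size, which is known before compression starts, the header can be written first and the deflated bytes streamed out after it in whatever chunks zlib produces. This is an illustration in Python, not what pack-objects actually does; `encode_entry_header`, `write_entry` and the chunk size are assumptions of the sketch.

```python
import io
import zlib


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Encode the 3-bit type and the inflated size as a variable-length header."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)     # more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def write_entry(out, obj_type: int, data: bytes, chunk: int = 4096) -> None:
    """Write header, then stream the deflated payload without knowing its final size."""
    out.write(encode_entry_header(obj_type, len(data)))
    c = zlib.compressobj()
    for i in range(0, len(data), chunk):
        out.write(c.compress(data[i:i + chunk]))
    out.write(c.flush())            # emits the final block and the checksum


sink = io.BytesIO()
payload = b"x" * 300
write_entry(sink, 3, payload, chunk=64)
raw = sink.getvalue()
```

The deflated length is never computed or stored. A reader recovers the boundary by inflating until the stream reports that it has ended.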

> Are there any other clarifying documents you can recommend to understand
> the design?

Not that I know of; what's in docs/technical is generally authoritative,
short of reading the code.

-Peff
