git@vger.kernel.org mailing list mirror (one of many)
From: Shawn Pearce <spearce@spearce.org>
To: "Steven E. Harris" <seh@panix.com>
Cc: git@vger.kernel.org
Subject: Re: Confused over packfile and index design
Date: Fri, 8 Apr 2011 22:07:50 -0400
Message-ID: <BANLkTikXcvRf1bLJXFOHBcGcN-B0m_xSnw@mail.gmail.com>
In-Reply-To: <m2d3kw70su.fsf@Spindle.sehlabs.com>

On Fri, Apr 8, 2011 at 19:58, Steven E. Harris <seh@panix.com> wrote:
> I was reading the Git Book discussion¹ on the packfile and index formats,
> and there's a confusing set of assertions concerning the design choices
> that sound contradictory.

They're not.

> First, near the end of the section about the index format, we find the
> following paragraph:
>
> ,----
> | Importantly, packfile indexes are /not/ necessary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receive-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
>
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

It is possible to do this.

Applications can scan the pack file by reading the 12-byte fixed
header and getting the object count from the 3rd 4-byte word (the
first two words are the "PACK" signature and the version number).
Then enter a loop that reads that many objects from the stream,
before reading the trailer SHA-1 checksum.
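
As a rough illustration of the header and trailer handling, here is a
minimal Python sketch. The function names read_pack_header and
check_pack_trailer are made up for this example, and it assumes the
whole pack is already in memory and uses a 20-byte SHA-1 trailer:

```python
import hashlib
import struct


def read_pack_header(pack: bytes):
    """Return (version, object_count) from the 12-byte pack header."""
    # Three big-endian 4-byte words: signature, version, object count.
    signature, version, count = struct.unpack(">4sII", pack[:12])
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    return version, count


def check_pack_trailer(pack: bytes) -> bool:
    """True if the last 20 bytes are the SHA-1 of everything before them."""
    return hashlib.sha1(pack[:-20]).digest() == pack[-20:]
```

The count returned here is what drives the object-reading loop; the
trailer check only makes sense once all objects have been consumed.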

To read an object, the object header is consumed, reading the inflated
length from the variable length field. If the type code indicates the
object is a delta, the delta base reference is also read. Then the
remaining bytes are shoved into a libz inflate() routine until libz
says the stream is over. As Peff mentioned elsewhere in the thread,
libz maintains its own markers and checksum to know when the object's
stream is over. As a safety measure, the inflated length from the
object header is checked against the number of bytes returned by libz.
Any remaining data that libz didn't consume is the next object's
header and data.
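
The per-object step can be sketched in Python roughly as follows. This
is an illustration under assumptions, not how git itself is written:
read_object is a hypothetical name, the pack is assumed to be fully in
memory, and delta bases are only read, not resolved. It leans on
Python's zlib.decompressobj(), whose unused_data attribute holds
whatever bytes followed the end of the compressed stream:

```python
import zlib

OFS_DELTA, REF_DELTA = 6, 7  # delta type codes in the pack format


def read_object(pack: bytes, pos: int):
    """Return (type, base, data, next_pos) for the object starting at pos."""
    # Object header: 3-bit type plus the inflated size, 4 bits in the
    # first byte and 7 bits in each continuation byte (low bits first).
    byte = pack[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x7
    size = byte & 0x0F
    shift = 4
    while byte & 0x80:
        byte = pack[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7

    # Delta objects carry a base reference before the compressed data.
    base = None
    if obj_type == REF_DELTA:
        base = pack[pos:pos + 20]  # SHA-1 of the base object
        pos += 20
    elif obj_type == OFS_DELTA:
        byte = pack[pos]
        pos += 1
        base = byte & 0x7F  # backwards offset to the base object
        while byte & 0x80:
            byte = pack[pos]
            pos += 1
            base = ((base + 1) << 7) | (byte & 0x7F)

    # Feed the rest to zlib; it knows where its own stream ends.
    inflater = zlib.decompressobj()
    data = inflater.decompress(pack[pos:])
    if not inflater.eof:
        raise ValueError("truncated zlib stream")
    if len(data) != size:
        raise ValueError("inflated length does not match object header")

    # Whatever zlib did not consume belongs to the next object.
    next_pos = len(pack) - len(inflater.unused_data)
    return obj_type, base, data, next_pos
```

Calling read_object repeatedly, starting at offset 12 and passing each
returned next_pos back in, walks the pack without any index. Recording
each starting offset along the way is essentially the information an
index would hold.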

> Later, in the section on the packfile format, we find this:
>
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
>
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object.

Yes.

> I imagine that if the format is meant to record the size of the deflated
> payload,

It's not. It's meant to tell us how many bytes to malloc() in order to
hold the result of the libz inflate() call when the object is being
read from the packfile. That way we don't under- or over-allocate the
result buffer.
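
A loose Python analogy, for illustration only: zlib.decompress()
accepts a bufsize argument giving the initial size of its output
buffer, which plays the role of the malloc() size here. The helper
name inflate_exact is invented for this sketch:

```python
import zlib


def inflate_exact(deflated: bytes, inflated_size: int) -> bytes:
    """Inflate into a buffer sized up front from the object header."""
    # bufsize is only the initial output buffer size; use at least 1 so
    # that empty objects do not request a zero-sized buffer.
    data = zlib.decompress(deflated, bufsize=max(inflated_size, 1))
    if len(data) != inflated_size:
        raise ValueError("inflated length does not match object header")
    return data
```

Note that the deflated length never appears as an input; the size from
the header describes only the output.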

-- 
Shawn.
