Confused over packfile and index design

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Confused over packfile and index design
@ 2011-04-08 23:58 Steven E. Harris
  2011-04-09  0:20 ` Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Steven E. Harris @ 2011-04-08 23:58 UTC (permalink / raw)
  To: git

I was reading the Git Book discussion¹ on the packfile and index formats,
and there's a confusing set of assertions concerning the design choices
that sound contradictory.

First, near the end of the section about the index format, we find the
following paragraph:

,----
| Importantly, packfile indexes are /not/ neccesary to extract objects
| from a packfile, they are simply used to quickly retrieve individual
| objects from a pack. The packfile format is used in upload-pack and
| receieve-pack programs (push and fetch protocols) to transfer objects
| and there is no index used then - it can be built after the fact by
| scanning the packfile.
`----

That suggests that it's possible to read the packfile linearly and
deduce where the various objects start and end, without the index
available.

Later, in the section on the packfile format, we find this:

,----
| It is important to note that the size specified in the header data is
| not the size of the data that actually follows, but the size of that
| data /when expanded/. This is why the offsets in the packfile index are
| so useful, otherwise you have to expand every object just to tell when
| the next header starts.
`----

Now that makes it sound like without the index, even if one knows where
a packed object starts, reading its header tells its /inflated/ size,
/not/ the number of remaining payload bytes representing the object. If
that's true, then how does one figure out where one object ends and the
next one begins /without the index/?

Recall that the first paragraph quoted above says that the index can be
built from the packfile, as opposed to it being essential to reading the
packfile. Is one of these paragraphs incorrect?

The Git documentation on the pack format² mentions that the packed
object headers represent the lengths as variable-sized integers

,----
| n-byte type and length (3-bit type, (n-1)*7+4-bit length)
`----

but it doesn't say whether that's the number of (deflated) payload bytes
or the inflated object size, as the Git Book asserts.

I imagine that if the format is meant to record the size of the deflated
payload, then it would be challenging to compress the data straight into
the packfile, because one wouldn't know the final size until it was
written, which means that one wouldn't know how many bytes will be
necessary to write its length in the header, which means one wouldn't
know where to start writing the deflated payload.

Are there any other clarifying documents you can recommend to understand
the design?

Footnotes: 
¹ http://book.git-scm.com/7_the_packfile.html
² http://www.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt

-- 
Steven E. Harris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-08 23:58 Confused over packfile and index design Steven E. Harris
@ 2011-04-09  0:20 ` Jeff King
  2011-04-09  2:07 ` Shawn Pearce
  2011-04-10  2:08 ` Nicolas Pitre
  2 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2011-04-09  0:20 UTC (permalink / raw)
  To: Steven E. Harris; +Cc: git

On Fri, Apr 08, 2011 at 07:58:41PM -0400, Steven E. Harris wrote:

> ,----
> | Importantly, packfile indexes are /not/ neccesary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receieve-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
> 
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

Yes. For example, when we do a "git fetch", we get _just_ the packfile
and create our own local index.

> Later, in the section on the packfile format, we find this:
> 
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
> 
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object. If
> that's true, then how does one figure out where one object ends and the
> next one begins /without the index/?

The actual object data (whether it is the object itself or a delta) is
all zlib-encoded, so it has its own size header and checksum there, I
believe. The pack-format documentation is a bit vague, but a quick read
of unpack_raw_entry and unpack_entry_data in builtin/index-pack.c seems
to confirm that this is how it works.

Take that response with a grain of salt, though. That is just from my
quick read of the code, so I could be wrong.

> Recall that the first paragraph quoted above says that the index can be
> built from the packfile, as opposed to it being essential to reading the
> packfile. Is one of these paragraphs incorrect?

No, if I'm correct, it is just that there is an extra header that
neither mentions. :)

> The Git documentation on the pack format² mentions that the packed
> object headers represent the lengths as variable-sized integers
> 
> ,----
> | n-byte type and length (3-bit type, (n-1)*7+4-bit length)
> `----
> 
> but it doesn't say whether that's the number of (deflated) payload bytes
> or the inflated object size, as the Git Book asserts.

That should be the inflated object size.

> I imagine that if the format is meant to record the size of the deflated
> payload, then it would be challenging to compress the data straight into
> the packfile, because one wouldn't know the final size until it was
> written, which means that one wouldn't know how many bytes will be
> necessary to write its length in the header, which means one wouldn't
> know where to start writing the deflated payload.

I believe zlib handles streaming it out for us. I'm not too familiar
with zlib's format, but I assume it outputs in chunks with occasional
headers. So finding the end of stream means while reading through the
whole stream and skipping past each chunk.

> Are there any other clarifying documents you can recommend to understand
> the design?

Not that I know of; what's in docs/technical is generally authoritative,
except for reading the code.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-08 23:58 Confused over packfile and index design Steven E. Harris
  2011-04-09  0:20 ` Jeff King
@ 2011-04-09  2:07 ` Shawn Pearce
  2011-04-09 14:30   ` Steven E. Harris
  2011-04-10  2:08 ` Nicolas Pitre
  2 siblings, 1 reply; 7+ messages in thread
From: Shawn Pearce @ 2011-04-09  2:07 UTC (permalink / raw)
  To: Steven E. Harris; +Cc: git

On Fri, Apr 8, 2011 at 19:58, Steven E. Harris <seh@panix.com> wrote:
> I was reading the Git Book discussion¹ on the packfile and index formats,
> and there's a confusing set of assertions concerning the design choices
> that sound contradictory.

Its not.

> First, near the end of the section about the index format, we find the
> following paragraph:
>
> ,----
> | Importantly, packfile indexes are /not/ neccesary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receieve-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
>
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

It is possible to do this.

Applications can scan the pack file by reading the 12 byte fixed
header and getting the object count from the 2nd word. Then enter a
loop that reads that many objects from the stream, before reading the
trailer SHA-1 checksum.

To read an object, the object header is consumed, reading the inflated
length from the variable length field. If the type code indicates the
object is a delta, the delta base reference is also read. Then
remaining bytes are shoved into a libz inflate() routine until libz
says the stream is over. As Peff mentioned elsewhere in the thread,
libz maintains its own markers and checksum to know when the object's
stream is over. As a safety measure, the inflated length from the
object header is checked against the number of bytes returned by libz.
Any remaining data that libz didn't consume is the next object's
header and data.

> Later, in the section on the packfile format, we find this:
>
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
>
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object.

Yes.

> I imagine that if the format is meant to record the size of the deflated
> payload,

Its not. Its meant to tell us how many bytes to malloc() in order to
hold the result of the libz inflate() call when the object is being
read from the packfile. That way we don't under or over allocate the
result buffer.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-09  2:07 ` Shawn Pearce
@ 2011-04-09 14:30   ` Steven E. Harris
  2011-04-09 14:45     ` Shawn Pearce
  0 siblings, 1 reply; 7+ messages in thread
From: Steven E. Harris @ 2011-04-09 14:30 UTC (permalink / raw)
  To: git

Shawn Pearce <spearce@spearce.org> writes:

> Then remaining bytes are shoved into a libz inflate() routine until
> libz says the stream is over. As Peff mentioned elsewhere in the
> thread, libz maintains its own markers and checksum to know when the
> object's stream is over.

Ah, so even though you as the caller don't know how much data to feed to
libz, so long as you continue feeding it until it signals completion, it
will figure it out and tell you how much data it needed after all.

> As a safety measure, the inflated length from the object header is
> checked against the number of bytes returned by libz.  Any remaining
> data that libz didn't consume is the next object's header and data.

I see. This means that it's the packed object's "job" -- or, rather, the
job of the parser for the packed object -- to determine the payload
length. If the data was not compressed, then perhaps the deflated size
indicated in the header could provide sufficient framing, but for now we
don't need to worry about such flexibility.

[...]

> Its meant to tell us how many bytes to malloc() in order to hold the
> result of the libz inflate() call when the object is being read from
> the packfile. That way we don't under or over allocate the result
> buffer.

Does Git always inflate the objects into an in-memory buffer? As the
size of these objects can be very large (given the variable-length size
encoding), is there any provision to inflate the object to a temporary
file?

-- 
Steven E. Harris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-09 14:30   ` Steven E. Harris
@ 2011-04-09 14:45     ` Shawn Pearce
  0 siblings, 0 replies; 7+ messages in thread
From: Shawn Pearce @ 2011-04-09 14:45 UTC (permalink / raw)
  To: Steven E. Harris; +Cc: git

On Sat, Apr 9, 2011 at 10:30, Steven E. Harris <seh@panix.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>> Its meant to tell us how many bytes to malloc() in order to hold the
>> result of the libz inflate() call when the object is being read from
>> the packfile. That way we don't under or over allocate the result
>> buffer.
>
> Does Git always inflate the objects into an in-memory buffer?

Yes.

> As the
> size of these objects can be very large (given the variable-length size
> encoding), is there any provision to inflate the object to a temporary
> file?

Not currently. If you don't have enough memory for the malloc() buffer
of a big object, Git dies with an out of memory error.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-08 23:58 Confused over packfile and index design Steven E. Harris
  2011-04-09  0:20 ` Jeff King
  2011-04-09  2:07 ` Shawn Pearce
@ 2011-04-10  2:08 ` Nicolas Pitre
  2011-04-10 20:10   ` Steven E. Harris
  2 siblings, 1 reply; 7+ messages in thread
From: Nicolas Pitre @ 2011-04-10  2:08 UTC (permalink / raw)
  To: Steven E. Harris; +Cc: git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4028 bytes --]

On Fri, 8 Apr 2011, Steven E. Harris wrote:

> I was reading the Git Book discussion¹ on the packfile and index formats,
> and there's a confusing set of assertions concerning the design choices
> that sound contradictory.
> 
> First, near the end of the section about the index format, we find the
> following paragraph:
> 
> ,----
> | Importantly, packfile indexes are /not/ neccesary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receieve-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
> 
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

Exact.

> Later, in the section on the packfile format, we find this:
> 
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
> 
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object. If
> that's true, then how does one figure out where one object ends and the
> next one begins /without the index/?

There is a reason why we use a pack index.  It is not essential to have 
it but it is extremely convenient.  Because to know exactly where one 
object ends and therefore where the next one starts, we do have to 
inflate every object.  So the idea is to do that once to construct the 
pack index and allow for random access once the index is available.  
Accessing a particular object without the pack index would be extremely 
costly otherwise, especially if it is towards the end of the pack.

The reason for storing only the expanded data size is to have the exact 
buffer size allocated for the inflated data.  The zlib stream that 
follows is encoded to consume only the needed data to produce the 
inflated object.  When the output buffer is all used, the zlib library 
should flag the end of the deflated stream.  If not then there is an 
error in the pack data.

> Recall that the first paragraph quoted above says that the index can be
> built from the packfile, as opposed to it being essential to reading the
> packfile. Is one of these paragraphs incorrect?

Well... in practice the index is pretty much essential if you want to 
read any random object from the pack.  But the index can be recreated 
at anytime simply by reading all objects sequentially from the pack.

> The Git documentation on the pack format² mentions that the packed
> object headers represent the lengths as variable-sized integers
> 
> ,----
> | n-byte type and length (3-bit type, (n-1)*7+4-bit length)
> `----
> 
> but it doesn't say whether that's the number of (deflated) payload bytes
> or the inflated object size, as the Git Book asserts.

It is the inflated object size.

> I imagine that if the format is meant to record the size of the deflated
> payload, then it would be challenging to compress the data straight into
> the packfile, because one wouldn't know the final size until it was
> written, which means that one wouldn't know how many bytes will be
> necessary to write its length in the header, which means one wouldn't
> know where to start writing the deflated payload.

Exact.  And we also want to be able to construct a pack on the fly and 
stream it over a network connection without having to seek back.

> Are there any other clarifying documents you can recommend to understand
> the design?

When in doubt, the code is always the ultimate source of information.


Nicolas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Confused over packfile and index design
  2011-04-10  2:08 ` Nicolas Pitre
@ 2011-04-10 20:10   ` Steven E. Harris
  0 siblings, 0 replies; 7+ messages in thread
From: Steven E. Harris @ 2011-04-10 20:10 UTC (permalink / raw)
  To: git

Nicolas Pitre <nico@fluxnic.net> writes:

> So the idea is to do that once to construct the pack index and allow
> for random access once the index is available.  Accessing a particular
> object without the pack index would be extremely costly otherwise,
> especially if it is towards the end of the pack.

Thanks for the explanation. It's clear now.

> The reason for storing only the expanded data size is to have the
> exact buffer size allocated for the inflated data.  The zlib stream
> that follows is encoded to consume only the needed data to produce the
> inflated object.  When the output buffer is all used, the zlib library
> should flag the end of the deflated stream.  If not then there is an
> error in the pack data.

That provides some error checking, then, as we trust zlib to know when
it's had enough input, and we have to trust its assessment on how much
is enough, given the lack of delimiting or framing in the packfile
format.

By the way, I looked over the zlib manual¹, and I see that many of the
inflating/decompressing functions require the caller to specify the
number of input bytes available. There is inflateBack() that uses
callback functions to request more data upon underflow. The higher-level
inflate() function also looks like it can be called in a loop, refilling
the input buffer upon underflow. Is Git using one of these two functions
here?

[...]

> When in doubt, the code is always the ultimate source of information.

Yes, I need to learn my way around in there to find the call sites
relevant to this discussion.

Footnotes: 
¹ http://www.zlib.net/manual.html

-- 
Steven E. Harris

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-04-10 20:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-08 23:58 Confused over packfile and index design Steven E. Harris
2011-04-09  0:20 ` Jeff King
2011-04-09  2:07 ` Shawn Pearce
2011-04-09 14:30   ` Steven E. Harris
2011-04-09 14:45     ` Shawn Pearce
2011-04-10  2:08 ` Nicolas Pitre
2011-04-10 20:10   ` Steven E. Harris

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).