git@vger.kernel.org mailing list mirror (one of many)
* pack file object size question
@ 2018-12-16 21:52 Farhan Khan
  2018-12-16 22:14 ` Jonathan Nieder
  2018-12-17 19:39 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Farhan Khan @ 2018-12-16 21:52 UTC (permalink / raw)
  To: git

Hi all,

I am trying to write an implementation of "git index-pack" and having
a bit of trouble with understanding the ".pack" format. Specifically,
I am having trouble figuring out the boundary between two objects in
the pack file.

It seems that there is a 12 byte header (signature, version, number of
objects), then it immediately jumps into each individual object. The
object consists of the object header, then the zlib deflated object,
followed by a SHA1 of the above. Is this accurate? If so, where is the
size of the entire object (object header + zlib deflated object +
sha) identified in the git source? I tracked it down to what I believe is
builtin/index-pack.c under the function parse_pack_objects, the
for-loop currently in line 1138, but I cannot find where that object
size is calculated for the next iteration of the loop.
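
For reference, the 12-byte header described above can be read as three
big-endian fields. This is a minimal Python sketch, not git's code; the
function name parse_pack_header is hypothetical, and it assumes the pack
versions 2 and 3 listed in pack-format.txt.

```python
import struct

def parse_pack_header(data):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    if len(data) < 12:
        raise ValueError("pack header needs 12 bytes")
    # 4-byte signature, 4-byte version, 4-byte object count, network order.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("bad pack signature")
    if version not in (2, 3):
        raise ValueError("unsupported pack version %d" % version)
    return version, count

# Usage: a hand-built header announcing version 2 and 3 objects.
version, count = parse_pack_header(b"PACK" + struct.pack(">II", 2, 3))
print(version, count)
```

The first object entry starts at byte offset 12, immediately after these
fields.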

I think what I most specifically need is where the size of the
deflated object is identified.

Thanks,
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE


* Re: pack file object size question
  2018-12-16 21:52 pack file object size question Farhan Khan
@ 2018-12-16 22:14 ` Jonathan Nieder
  2018-12-16 23:36   ` Farhan Khan
  2018-12-17 19:39 ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Jonathan Nieder @ 2018-12-16 22:14 UTC (permalink / raw)
  To: Farhan Khan; +Cc: git

Hi,

Farhan Khan wrote:

> I am trying to write an implementation of "git index-pack" and having
> a bit of trouble with understanding the ".pack" format. Specifically,
> I am having trouble figuring out the boundary between two objects in
> the pack file.

Have you seen Documentation/technical/pack-format.txt?  If so, do you
have ideas for improving it?

Thanks,
Jonathan


* Re: pack file object size question
  2018-12-16 22:14 ` Jonathan Nieder
@ 2018-12-16 23:36   ` Farhan Khan
  2018-12-17  0:14     ` Jonathan Nieder
  0 siblings, 1 reply; 6+ messages in thread
From: Farhan Khan @ 2018-12-16 23:36 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git

On Sun, Dec 16, 2018 at 5:15 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi,
>
> Farhan Khan wrote:
>
> > I am trying to write an implementation of "git index-pack" and having
> > a bit of trouble with understanding the ".pack" format. Specifically,
> > I am having trouble figuring out the boundary between two objects in
> > the pack file.
>
> Have you seen Documentation/technical/pack-format.txt?  If so, do you
> have ideas for improving it?
>
> Thanks,
> Jonathan

Hi Jonathan,

Yes, I have. I think the issue is, the compressed object has a fixed
size and git inflates it, then moves on to the next object. I am
trying to figure out where it identifies the size of the object.
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE


* Re: pack file object size question
  2018-12-16 23:36   ` Farhan Khan
@ 2018-12-17  0:14     ` Jonathan Nieder
  2018-12-17 15:31       ` Duy Nguyen
  0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Nieder @ 2018-12-17  0:14 UTC (permalink / raw)
  To: Farhan Khan; +Cc: git

Hi,

Farhan Khan wrote:
>> Farhan Khan wrote:

>>> I am having trouble figuring out the boundary between two objects in
>>> the pack file.
[...]
>              I think the issue is, the compressed object has a fixed
> size and git inflates it, then moves on to the next object. I am
> trying to figure out where it identifies the size of the object.

Do you mean the compressed size or uncompressed size?

It sounds to me like pack-format.txt needs to do a better job of
distinguishing the two.  Under "Pack file entry", I see

| Pack file entry: <+
|
|    packed object header:
|	1-byte size extension bit (MSB)
|	       type (next 3 bit)
|	       size0 (lower 4-bit)
|	n-byte sizeN (as long as MSB is set, each 7-bit)
|		size0..sizeN form 4+7+7+..+7 bit integer, size0
|		is the least significant part, and sizeN is the
|		most significant part.
|    packed object data:
|	If it is not DELTA, then deflated bytes (the size above
|		is the size before compression).
|	If it is REF_DELTA, then
|	  20-byte base object name SHA-1 (the size above is the
|		size of the delta data that follows).
|	  delta data, deflated.
|	If it is OFS_DELTA, then
|	  n-byte offset (see below) interpreted as a negative
|		offset from the type-byte of the header of the
|		ofs-delta entry (the size above is the size of
|		the delta data that follows).
|	  delta data, deflated.

which suggests that the "length" field is something between the two:
it is the size of the inflated form of the packed object data, before
resolving deltas.  It's useful for allocating a buffer to inflate
into.
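
The header layout quoted above can be decoded along these lines. This is
a minimal Python sketch rather than git's implementation; the name
read_entry_header and its return convention are assumptions made for
illustration.

```python
def read_entry_header(buf, pos=0):
    """Decode the packed object header starting at buf[pos].

    Returns (obj_type, size, next_pos). size is the inflated size of the
    data that follows, and next_pos is the offset just past the header.
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3 type bits after the MSB
    size = byte & 0x0F              # size0: the low 4 bits
    shift = 4
    while byte & 0x80:              # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift   # each sizeN adds 7 more bits
        shift += 7
    return obj_type, size, pos

# Usage: 0x95 0x0A encodes type 1 with size 5 + (10 << 4) = 165.
print(read_entry_header(bytes([0x95, 0x0A])))
```

Note that the decoded size says nothing about how many compressed bytes
follow; it only tells you how large a buffer the inflated data needs.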

The zlib container format (https://tools.ietf.org/html/rfc1950) does
not contain size information, so I believe you'll have to use a
"deflate" (https://tools.ietf.org/html/rfc1951) decoder such as zlib
to find the end of the deflated bytes.
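
One way to let the decoder find that boundary, sketched in Python with
the standard zlib module. The helper name inflate_entry is hypothetical;
it relies on zlib.decompressobj() stopping at the end of the stream and
leaving any following bytes in unused_data.

```python
import zlib

def inflate_entry(buf, pos):
    """Inflate one zlib stream starting at buf[pos].

    Returns (data, next_pos), where next_pos is the offset of the first
    byte after the compressed stream, i.e. where the next entry begins.
    """
    d = zlib.decompressobj()
    data = d.decompress(buf[pos:])
    if not d.eof:
        raise ValueError("truncated zlib stream")
    # Whatever the decoder did not consume belongs to later entries.
    next_pos = len(buf) - len(d.unused_data)
    return data, next_pos

# Usage: two streams back to back, as object data would be in a pack.
first = zlib.compress(b"hello")
buf = first + zlib.compress(b"world!")
data, next_pos = inflate_entry(buf, 0)
print(data, next_pos == len(first))
```

Slicing buf[pos:] copies the rest of the buffer, which is fine for a
sketch but wasteful on a large pack; a real reader would feed the
decoder in chunks.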

In index-pack, you need to inflate the objects anyway.  In random
lookups, the idx file tells you where to look, so it doesn't come up
there, either.  So this would only be expected to come up if you are
doing a sort of partial index-pack that wants to skip some objects.

Thanks and hope that helps,
Jonathan


* Re: pack file object size question
  2018-12-17  0:14     ` Jonathan Nieder
@ 2018-12-17 15:31       ` Duy Nguyen
  0 siblings, 0 replies; 6+ messages in thread
From: Duy Nguyen @ 2018-12-17 15:31 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Farhan Khan, git

On Sun, Dec 16, 2018 at 04:14:46PM -0800, Jonathan Nieder wrote:
> Hi,
> 
> Farhan Khan wrote:
> >> Farhan Khan wrote:
> 
> >>> I am having trouble figuring out the boundary between two objects in
> >>> the pack file.
> [...]
> >              I think the issue is, the compressed object has a fixed
> > size and git inflates it, then moves on to the next object. I am
> > trying to figure out where it identifies the size of the object.
> 
> Do you mean the compressed size or uncompressed size?
> 
> It sounds to me like pack-format.txt needs to do a better job of
> distinguishing the two.

How about something like this?

I mostly wrote this based on memory (and a very quick look at
index-pack) but I think we never ever really stored compressed
sizes. The "length" field (even in loose format) is always about
uncompressed size.

-- 8< --
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index cab5bdd2ff..4fd49f61d6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -31,6 +31,11 @@ Git pack format
 	 is an OBJ_OFS_DELTA object
      compressed delta data
 
+     Note: The length (in bytes) is of uncompressed objects or
+     deltified representation. We're supposed to reach the end of zlib
+     stream once we have inflated the given length, otherwise it's a
+     corrupted pack file.
+
      Observation: length of each object is encoded in a variable
      length format and is not constrained to 32-bit or anything.
 
@@ -199,7 +204,8 @@ Pack file entry: <+
 		is the size before compression).
 	If it is REF_DELTA, then
 	  20-byte base object name SHA-1 (the size above is the
-		size of the delta data that follows).
+		size of the delta data that follows, before
+		compression).
           delta data, deflated.
 	If it is OFS_DELTA, then
 	  n-byte offset (see below) interpreted as a negative
-- 8< --
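
The check described in the proposed note could look roughly like this in
Python. It is a sketch under stated assumptions, not what index-pack
does internally: inflate_checked is a hypothetical name, and
expected_size stands for the length decoded from the entry header.

```python
import zlib

def inflate_checked(buf, pos, expected_size):
    """Inflate the stream at buf[pos] and verify its declared length.

    Returns (data, next_pos). Raises ValueError if the stream does not
    end after exactly expected_size inflated bytes.
    """
    d = zlib.decompressobj()
    # Allow one byte more than declared so an oversized stream is
    # noticed without inflating all of it.
    data = d.decompress(buf[pos:], expected_size + 1)
    if len(data) != expected_size or not d.eof:
        raise ValueError("corrupt pack entry: inflated size mismatch")
    return data, len(buf) - len(d.unused_data)

# Usage: a 1000-byte payload followed by unrelated bytes.
stream = zlib.compress(b"x" * 1000)
data, next_pos = inflate_checked(stream + b"NEXT", 0, 1000)
print(len(data), next_pos == len(stream))
```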


* Re: pack file object size question
  2018-12-16 21:52 pack file object size question Farhan Khan
  2018-12-16 22:14 ` Jonathan Nieder
@ 2018-12-17 19:39 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2018-12-17 19:39 UTC (permalink / raw)
  To: Farhan Khan; +Cc: git

On Sun, Dec 16, 2018 at 04:52:13PM -0500, Farhan Khan wrote:

> It seems that there is a 12 byte header (signature, version, number of
> objects), then it immediately jumps into each individual object. The
> object consists of the object header, then the zlib deflated object,
> followed by a SHA1 of the above. Is this accurate?

Others discussed the length confusion, but I wanted to point out one
more thing: the packfile does not contain the sha1 of each object. That
is computed by index-pack (but there is a sha1 of the contents of the
_entire_ packfile).
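
To illustrate what "computed by index-pack" means: an object's name is
the SHA-1 of a "<type> <size>\0" header followed by the inflated
content. A minimal Python sketch for non-delta entries follows; the
names object_name and TYPE_NAMES are hypothetical, and delta entries
would have to be resolved against their base before hashing.

```python
import hashlib

# Pack type numbers for the four non-delta object types.
TYPE_NAMES = {1: b"commit", 2: b"tree", 3: b"blob", 4: b"tag"}

def object_name(obj_type, data):
    """Return the hex SHA-1 name of an inflated, non-delta object."""
    header = TYPE_NAMES[obj_type] + b" " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# Usage: the well-known name of the empty blob.
print(object_name(3, b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```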

A bit error on the wire will be detected by the whole-pack sha1. A bit
error on the sender's disk will generally be detected by zlib, but not
always. The ultimate check that the receiver does is make sure it has
all of the expected objects by walking the object graph from the
proposed ref updates. Any object which has an undetected bit error will
appear to be missing (as well as any object that the sender actually
just failed to send).
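
The whole-pack check can be sketched in a few lines of Python, assuming
a SHA-1 pack whose last 20 bytes are the checksum of everything before
them. The function name pack_checksum_ok is hypothetical.

```python
import hashlib

def pack_checksum_ok(pack):
    """Return True if the 20-byte trailer matches the rest of the pack."""
    if len(pack) < 12 + 20:          # header plus trailer at minimum
        return False
    body, trailer = pack[:-20], pack[-20:]
    return hashlib.sha1(body).digest() == trailer

# Usage: an empty version-2 pack (header only) with a valid trailer.
body = b"PACK" + (2).to_bytes(4, "big") + (0).to_bytes(4, "big")
pack = body + hashlib.sha1(body).digest()
print(pack_checksum_ok(pack))
```

This only covers corruption of the pack bytes as a whole; it does not
replace the object graph walk described above.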

-Peff

