git@vger.kernel.org mailing list mirror (one of many)
* Confusion about the PACK format
@ 2019-02-10 16:02 Florian Steenbuck
  2019-02-10 19:05 ` Ramsay Jones
  0 siblings, 1 reply; 4+ messages in thread
From: Florian Steenbuck @ 2019-02-10 16:02 UTC
  To: git

Hello to all,

I am trying to understand the git protocol, looking only at the server
side. I started without reading any docs, which turned out fine until I
got to the PACK format (a pretty early failure, I know).

I have read this documentation:
https://raw.githubusercontent.com/git/git/c4df23f7927d8d00e666a3c8d1b3375f1dc8a3c1/Documentation/technical/pack-format.txt

But I am confused by some parts of this text.

The basic header is no problem, but I got stuck while trying to
read the type and length of the objects, which are ints that are
encoded in 3 bits and 4 bits. The question is where and how?

I tried to parse the ints from the beginning of the bits:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
1 - 3 = type
4 - 7 = len
(I am using a Python expression to convert the bits to an int, which is
equivalent to:
int('{0:08b}'.format(raw_objects[offset])[x:y], 2)
)
As mentioned in the doc, n is part of the len calculation. The question
of interest is why? And what is n?
I interpreted it as the type, since it is a number and the field is
prefixed with "n-byte".

This requires me to do one more step:
len = (type-1)*len

This ends in an endless loop, because my byte offset never reaches the
end of the data, since the type is often 0.

I then tried to interpret it in a different way: I now take two bytes,
and from each byte I take the trailing bits to get the type and len:

byte a (equal to [offset+0])
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
6 - 8 = type

byte b (equal to [offset+1])
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
5 - 8 = len

In my case this stops the endless loop, but it does not seem to give
the correct types (types zero and five appear everywhere) or the
correct object sizes (I use the length calculated as before).
There are multiple errors when I try to decompress the data.

Now, to cover the other topic, error handling: what is the suggested
way to handle a type 0 or type 5, and what about type < 0 and type > 5?

Type 0 is invalid; should the parsing fail here?
Type 5 is reserved for future expansion; where should we continue then?

Also, I do not understand this:
`a negative relative offset from the delta object's position in the
pack if this is an OBJ_OFS_DELTA object`
Which relative offset do I need to check here?

Kind Regards
Florian


* Re: Confusion about the PACK format
  2019-02-10 16:02 Confusion about the PACK format Florian Steenbuck
@ 2019-02-10 19:05 ` Ramsay Jones
  2019-02-10 19:35   ` Ramsay Jones
  0 siblings, 1 reply; 4+ messages in thread
From: Ramsay Jones @ 2019-02-10 19:05 UTC
  To: Florian Steenbuck, git



On 10/02/2019 16:02, Florian Steenbuck wrote:
> Hello to all,
> 
> I am trying to understand the git protocol, looking only at the server
> side. I started without reading any docs, which turned out fine until I
> got to the PACK format (a pretty early failure, I know).
> 
> I have read this documentation:
> https://raw.githubusercontent.com/git/git/c4df23f7927d8d00e666a3c8d1b3375f1dc8a3c1/Documentation/technical/pack-format.txt
> 
> But I am confused by some parts of this text.
> 
> The basic header is no problem, but I got stuck while trying to
> read the type and length of the objects, which are ints that are
> encoded in 3 bits and 4 bits. The question is where and how?
> 

Hmm, the 'type and length' encoding could be described more clearly!
Hopefully, just on this issue, the following could help:

In my git.git repo, which is fully packed, I have a single pack file, with

  $ git count-objects -v
  count: 0
  size: 0
  in-pack: 270277
  packs: 1
  size-pack: 101929
  prune-packable: 0
  garbage: 0
  size-garbage: 0
  $ 

... 270277 objects in it. The beginning of the file looks like:

  $ xxd .git/objects/pack/pack-d554e6d8335601c2525b40487faf36493094ab50.pack | head
  00000000: 5041 434b 0000 0002 0004 1fc5 9d13 789c  PACK..........x.
  00000010: 9d8f cd6a c330 1084 ef7a 8a3d 171a b4ab  ...j.0...z.=....
  00000020: 9525 8750 0abd 945c f304 ab95 5cfb 602b  .%.P...\....\.`+
  00000030: b84a 7fde 3e2a 943e 406f c3f0 cd30 d3f6  .J..>*.>@o...0..
  00000040: 5260 741a 5025 92e2 1458 917c c294 a3c3  R`t.P%...X.|....
  00000050: 4803 e521 395f c2d8 4d73 95bd 6c0d 82f5  H..!9_..Ms..l...
  00000060: 6172 310f 0529 7a2f d6a7 40c5 d9a0 d185  ar1..)z/..@.....
  00000070: 622d 8789 9cb8 3f1e 5132 6366 4de4 8531  b-....?.Q2cfM..1
  00000080: 114a 70ec 9447 2f5a 526f e29c 3847 23b7  .Jp..G/ZRo..8G#.
  00000090: 36d7 1dce b76d a9f0 02af b2ca 56e1 f4b6  6....m......V...
  $ 

You can see the header, which consists of 3 32-bit values, where the
packfile signature is the '5041 434b', then the version number which
is '0000 0002', followed by the number of objects '0004 1fc5' which
is 270277. Next comes the first 'object entry', which starts '9d13'.
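
As a rough illustration of the header layout described above, here is a
minimal Python sketch that reads those three 32-bit fields. The function
name parse_pack_header is made up for this example, and it assumes the
fields are stored in network (big-endian) byte order, as the xxd dump
suggests:

```python
import struct


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    # Three big-endian fields: a 4-byte signature, then two unsigned
    # 32-bit integers (version number and number of objects).
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file: bad signature")
    return version, count


# The first 12 bytes of the xxd dump above.
header = bytes.fromhex("5041434b0000000200041fc5")
print(parse_pack_header(header))  # (2, 270277)
```

The first object entry then starts at byte offset 12.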

Now, the 'n-byte type and length' is a variable length encoding of
the object type and length. The number of bytes used to encode this
data is content dependent. If the top bit of a byte is set, then we
need to process the next byte, otherwise we are done. So, looking
at the first 'object entry' byte (at offset 12) '9d', we take the
top nibble, remove the top bit, and shift right 4 bits to get the
object type, i.e. (0x9d >> 4) & 7, which gives an object type of 1
(which is a commit object). The lower nibble of the first byte
contains the first (or only) 4 bits of the size, here (0x9d & 15)
which is 0xd. Given that the top bit of this byte is set, we now
process the next byte. After the first byte, each byte contains 7
bits of the size field which is combined with the value from the
previous byte by shifting and adding (first by 4 bits, then 11, 18,
25 etc.). So, in this case we have (0x13 << 4) + 0xd = 317.
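
The decoding steps above could be sketched in Python roughly as follows.
This is only an illustration, not code taken from git; the function name
read_type_and_size is made up. Note that the continuation (top) bit of
each later byte is masked off before its 7 bits are shifted into the
size:

```python
def read_type_and_size(data: bytes, pos: int):
    """Decode the variable-length type/size field starting at data[pos].

    Returns (obj_type, size, next_pos), where next_pos is the offset of
    the first byte after the field.
    """
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 7  # bits 4-6 of the first byte
    size = byte & 15            # low nibble: the first 4 bits of the size
    shift = 4
    while byte & 0x80:          # top bit set: another size byte follows
        byte = data[pos]
        pos += 1
        size += (byte & 0x7F) << shift  # 7 more size bits per byte
        shift += 7
    return obj_type, size, pos


# Bytes 12..15 of the xxd dump above: '9d13' followed by '789c'.
first_entry = bytes.fromhex("9d13789c")
print(read_type_and_size(first_entry, 0))  # (1, 317, 2)
```

The returned next_pos (2 here, i.e. file offset 14) is where the
compressed data begins.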

The compressed data follows, '789c' ...
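
The size decoded from the entry header is the inflated size, so it does
not tell you where the compressed stream ends. One possible approach,
sketched here in Python with a made-up helper name (inflate_entry) and
synthetic data rather than the real pack above, is to let zlib report
how many bytes it left unconsumed:

```python
import zlib


def inflate_entry(data: bytes, pos: int, expected_size: int):
    """Inflate one zlib stream starting at data[pos].

    Returns (content, next_pos), where next_pos is the offset just past
    the end of the compressed stream.
    """
    inflater = zlib.decompressobj()
    content = inflater.decompress(data[pos:])
    if not inflater.eof:
        raise ValueError("truncated zlib stream")
    if len(content) != expected_size:
        raise ValueError("inflated size does not match the entry header")
    # unused_data holds whatever bytes followed the end of the stream.
    next_pos = len(data) - len(inflater.unused_data)
    return content, next_pos


# Synthetic example: a compressed payload followed by two bytes that
# stand in for the start of the next entry.
payload = b"example object content\n"
buf = zlib.compress(payload) + b"\x9d\x13"
content, next_pos = inflate_entry(buf, 0, len(payload))
print(content == payload, buf[next_pos:])
```

For a large pack you would feed the decompressor in chunks instead of
passing the whole remainder of the file, but the idea is the same.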

We can use git-verify-pack to confirm the details here:

  $ git verify-pack -v .git/objects/pack/pack-d554e6d8335601c2525b40487faf36493094ab50.idx | head -n 1
  878e2cd30e1656909c5073043d32fe9d02204daa commit 317 216 12
  $ 
 
So the object 878e2cd30e, at offset 12 in the file, is a commit object
with size 317 (which has an in-pack size of 216).

Hope this helps.

ATB,
Ramsay Jones



* Re: Confusion about the PACK format
  2019-02-10 19:05 ` Ramsay Jones
@ 2019-02-10 19:35   ` Ramsay Jones
  2019-02-12  0:41     ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Ramsay Jones @ 2019-02-10 19:35 UTC
  To: Florian Steenbuck, git



On 10/02/2019 19:05, Ramsay Jones wrote:
> Now, the 'n-byte type and length' is a variable length encoding of
> the object type and length. The number of bytes used to encode this
> data is content dependent. If the top bit of a byte is set, then we
> need to process the next byte, otherwise we are done. So, looking
> at the first 'object entry' byte (at offset 12) '9d', we take the
> top nibble, remove the top bit, and shift right 4 bits to get the
> object type, i.e. (0x9d >> 4) & 7, which gives an object type of 1
> (which is a commit object). The lower nibble of the first byte
> contains the first (or only) 4 bits of the size, here (0x9d & 15)
> which is 0xd. Given that the top bit of this byte is set, we now
> process the next byte. After the first byte, each byte contains 7
> bits of the size field which is combined with the value from the
> previous byte by shifting and adding (first by 4 bits, then 11, 18,
> 25 etc.). So, in this case we have (0x13 << 4) + 0xd = 317.

Sorry, to be clear, I should have said, "mask off the top bit,
shift and add", so:

  ((0x13 & 0x7f) << 4) + 0xd = 317

ATB,
Ramsay Jones



* Re: Confusion about the PACK format
  2019-02-10 19:35   ` Ramsay Jones
@ 2019-02-12  0:41     ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2019-02-12  0:41 UTC
  To: Ramsay Jones; +Cc: Florian Steenbuck, git

On Sun, Feb 10, 2019 at 07:35:38PM +0000, Ramsay Jones wrote:

> > Now, the 'n-byte type and length' is a variable length encoding of
> > the object type and length. The number of bytes used to encode this
> > data is content dependent. If the top bit of a byte is set, then we
> > need to process the next byte, otherwise we are done. So, looking
> > at the first 'object entry' byte (at offset 12) '9d', we take the
> > top nibble, remove the top bit, and shift right 4 bits to get the
> > object type, i.e. (0x9d >> 4) & 7, which gives an object type of 1
> > (which is a commit object). The lower nibble of the first byte
> > contains the first (or only) 4 bits of the size, here (0x9d & 15)
> > which is 0xd. Given that the top bit of this byte is set, we now
> > process the next byte. After the first byte, each byte contains 7
> > bits of the size field which is combined with the value from the
> > previous byte by shifting and adding (first by 4 bits, then 11, 18,
> > 25 etc.). So, in this case we have (0x13 << 4) + 0xd = 317.
> 
> Sorry, to be clear, I should have said, "mask off the top bit,
> shift and add", so:
> 
>   ((0x13 & 0x7f) << 4) + 0xd = 317

Yes. Also, see the first 10 or so lines of builtin/index-pack.c's
unpack_raw_entry() for real-world example code.

-Peff
