git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Documentation errors for HTTP protocol v2 and packfile
@ 2020-12-19 17:07 Ross Light
  2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
  0 siblings, 1 reply; 8+ messages in thread
From: Ross Light @ 2020-12-19 17:07 UTC (permalink / raw)
  To: git

Hello Git mailing list,

I'm working on a library to interoperate with the Git wire protocol.
While doing so, I noticed two omissions in the documentation:

1. In the protocol-v2 doc [1], the HTTP example implies that the first
bytes in the response are "000eversion 2\n" when in fact they will be
PKT-LINE("# service=git-upload-pack" LF) followed by a flush packet,
then the version 2 data.

2. In the pack-format doc, the Deltified representation section [2]
describes the instruction sequence well, but neglects to mention the
two size varints [3] at the beginning of such an object.

I think it would be good to correct these documents for others
attempting to work with Git internals. Let me know how I can help.

-Ross Light

[1]: https://github.com/git/git/blob/ae46588be0cd730430dded4491246dfb4eac5557/Documentation/technical/protocol-v2.txt#L72-L79
[2]: https://github.com/git/git/blob/ae46588be0cd730430dded4491246dfb4eac5557/Documentation/technical/pack-format.txt#L70-L76
[3]: https://github.com/git/git/blob/ae46588be0cd730430dded4491246dfb4eac5557/patch-delta.c#L29-L36

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 0/2] pack-format.txt: document lengths at start of delta data
  2020-12-19 17:07 Documentation errors for HTTP protocol v2 and packfile Ross Light
@ 2020-12-21  7:54 ` Martin Ågren
  2020-12-21  7:54   ` [PATCH 1/2] pack-format.txt: define "varint" format Martin Ågren
                     ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Martin Ågren @ 2020-12-21  7:54 UTC (permalink / raw)
  To: Ross Light; +Cc: git

Hi Ross,

Welcome to the list and thanks for reporting your findings. I haven't
looked at all at the first issue you brought up, about the protocol-v2
doc.

On Sat, 19 Dec 2020 at 18:10, Ross Light <ross@zombiezen.com> wrote:
> 2. In the pack-format doc, the Deltified representation section [2]
> describes the instruction sequence well, but neglects to mention the
> two size varints [3] at the beginning of such an object.

Would something like this be what you have in mind?

Martin Ågren (2):
  pack-format.txt: define "varint" format
  pack-format.txt: document lengths at start of delta data

 Documentation/technical/pack-format.txt | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

-- 
2.30.0.rc1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] pack-format.txt: define "varint" format
  2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
@ 2020-12-21  7:54   ` Martin Ågren
  2020-12-21 21:40     ` Junio C Hamano
  2020-12-21  7:54   ` [PATCH 2/2] pack-format.txt: document lengths at start of delta data Martin Ågren
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Martin Ågren @ 2020-12-21  7:54 UTC (permalink / raw)
  To: Ross Light; +Cc: git

We define our varint format pretty much on the fly as we describe a pack
file entry. In preparation for referring to it in more places in this
document, define "varint" and refer to it.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
---
 Documentation/technical/pack-format.txt | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..42198de74c 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -55,6 +55,15 @@ Valid object types are:
 
 Type 5 is reserved for future expansion. Type 0 is invalid.
 
+=== Variable-length integer encoding
+
+This document uses "varint" encoding of non-negative integers: From
+each byte, the seven least significant bits are used to form the
+resulting integer. As long as the most significant bit is 1, this
+process continues; the byte with MSB 0 provides the last seven bits.
+The seven-bit chunks are concatenated. Later values are more
+significant.
+
 === Deltified representation
 
 Conceptually there are only four object types: commit, tree, tag and
@@ -196,10 +205,10 @@ Pack file entry: <+
 	1-byte size extension bit (MSB)
 	       type (next 3 bit)
 	       size0 (lower 4-bit)
-        n-byte sizeN (as long as MSB is set, each 7-bit)
-		size0..sizeN form 4+7+7+..+7 bit integer, size0
-		is the least significant part, and sizeN is the
-		most significant part.
+        n-byte size1 (varint encoding; present if MSB is set)
+        If the MSB is set, the size is size0 + 16*size1, otherwise
+        it is size0. (Equivalently, the entire packed object header
+        is a varint encoding of (size/16)*128 + type*16 + size%16.)
      packed object data:
         If it is not DELTA, then deflated bytes (the size above
 		is the size before compression).
-- 
2.30.0.rc1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] pack-format.txt: document lengths at start of delta data
  2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
  2020-12-21  7:54   ` [PATCH 1/2] pack-format.txt: define "varint" format Martin Ågren
@ 2020-12-21  7:54   ` Martin Ågren
  2020-12-23  4:40   ` [PATCH 0/2] " Ross Light
  2020-12-29 22:43   ` [PATCH v2] pack-format.txt: document sizes " Martin Ågren
  3 siblings, 0 replies; 8+ messages in thread
From: Martin Ågren @ 2020-12-21  7:54 UTC (permalink / raw)
  To: Ross Light; +Cc: git

We document the delta data as a set of instructions, but forget to
document the two lengths that precede those instructions: the length of
the base object and the length of the object to be reconstructed. Fix
this omission.

Reported-by: Ross Light <ross@zombiezen.com>
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
---
 Documentation/technical/pack-format.txt | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 42198de74c..05889a2e43 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -82,7 +82,10 @@ Ref-delta can also refer to an object outside the pack (i.e. the
 so-called "thin pack"). When stored on disk however, the pack should
 be self contained to avoid cyclic dependency.
 
-The delta data is a sequence of instructions to reconstruct an object
+The delta data starts with the length of the base object and the
+length of the object to be reconstructed. These lengths are
+encoded as varints.  The remainder of
+the delta data is a sequence of instructions to reconstruct the object
 from the base object. If the base object is deltified, it must be
 converted to canonical form first. Each instruction appends more and
 more data to the target object until it's complete. There are two
-- 
2.30.0.rc1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] pack-format.txt: define "varint" format
  2020-12-21  7:54   ` [PATCH 1/2] pack-format.txt: define "varint" format Martin Ågren
@ 2020-12-21 21:40     ` Junio C Hamano
  2020-12-29 22:41       ` Martin Ågren
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2020-12-21 21:40 UTC (permalink / raw)
  To: Martin Ågren; +Cc: Ross Light, git

Martin Ågren <martin.agren@gmail.com> writes:

> We define our varint format pretty much on the fly as we describe a pack
> file entry. In preparation for referring to it in more places in this
> document, define "varint" and refer to it.
>
> Signed-off-by: Martin Ågren <martin.agren@gmail.com>
> ---
>  Documentation/technical/pack-format.txt | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index f96b2e605f..42198de74c 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -55,6 +55,15 @@ Valid object types are:
>  
>  Type 5 is reserved for future expansion. Type 0 is invalid.
>  
> +=== Variable-length integer encoding
> +
> +This document uses "varint" encoding of non-negative integers: From
> +each byte, the seven least significant bits are used to form the
> +resulting integer. As long as the most significant bit is 1, this
> +process continues; the byte with MSB 0 provides the last seven bits.
> +The seven-bit chunks are concatenated. Later values are more
> +significant.
> +

Unfortunately we have two kinds of varint, and the above describes
the older and less efficient variant that is used to encode

 (1) the size of the base object and the object that would result
     by applying a delta (stored at the beginning of each datum that
     represents a deltified object, and read by get_delta_hdr_size())

 (2) the size of the object that is represented in the base
     representation and read by unpack_object_header())

(and nothing else as far as I know).

There is another varint encoding that is slightly more efficient and
it also is used in the packfile format to encode the OFS_DELTA
offset, i.e. the number of bytes to go back in the same packfile
starting at the beginning of OFS_DELTA dats to find the object the
delta applies to.  This newer variant is what is known as "varint"
and used throughout the other parts of the system (see hits from
"git grep -e ncode_varint").

We need to be careful when using a generic "varint" to mean the
older variant as it would confuse readers of OFS_DELTA section.

	... goes and looks ...

The phrase "offset encoding" is used in the document to talk about
OFS_DELTA offset.  It is actually what the rest of the code thinks
is the canonical varint defined in varint.[ch]).

A way to avoid confusion would be to refrain from using "varint" as
the primary way to describe this size field; instead explain it as
the "size encoding", to match "offset encoding" used for OFS_DELTA.

It may also help if we added to the description of "offset encoding"
that it is what other parts of the system consider the canonical
"varint" encoding.

Thanks.

>  === Deltified representation
>  Conceptually there are only four object types: commit, tree, tag and
> @@ -196,10 +205,10 @@ Pack file entry: <+
>  	1-byte size extension bit (MSB)
>  	       type (next 3 bit)
>  	       size0 (lower 4-bit)
> -        n-byte sizeN (as long as MSB is set, each 7-bit)
> -		size0..sizeN form 4+7+7+..+7 bit integer, size0
> -		is the least significant part, and sizeN is the
> -		most significant part.
> +        n-byte size1 (varint encoding; present if MSB is set)
> +        If the MSB is set, the size is size0 + 16*size1, otherwise
> +        it is size0. (Equivalently, the entire packed object header
> +        is a varint encoding of (size/16)*128 + type*16 + size%16.)
>       packed object data:
>          If it is not DELTA, then deflated bytes (the size above
>  		is the size before compression).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/2] pack-format.txt: document lengths at start of delta data
  2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
  2020-12-21  7:54   ` [PATCH 1/2] pack-format.txt: define "varint" format Martin Ågren
  2020-12-21  7:54   ` [PATCH 2/2] pack-format.txt: document lengths at start of delta data Martin Ågren
@ 2020-12-23  4:40   ` Ross Light
  2020-12-29 22:43   ` [PATCH v2] pack-format.txt: document sizes " Martin Ågren
  3 siblings, 0 replies; 8+ messages in thread
From: Ross Light @ 2020-12-23  4:40 UTC (permalink / raw)
  To: Martin Ågren; +Cc: git

Looks good to me, FWIW. Thank you! :)

-Ross

On Sun, Dec 20, 2020 at 11:55 PM Martin Ågren <martin.agren@gmail.com> wrote:
>
> Hi Ross,
>
> Welcome to the list and thanks for reporting your findings. I haven't
> looked at all at the first issue you brought up, about the protocol-v2
> doc.
>
> On Sat, 19 Dec 2020 at 18:10, Ross Light <ross@zombiezen.com> wrote:
> > 2. In the pack-format doc, the Deltified representation section [2]
> > describes the instruction sequence well, but neglects to mention the
> > two size varints [3] at the beginning of such an object.
>
> Would something like this be what you have in mind?
>
> Martin Ågren (2):
>   pack-format.txt: define "varint" format
>   pack-format.txt: document lengths at start of delta data
>
>  Documentation/technical/pack-format.txt | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> --
> 2.30.0.rc1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] pack-format.txt: define "varint" format
  2020-12-21 21:40     ` Junio C Hamano
@ 2020-12-29 22:41       ` Martin Ågren
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Ågren @ 2020-12-29 22:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ross Light, Git Mailing List

On Mon, 21 Dec 2020 at 22:40, Junio C Hamano <gitster@pobox.com> wrote:
>
> Martin Ågren <martin.agren@gmail.com> writes:
>
> > We define our varint format pretty much on the fly as we describe a pack
> > file entry. In preparation for referring to it in more places in this
> > document, define "varint" and refer to it.

> We need to be careful when using a generic "varint" to mean the
> older variant as it would confuse readers of OFS_DELTA section.
>
>         ... goes and looks ...
>
> The phrase "offset encoding" is used in the document to talk about
> OFS_DELTA offset.  It is actually what the rest of the code thinks
> is the canonical varint defined in varint.[ch]).
>
> A way to avoid confusion would be to refrain from using "varint" as
> the primary way to describe this size field; instead explain it as
> the "size encoding", to match "offset encoding" used for OFS_DELTA.

Thank you very much for these comments. I will post a v2 soon, which
will do exactly this: avoid "varint" in favor of "size encoding".

> It may also help if we added to the description of "offset encoding"
> that it is what other parts of the system consider the canonical
> "varint" encoding.

I will leave this as #leftoverbits, though. I'll "only" fix the omission
reported by Ross.

Martin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] pack-format.txt: document sizes at start of delta data
  2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
                     ` (2 preceding siblings ...)
  2020-12-23  4:40   ` [PATCH 0/2] " Ross Light
@ 2020-12-29 22:43   ` Martin Ågren
  3 siblings, 0 replies; 8+ messages in thread
From: Martin Ågren @ 2020-12-29 22:43 UTC (permalink / raw)
  To: git; +Cc: Ross Light, Junio C Hamano

We document the delta data as a set of instructions, but forget to
document the two sizes that precede those instructions: the size of the
base object and the size of the object to be reconstructed. Fix this
omission.

Rather than cramming all the details about the encoding into the running
text, introduce a separate section detailing our "size encoding" and
refer to it.

Reported-by: Ross Light <ross@zombiezen.com>
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
---
 v1: https://lore.kernel.org/git/cover.1608537234.git.martin.agren@gmail.com/

 Documentation/technical/pack-format.txt | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..96d2fc589f 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -55,6 +55,18 @@ Valid object types are:
 
 Type 5 is reserved for future expansion. Type 0 is invalid.
 
+=== Size encoding
+
+This document uses the following "size encoding" of non-negative
+integers: From each byte, the seven least significant bits are
+used to form the resulting integer. As long as the most significant
+bit is 1, this process continues; the byte with MSB 0 provides the
+last seven bits.  The seven-bit chunks are concatenated. Later
+values are more significant.
+
+This size encoding should not be confused with the "offset encoding",
+which is also used in this document.
+
 === Deltified representation
 
 Conceptually there are only four object types: commit, tree, tag and
@@ -73,7 +85,10 @@ Ref-delta can also refer to an object outside the pack (i.e. the
 so-called "thin pack"). When stored on disk however, the pack should
 be self contained to avoid cyclic dependency.
 
-The delta data is a sequence of instructions to reconstruct an object
+The delta data starts with the size of the base object and the
+size of the object to be reconstructed. These sizes are
+encoded using the size encoding from above.  The remainder of
+the delta data is a sequence of instructions to reconstruct the object
 from the base object. If the base object is deltified, it must be
 converted to canonical form first. Each instruction appends more and
 more data to the target object until it's complete. There are two
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-12-29 22:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-19 17:07 Documentation errors for HTTP protocol v2 and packfile Ross Light
2020-12-21  7:54 ` [PATCH 0/2] pack-format.txt: document lengths at start of delta data Martin Ågren
2020-12-21  7:54   ` [PATCH 1/2] pack-format.txt: define "varint" format Martin Ågren
2020-12-21 21:40     ` Junio C Hamano
2020-12-29 22:41       ` Martin Ågren
2020-12-21  7:54   ` [PATCH 2/2] pack-format.txt: document lengths at start of delta data Martin Ågren
2020-12-23  4:40   ` [PATCH 0/2] " Ross Light
2020-12-29 22:43   ` [PATCH v2] pack-format.txt: document sizes " Martin Ågren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).