Request for detailed documentation of git pack protocol

git@vger.kernel.org mailing list mirror (one of many)

* Request for detailed documentation of git pack protocol
@ 2009-05-12 21:29 Jakub Narebski
  2009-05-12 23:34 ` Shawn O. Pearce
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-05-12 21:29 UTC (permalink / raw)
  To: git

We now have a proliferation of different (re)implementations of git:
JGit in Java, Dulwich in Python, Grit in Ruby; and there are others
planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
in Objective-C (for iPhone IIRC).  At some point they will reach
the stage (or have reached it already) of implementing git-daemon...
but currently the documentation of the git protocol is lacking.

This can lead, as you can read in a recent post on the git mailing list,
to getting details wrong (like Dulwich not using the full SHA-1 where
it should, so that ordinary git clients fail to fetch from it),
or to missing implementation best practices (like JGit's latest issue
with deadlocking in the multi_ack extension).

The current documentation of the git protocol is very sparse; the docs
in Documentation/technical/pack-protocol.txt offer only a sketch of the
exchange.  You can find more, including the pkt-line format, the way
sideband is multiplexed, and how capabilities are negotiated between
server and client, in the design document for the "smart" HTTP server,
for example in
  Subject: Re: More on git over HTTP POST
  Message-ID: <20080803025602.GB27465@spearce.org>
  URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196
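
To make those terms concrete, here is a minimal Python sketch of
pkt-line framing and side-band demultiplexing as the pack protocol is
commonly described: each pkt-line starts with four hex digits giving
the total length (prefix included), "0000" is a flush-pkt, and with
side-band enabled the first payload byte selects the band (1 = pack
data, 2 = progress, 3 = fatal error).  The function names are
hypothetical and not taken from any git implementation, and real
implementations cap the packet size lower than the four-digit maximum
used here.

```python
FLUSH_PKT = b"0000"  # a length of zero marks a flush-pkt


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line: 4 hex digits of total length, then data."""
    length = len(payload) + 4  # the length counts its own 4-byte prefix
    if length > 0xFFFF:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def read_pkt_lines(data: bytes):
    """Yield each payload found in data; None stands for a flush-pkt."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            yield None
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length")
        yield data[pos + 4:pos + length]
        pos += length


def demux_sideband(payloads):
    """Split side-band payloads into (pack_data, progress_text)."""
    pack, progress = bytearray(), bytearray()
    for payload in payloads:
        if payload is None:  # flush-pkt ends the stream
            break
        band, body = payload[0], payload[1:]
        if band == 1:
            pack += body
        elif band == 2:
            progress += body
        elif band == 3:
            raise RuntimeError(body.decode("utf-8", errors="replace"))
        else:
            raise ValueError("unknown side-band channel %d" % band)
    return bytes(pack), bytes(progress)
```

For example, pkt_line(b"hello\n") gives b"000ahello\n": six bytes of
payload plus the four-byte prefix is ten, written as "000a".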

It would be really nice, I think, to have an RFC for the git pack
protocol.  It would help avoid incompatibilities between different
clients and servers.  If the document contained the expected behaviour
of client and server and Best Current Practices, it would help avoid
pitfalls when implementing git-daemon in other implementations.

Perhaps in the future it could be sent for inclusion as an "official"
RFC to the IETF?  (Dreaming big).

Unfortunately I don't know enough about this area of code to write
one myself.

-- 
Jakub Narebski
ShadeHawk or jnareb on #git
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-05-12 21:29 Request for detailed documentation of git pack protocol Jakub Narebski
@ 2009-05-12 23:34 ` Shawn O. Pearce
  2009-05-14  8:24   ` Jakub Narebski
  2009-05-14 13:55   ` Scott Chacon
  0 siblings, 2 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-05-12 23:34 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> We have now proliferation of different (re)implementations of git:
> JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
> planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
> in Objective-C (for iPhone IIRC).  At some time they would reach
> the point (or reached it already) of implementing git-daemon...
> but currently the documentation of git protocol is lacking.

Well, let's see...

JGit - me and Robin, here on git ML.  JGit is the closest
reimplementation effort to the canonical C implementation.
JGit runs in production servers for many folks, e.g. quite
a few Google engineers use the JGit server every day.  It's
our main git daemon.

Grit - GitHub folks.  They know where to find us.  And their
business is Git.  If Grit isn't right, they'll make it right,
or possibly suffer a loss of customers.  I'm fairly certain
that GitHub runs Grit in production.

ObjectiveGit - Scott Chacon, again, a GitHub folk.  Though he has
expressed interest in moving to JGit or libgit2 where/when possible.

Dulwich - off in its own world and not even trying to match basic
protocol rules by just watching what happens when you telnet to a
git port.  No clue how that's going to fare.

git# - We'll see.  Mono GSoC Git projects have a really bad
reputation of ignoring the existing git knowledge and hoping
they can invent the wheel on their own.

> This can lead, as you can read from recent post on git mailing, to
> implementing details wrong (like Dulwich not using full SHA-1 where
> it should, leading to ordinary git clients to failing to fetch from it),
> or fail at best practices of implementation (like JGit last issue with
> deadlocking for multi_ack extension).

Dulwich is just busted.

No existing developers knew that the fetch-pack/upload-pack protocol
has this required implicit buffering consideration until JGit
deadlocked over it.  But even then I'm still not 100% sure this
is true, or if it is just an artifact of the JGit upload-pack side
implementation being partially wrong.

> The current documentation of git protocol is very sparse; the docs
> in Documentation/technical/pack-protocol.txt offer only a sketch of
> exchange.  You can find more, including pkt-line format, a way sideband
> is multiplexed, and how capabilities are negotiated between server and
> client in design document for "smart" HTTP server, for example in
>   Subject: Re: More on git over HTTP POST
>   Message-ID: <20080803025602.GB27465@spearce.org>
>   URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196

Seriously?  Don't link to that.  It's a horrible version of the smart
HTTP RFC, and worse, it doesn't describe what you say it describes.

> It would be really nice, I think, to have RFC for git pack protocol.
> And it would help avoid incompatibilities between different clients
> and servers.  If the document would contain expected behaviour of
> client and server and Best Current Practices it would help avoid
> pitfals when implementing git-daemon in other implementation.

Yea, it would be nice.  But find me someone who knows the protocol
and who has the time to document the #!@* thing.  Maybe I'll try
to work on this myself, but I'm strapped for time, especially over
the next two-to-three months.

And let's not even start to mention Dulwich not completing a thin
pack before storing it on disk.  Those sorts of on disk things
matter to other more popular Git implementations (c git, jgit).

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-05-12 23:34 ` Shawn O. Pearce
@ 2009-05-14  8:24   ` Jakub Narebski
  2009-05-14 14:57     ` Shawn O. Pearce
  2009-05-14 18:13     ` Nicolas Pitre
  2009-05-14 13:55   ` Scott Chacon
  1 sibling, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-05-14  8:24 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On Wed, 13 May 2009 Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

> > We have now proliferation of different (re)implementations of git:
> > JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
> > planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
> > in Objective-C (for iPhone IIRC).  At some time they would reach
> > the point (or reached it already) of implementing git-daemon...
> > but currently the documentation of git protocol is lacking.
> 
> Well, lets see...
> 
> JGit - me and Robin, here on git ML.  JGit is the closest
> reimplementation effort to the canonical C implementation.
> JGit runs in production servers for many folks, e.g. quite
> a few Google engineers use the JGit server every day.  Its
> our main git daemon.
> 
> Grit - GitHub folks.  They know where to find us.  And their
> business is Git.  If Grit isn't right, they'll make it right,
> or possibly suffer a loss of customers.  I'm fairly certain
> that GitHub runs Grit in production.
> 
> ObjectGit - Scott Chacon, again, a GitHub folk.  Though he has
> expressed interest in moving to JGit or libgit2 where/when possible.
> 
> Dulwich - off in its own world and not even trying to match basic
> protocol rules by just watching what happens when you telnet to a
> git port.  No clue how that's going to fair.
> 
> git# - We'll see.  Mono GSoC Git projects have a really bad
> reputation of ignoring the existing git knowledge and hoping
> they can invent the wheel on their own.

So you are saying that even if detailed pack protocol specification
isn't written down (Documentation/technical/pack-protocol.txt is more
of a sketch than reference documentation), the knowledge is there,
and it is not that hard to get (just ask on git mailing list), isn't it?
 
> > This can lead, as you can read from recent post on git mailing, to
> > implementing details wrong (like Dulwich not using full SHA-1 where
> > it should, leading to ordinary git clients to failing to fetch from it),
> > or fail at best practices of implementation (like JGit last issue with
> > deadlocking for multi_ack extension).
> 
> Dulwich is just busted.

That was my impression too. Those details it got wrong aren't so
obscure and hard to get right...

> 
> No existing developers knew that the fetch-pack/upload-pack protocol
> has this required implicit buffering consideration until JGit
> deadlocked over it.  But even then I'm still not 100% sure this
> is true, or if it is just an artifact of the JGit upload-pack side
> implementation being partially wrong.

Well... I guess a section on Best Current Practices for avoiding
deadlock would come too late to prevent this issue in JGit, but
it would be there for future implementations.

> 
> > The current documentation of git protocol is very sparse; the docs
> > in Documentation/technical/pack-protocol.txt offer only a sketch of
> > exchange.  You can find more, including pkt-line format, a way sideband
> > is multiplexed, and how capabilities are negotiated between server and
> > client in design document for "smart" HTTP server, for example in
> >   Subject: Re: More on git over HTTP POST
> >   Message-ID: <20080803025602.GB27465@spearce.org>
> >   URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196
> 
> Seriously?  Don't link to that.  Its a horrible version of the smart
> HTTP RFC, and worse, it doesn't describe what you say it describes.

Ooops, I am sorry. This was my bookmark into this thread (which is very
interesting, and contains a host of information about the pack protocol
that was otherwise unknown to me), but the choice of post within the
thread was quite arbitrary (a random post at which I decided that the
thread was interesting enough to bookmark, and too long to save every
interesting post individually).

>  
> > It would be really nice, I think, to have RFC for git pack protocol.
> > And it would help avoid incompatibilities between different clients
> > and servers.  If the document would contain expected behaviour of
> > client and server and Best Current Practices it would help avoid
> > pitfals when implementing git-daemon in other implementation.
> 
> Yea, it would be nice.  But find me someone who knows the protocol
> and who has the time to document the #!@* thing.  Maybe I'll try
> to work on this myself, but I'm strapped for time, especially over
> the next two-to-three months.

I was afraid of this: that the people who know the pack protocol well
enough to be able to write it down are otherwise busy. But we got
detailed / updated packfile and pack index format descriptions some
time ago (thanks to all who contributed to them!). I hope that the same
will happen with the pack _protocol_ description.

I was hoping for a document in RFC format; dreaming about having it
submitted to the IETF as (at least) an unofficial RFC like the Atom
Publishing Protocol (or is it a proper RFC these days?), and then
accepted like the HTTP protocol. But I understand that it is not the
same situation; there wouldn't (and perhaps shouldn't) be too many
independent git-daemon implementations...

> 
> And lets not even start to mention Dulwich not completing a thin
> pack before storing it on disk.  Those sorts of on disk things
> matter to other more popular Git implementations (c git, jgit).

Ugh! Errr... aren't thin packs sent only if the other side has the
capability for it? What is Dulwich doing, then, announcing such a
capability when it does not support it correctly...
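
As an illustration of how a capability such as thin-pack is announced,
here is a small Python sketch that parses the first ref line of an
advertisement, where the capability list follows a NUL byte after the
ref name.  The function name and the sample values are hypothetical;
the check for a full 40-hex-digit SHA-1 mirrors the Dulwich problem
mentioned earlier in the thread.

```python
def parse_first_advert(payload: bytes):
    """Split the first advertised ref line into (sha, refname, capabilities).

    Expected payload layout (pkt-line length prefix already removed):
        <40 hex digits> SP <refname> NUL <cap> SP <cap> ... LF
    """
    line = payload.rstrip(b"\n")
    ref_part, _, cap_part = line.partition(b"\0")
    sha, _, refname = ref_part.partition(b" ")
    if len(sha) != 40:
        raise ValueError("expected a full 40-hex-digit SHA-1")
    int(sha.decode("ascii"), 16)  # raises ValueError if not hexadecimal
    capabilities = set(cap_part.decode("ascii").split())
    return sha.decode("ascii"), refname.decode("utf-8"), capabilities


# Hypothetical advertisement line, for illustration only.
sample = b"ab" * 20 + b" HEAD\0multi_ack thin-pack side-band ofs-delta\n"
sha, refname, caps = parse_first_advert(sample)
supports_thin_pack = "thin-pack" in caps
```

An implementation that cannot handle thin packs correctly would simply
leave "thin-pack" out of the list it announces or requests.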

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-05-12 23:34 ` Shawn O. Pearce
  2009-05-14  8:24   ` Jakub Narebski
@ 2009-05-14 13:55   ` Scott Chacon
  2009-05-14 14:44     ` Shawn O. Pearce
                       ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Scott Chacon @ 2009-05-14 13:55 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, git

Hey,

On Tue, May 12, 2009 at 4:34 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> We have now proliferation of different (re)implementations of git:
>> JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
>> planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
>> in Objective-C (for iPhone IIRC).  At some time they would reach
>> the point (or reached it already) of implementing git-daemon...
>> but currently the documentation of git protocol is lacking.
>
> Well, lets see...
>
> JGit - me and Robin, here on git ML.  JGit is the closest
> reimplementation effort to the canonical C implementation.
> JGit runs in production servers for many folks, e.g. quite
> a few Google engineers use the JGit server every day.  Its
> our main git daemon.
>
> Grit - GitHub folks.  They know where to find us.  And their
> business is Git.  If Grit isn't right, they'll make it right,
> or possibly suffer a loss of customers.  I'm fairly certain
> that GitHub runs Grit in production.
>
> ObjectGit - Scott Chacon, again, a GitHub folk.  Though he has
> expressed interest in moving to JGit or libgit2 where/when possible.

Actually, all of this work has moved to CocoaGit, which is much
farther along than ObjectiveGit ever was.  Although I would love to
use libgit2 when it gets that far, this was for Mac/iPhone native
client work which JGit would not be helpful for.

>
> Dulwich - off in its own world and not even trying to match basic
> protocol rules by just watching what happens when you telnet to a
> git port.  No clue how that's going to fair.

Oddly enough, I'm in Dulwich land too. I've been working on a
Mercurial plugin that will provide a two way lossless bridge for Hg to
be able to push and pull to/from a Git server.  I've fixed some of the
issues I've found with the client side work and both pushes and pulls
will work now. (I did turn off 'thin-pack' capability announcement,
since you're correct that it simply was not properly implemented).

If we're going to round out the list, I've also worked on an
ActionScript partial implementation, but it never got to the packfile
level, and some of the Erlang guys are interested in writing at least
a partial Erlang implementation too, which I may get involved in at
some point.

It seems like if anyone would do what you're asking, it's probably me.
In the next few weeks, I'll do what I can to fix up the remainder of the
Dulwich code as part of my hg-git work.  I'm also working with Shawn
on the Apress book, where I was going to try to document much of this
information, perhaps I could try writing an RFC as an appendix or
something - then that will force him to spend time correcting
everything I got wrong :)  At least that might be a good starting
point - I'm unfamiliar with the actual RFC process, so I'll research
that a bit today.  I don't mind writing it, I think it would be really
really useful to have, I just am unfamiliar with the process.

Thanks,
Scott


>
> git# - We'll see.  Mono GSoC Git projects have a really bad
> reputation of ignoring the existing git knowledge and hoping
> they can invent the wheel on their own.
>
>> This can lead, as you can read from recent post on git mailing, to
>> implementing details wrong (like Dulwich not using full SHA-1 where
>> it should, leading to ordinary git clients to failing to fetch from it),
>> or fail at best practices of implementation (like JGit last issue with
>> deadlocking for multi_ack extension).
>
> Dulwich is just busted.
>
> No existing developers knew that the fetch-pack/upload-pack protocol
> has this required implicit buffering consideration until JGit
> deadlocked over it.  But even then I'm still not 100% sure this
> is true, or if it is just an artifact of the JGit upload-pack side
> implementation being partially wrong.
>
>> The current documentation of git protocol is very sparse; the docs
>> in Documentation/technical/pack-protocol.txt offer only a sketch of
>> exchange.  You can find more, including pkt-line format, a way sideband
>> is multiplexed, and how capabilities are negotiated between server and
>> client in design document for "smart" HTTP server, for example in
>>   Subject: Re: More on git over HTTP POST
>>   Message-ID: <20080803025602.GB27465@spearce.org>
>>   URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196
>
> Seriously?  Don't link to that.  Its a horrible version of the smart
> HTTP RFC, and worse, it doesn't describe what you say it describes.
>
>> It would be really nice, I think, to have RFC for git pack protocol.
>> And it would help avoid incompatibilities between different clients
>> and servers.  If the document would contain expected behaviour of
>> client and server and Best Current Practices it would help avoid
>> pitfals when implementing git-daemon in other implementation.
>
> Yea, it would be nice.  But find me someone who knows the protocol
> and who has the time to document the #!@* thing.  Maybe I'll try
> to work on this myself, but I'm strapped for time, especially over
> the next two-to-three months.
>
> And lets not even start to mention Dulwich not completing a thin
> pack before storing it on disk.  Those sorts of on disk things
> matter to other more popular Git implementations (c git, jgit).
>
> --
> Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 13:55   ` Scott Chacon
@ 2009-05-14 14:44     ` Shawn O. Pearce
  2009-05-14 15:01     ` Jakub Narebski
  2009-06-02 21:39     ` Jakub Narebski
  2 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-05-14 14:44 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Jakub Narebski, git

Scott Chacon <schacon@gmail.com> wrote:
> On Tue, May 12, 2009 at 4:34 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> >
> > Dulwich - off in its own world and not even trying to match basic
> > protocol rules by just watching what happens when you telnet to a
> > git port.  No clue how that's going to fair.
> 
> Oddly enough, I'm in Dulwich land too. I've been working on a
> Mercurial plugin that will provide a two way lossless bridge for Hg to
> be able to push and pull to/from a Git server.

How are you going to represent an n-way merge in Git in Hg?

> I've fixed some of the
> issues I've found with the client side work and both pushes and pulls
> will work now. (I did turn off 'thin-pack' capability announcement,
> since you're correct that it simply was not properly implemented).

I'm half interested in Dulwich for "repo"[1] but I need the #@!*
library to be stable and correctly implement Git conventions.  :-)

[1] http://android.git.kernel.org/?p=tools/repo.git;a=summary

> If we're going to round out the list, I've also worked on an
> ActionScript partial implementation, but it never got to the packfile
> level,

ActionScript?  WTF?  As in that thing that embeds in Flash?

> and some of the Erlang guys are interested in writing at least
> a partial Erlang implementation too, which I may get involved in at
> some point.

I heard they moved their official repository to Git.  Their VM as
a network server is just plain awesome.  I half wish I was using
that for the Gerrit backend rather than Apache MINA.  Erlang is
rock-solid and doesn't have major threading bugs in its core.

Oh heck, I just found the documentation for the Erlang sshd.  Nice.
Sadly it lacks public key support it seems, and a solid Git library
with server protocols fully implemented, but, eh, its management
is way better than Java.

> It seems like if anyone would do what you're asking, it's probably me.
> In the next few weeks, I do what I can to fix up the remainder of the
> Dulwich code as part of my hg-git work.  I'm also working with Shawn
> on the Apress book, where I was going to try to document much of this
> information, perhaps I could try writing an RFC as an appendix or
> something - then that will force him to spend time correcting
> everything I got wrong :)

Hah!

Even if you don't write it for the book, I'll certainly try to
give a technical review over the content.  That goes for anyone
who takes the time to write the protocol out, and has a fair clue
as to how it currently works.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-05-14  8:24   ` Jakub Narebski
@ 2009-05-14 14:57     ` Shawn O. Pearce
  2009-05-14 15:02       ` Andreas Ericsson
  2009-05-15 16:51       ` Clemens Buchacher
  2009-05-14 18:13     ` Nicolas Pitre
  1 sibling, 2 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-05-14 14:57 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> On Wed, 13 May 2009 Shawn O. Pearce wrote:
> > Jakub Narebski <jnareb@gmail.com> wrote:
> > > We have now proliferation of different (re)implementations of git:
> > > JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
> > > planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
> > > in Objective-C (for iPhone IIRC).  At some time they would reach
> > > the point (or reached it already) of implementing git-daemon...
> > > but currently the documentation of git protocol is lacking.
> > 
> > Well, lets see...
> 
> So you are saying that even if detailed pack protocol specification
> isn't written down (Documentation/technical/pack-protocol.txt is more
> of a sketch than reference documentation), the knowledge is there,
> and it is not that hard to get (just ask on git mailing list), isn't it?

Yup.  We've tried to keep JGit right here on this list just to keep
the knowledge concentrated here, so git@vger is the place anyone
can ask questions, and get good answers.

> > No existing developers knew that the fetch-pack/upload-pack protocol
> > has this required implicit buffering consideration until JGit
> > deadlocked over it.  But even then I'm still not 100% sure this
> > is true, or if it is just an artifact of the JGit upload-pack side
> > implementation being partially wrong.
> 
> Well... I guess that section on Best Current Practices to avoid 
> deadlocking would not be there to avoid this issue in JGit, but
> would be added for the future later.

So I'm actually right (and Junio confirmed it off list), the
fetch-pack/upload-pack protocol with multi_ack enabled requires
a buffer on the client side of at least 2952 bytes which can be
drained to the server after the client enters its receive phase.
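
The message does not say where 2952 comes from.  One decomposition that
reproduces the figure, offered purely as an assumption, is two rounds
of 32 "have" lines in flight, 46 payload bytes per line ("have ", a
40-digit SHA-1 and a newline, not counting the pkt-line length prefix),
plus one 4-byte flush-pkt per round.  The constant names below are
hypothetical.

```python
SHA1_HEX_LEN = 40
# Payload of one "have" line: b"have " + 40 hex digits + b"\n" = 46 bytes.
HAVE_PAYLOAD_LEN = len(b"have ") + SHA1_HEX_LEN + len(b"\n")
HAVES_PER_ROUND = 32            # assumption: haves sent per negotiation round
ROUNDS_IN_FLIGHT = 2            # assumption: rounds sent before reading ACKs
FLUSH_PKT_LEN = len(b"0000")    # one flush-pkt closes each round


def min_client_buffer() -> int:
    """Bytes the client may write before it starts reading (assumed model)."""
    have_bytes = ROUNDS_IN_FLIGHT * HAVES_PER_ROUND * HAVE_PAYLOAD_LEN
    flush_bytes = ROUNDS_IN_FLIGHT * FLUSH_PKT_LEN
    return have_bytes + flush_bytes


print(min_client_buffer())
```

Under these assumptions the sum is 2 * 32 * 46 + 8 = 2952, which matches
the number quoted above.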

In practical implementations like git:// TCP and SSH, there is
enough inherent buffering in the TX side of the client that this
isn't an issue.

In loopback mode for local file URIs, it may become an issue.  C Git
is just getting lucky by the pipe size I think.  Though I thought I
read somewhere yesterday pipe FIFOs in Linux were being allocated
at 512 bytes, not one system page.  Of course other systems could
allocate whatever size they want too, and may allocate something
below the 2952 minimum, and we'd most likely see a deadlock on them.

> > > The current documentation of git protocol is very sparse; the docs
> > > in Documentation/technical/pack-protocol.txt offer only a sketch of
> > > exchange.  You can find more, including pkt-line format, a way sideband
> > > is multiplexed, and how capabilities are negotiated between server and
> > > client in design document for "smart" HTTP server, for example in
> > >   Subject: Re: More on git over HTTP POST
> > >   Message-ID: <20080803025602.GB27465@spearce.org>
> > >   URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196
> > 
> > Seriously?  Don't link to that.  Its a horrible version of the smart
> > HTTP RFC, and worse, it doesn't describe what you say it describes.
> 
> Ooops, I am sorry. This was my bookmark into this thread (which is very
> interesting, and contain host of otherwise unknown to me information
> about pack protocol), but the post in this thread was quite arbitrary
> (a random post where I decided that this thread is interesting enough
> to bookmark, and long enough to not want to save all interesting posts).

GMane is dead right now, otherwise I'd try to find the link you
were more likely talking about.  I think you were right, there may
have been a much better post in that particular thread.
 
> > And lets not even start to mention Dulwich not completing a thin
> > pack before storing it on disk.  Those sorts of on disk things
> > matter to other more popular Git implementations (c git, jgit).
> 
> Ugh! Errr... aren't thin packs send only if other side has the
> capability for it?

Yes.

> What is then Dulwich doing announcing such 
> capability when not supporting it correctly...

Because the implementation is just busted.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 13:55   ` Scott Chacon
  2009-05-14 14:44     ` Shawn O. Pearce
@ 2009-05-14 15:01     ` Jakub Narebski
  2009-05-15  0:58       ` A Large Angry SCM
  2009-06-02 21:39     ` Jakub Narebski
  2 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-05-14 15:01 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, git

On Thu, 14 May 2009, Scott Chacon wrote:
> On Tue, May 12, 2009 at 4:34 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Jakub Narebski <jnareb@gmail.com> wrote:

>>> We have now proliferation of different (re)implementations of git:
>>> JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
>>> planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
>>> in Objective-C (for iPhone IIRC).  At some time they would reach
>>> the point (or reached it already) of implementing git-daemon...
>>> but currently the documentation of git protocol is lacking.
>>
>> Well, lets see...

[...]
>> ObjectGit - Scott Chacon, again, a GitHub folk.  Though he has
>> expressed interest in moving to JGit or libgit2 where/when possible.
> 
> Actually, all of this work has moved to CocoaGit, which is much
> farther along than ObjectiveGit ever was.  Although I would love to
> use libgit2 when it gets that far, this was for Mac/iPhone native
> client work which JGit would not be helpful for.

Could you give the URL of the homepage or announcement (if one exists),
and of the git repository / web interface page for CocoaGit? It isn't
listed on http://git.or.cz/gitwiki/InterfacesFrontendsAndTools, which
tries to be a clearinghouse listing all significant git tools... but
which is probably hopelessly out of date now (unfortunately).

> 
>>
>> Dulwich - off in its own world and not even trying to match basic
>> protocol rules by just watching what happens when you telnet to a
>> git port.  No clue how that's going to fair.
> 
> Oddly enough, I'm in Dulwich land too. I've been working on a
> Mercurial plugin that will provide a two way lossless bridge for Hg to
> be able to push and pull to/from a Git server.

I'm assuming here that the bridge has to remember somehow the
info which cannot be represented in the other SCM (like octopus merges,
or tag objects, or tagging non-commits in Git; like, I guess, 'rename
tracking' information in Mercurial) for it to be truly two-way...

> I've fixed some of the 
> issues I've found with the client side work and both pushes and pulls
> will work now. (I did turn off 'thin-pack' capability announcement,
> since you're correct that it simply was not properly implemented).
> 
> If we're going to round out the list, I've also worked on an
> ActionScript partial implementation, but it never got to the packfile
> level, and some of the Erlang guys are interested in writing at least
> a partial Erlang implementation too, which I may get involved in at
> some point.

Well, with yet another implementation it is even more important to have
good technical documentation of file formats and network protocols.

BTW. if I remember correctly there were some hobbyist one-person 
(single-developer) implementations of git in Haskell and in Lisp
or Scheme...

> 
> It seems like if anyone would do what you're asking, it's probably me.
> In the next few weeks, I do what I can to fix up the remainder of the
> Dulwich code as part of my hg-git work.  I'm also working with Shawn
> on the Apress book, where I was going to try to document much of this
> information, perhaps I could try writing an RFC as an appendix or
> something - then that will force him to spend time correcting
> everything I got wrong :)  At least that might be a good starting
> point - I'm unfamiliar with the actual RFC process, so I'll research
> that a bit today.  I don't mind writing it, I think it would be really
> really useful to have, I just am unfamiliar with the process.

I don't think the RFC _process_ is something to worry about; in the
future perhaps (just as the Atom Publishing Protocol was submitted to
the IETF).  I was thinking about the _format_ used in RFCs (BNF-like
specification, specific semantics for 'MUST' etc. as in RFC 2119).
Although any format (more or less formal) would be better than none.

Thank you very much for your offer!

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 14:57     ` Shawn O. Pearce
@ 2009-05-14 15:02       ` Andreas Ericsson
  2009-05-15 20:29         ` Linus Torvalds
  2009-05-15 16:51       ` Clemens Buchacher
  1 sibling, 1 reply; 66+ messages in thread
From: Andreas Ericsson @ 2009-05-14 15:02 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, git

Shawn O. Pearce wrote:
> 
> In loopback mode for local file URIs, it may become an issue.  C Git
> is just getting lucky by the pipe size I think.  Though I thought I
> read somewhere yesterday pipe FIFOs in Linux were being allocated
> at 512 bytes, not one system page.  Of course other systems could
> allocate whatever size they want too, and may allocate something
> below the 2952 minimum, and we'd most likely see a deadlock on them.
> 

Linux allocates one page (4096 bytes) for a FIFO. 512 is the maximum
size guaranteed by POSIX to result in an atomic write.
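
Both numbers can be checked on a given machine.  The following Python
sketch (Unix only, hypothetical function name) fills a non-blocking
pipe until the kernel refuses more data and reports the total next to
the platform's PIPE_BUF.  The results depend on the kernel and its
version, so treat them as a measurement for one system rather than a
protocol guarantee.

```python
import os
import select


def pipe_capacity() -> int:
    """Return how many bytes a fresh pipe accepts before a write would block."""
    read_fd, write_fd = os.pipe()
    try:
        os.set_blocking(write_fd, False)
        chunk = b"\0" * 512  # no larger than the POSIX minimum for PIPE_BUF
        total = 0
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The pipe is full: a blocking writer would stall here.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


print("pipe capacity:", pipe_capacity(), "bytes")
print("PIPE_BUF (atomic write limit):", select.PIPE_BUF, "bytes")
```

If the reported capacity were below the 2952 bytes discussed above, a
client writing that much before reading could block on such a pipe.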

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: Request for detailed documentation of git pack protocol
  2009-05-14  8:24   ` Jakub Narebski
  2009-05-14 14:57     ` Shawn O. Pearce
@ 2009-05-14 18:13     ` Nicolas Pitre
  2009-05-14 20:27       ` Jakub Narebski
  1 sibling, 1 reply; 66+ messages in thread
From: Nicolas Pitre @ 2009-05-14 18:13 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, git

On Thu, 14 May 2009, Jakub Narebski wrote:

> I was afraid of this: that the people who know pack protocol good
> enough to be able to write it down are otherwise busy. But we get
> detailed / updated packfile and pack index format descriptions some
> time ago (thanks all that contributed to it!). I hope that the same
> would happen with pack _protocol_ description.

If someone with the wish for such a document volunteers to work on it 
then I'm sure people with fuller knowledge will review and comment on 
the result as appropriate.

> I was hoping of document in RFC format; dreaming about having it
> submitted to IETF as (at least) unofficial RFC like Atom Publication
> Protocol (or is it proper RFC these days?), and then accepted like
> HTTP protocol.

I think we'd have to move to a new version of the protocol for that.  
The current protocol, even if it does the job, is not particularly 
elegant.

> > And lets not even start to mention Dulwich not completing a thin
> > pack before storing it on disk.  Those sorts of on disk things
> > matter to other more popular Git implementations (c git, jgit).
> 
> Ugh! Errr... aren't thin packs send only if other side has the
> capability for it? What is then Dulwich doing announcing such 
> capability when not supporting it correctly...

They probably don't bother because in theory you don't need to complete 
a thin pack for the system to still work.  We require that a pack 
never contain a delta whose base object is in a different pack because 
that makes for better performance when accessing the pack and when 
repacking.  Not doing so also makes pack validation (think verify-pack) 
impossible without the dependent objects, and makes incremental 
repacking much, much harder with respect to preventing delta cycles.

Those validation tools from C git (fsck, verify-pack, etc.) should be 
quite useful for people wishing to implement their own git.

Nicolas


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 18:13     ` Nicolas Pitre
@ 2009-05-14 20:27       ` Jakub Narebski
  0 siblings, 0 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-05-14 20:27 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn O. Pearce, git, Scott Chacon

On Thu, 14 May 2009, Nicolas Pitre wrote:
> On Thu, 14 May 2009, Jakub Narebski wrote:
> 
> > I was afraid of this: that the people who know pack protocol good
> > enough to be able to write it down are otherwise busy. But we get
> > detailed / updated packfile and pack index format descriptions some
> > time ago (thanks all that contributed to it!). I hope that the same
> > would happen with pack _protocol_ description.
> 
> If someone with the wish for such a document volunteers to work on it 
> then I'm sure people with fuller knowledge will review and comment on 
> the result as appropriate.

Well, but still somebody with time and at least some expertise in
the area would be required to start it.

> > I was hoping of document in RFC format; dreaming about having it
> > submitted to IETF as (at least) unofficial RFC like Atom Publication
> > Protocol (or is it proper RFC these days?), and then accepted like
> > HTTP protocol.
> 
> I think we'd have to move to a new version of the protocol for that.  
> The current protocol, even if it does the job, is not particularly 
> elegant.

Are all protocols defined by RFCs (including proposed / informational
RFCs) elegant? Well... perhaps they are. The quality of IETF standards is
way higher than, say, ECMA :-)

But I accept that having it on the list of 'official' RFCs, even
as an "experimental" RFC, is just a dream. Nevertheless I think that 
following the RFC format, which includes using a common set of terms such 
as "MUST" and "NOT RECOMMENDED" (as defined by RFC 2119) and Augmented 
Backus–Naur Form (ABNF) (as defined by RFC 5234) as a metalanguage,
would be a good idea for technical / protocol documentation.

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 15:01     ` Jakub Narebski
@ 2009-05-15  0:58       ` A Large Angry SCM
  2009-05-15 19:05         ` Ealdwulf Wuffinga
  0 siblings, 1 reply; 66+ messages in thread
From: A Large Angry SCM @ 2009-05-15  0:58 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, Shawn O. Pearce, git

Jakub Narebski wrote:
> I don't think RFC _process_ is something to worry about; in the future
> perhaps (just like Atom Publishing protocol was submitted to IETF).
> I was thinking about _format_ used in RFC (BNF-like specification,
> specific semantic for 'MUST' etc. like in RFC2119). Although any format
> (more or less formal) would be better that none.

Standardese, the peculiar dialect and formalism employed by RFC authors, 
is not difficult to master. The difficult part is writing the prose 
that's an _unambiguous_ description of the protocol you're attempting to 
document. There's even a tool, xml2rfc, that will do the formatting for you.


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 14:57     ` Shawn O. Pearce
  2009-05-14 15:02       ` Andreas Ericsson
@ 2009-05-15 16:51       ` Clemens Buchacher
  1 sibling, 0 replies; 66+ messages in thread
From: Clemens Buchacher @ 2009-05-15 16:51 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, git

On Thu, May 14, 2009 at 07:57:24AM -0700, Shawn O. Pearce wrote:
> > > Jakub Narebski <jnareb@gmail.com> wrote:
> > > > The current documentation of git protocol is very sparse; the docs
> > > > in Documentation/technical/pack-protocol.txt offer only a sketch of
> > > > exchange.  You can find more, including pkt-line format, a way sideband
> > > > is multiplexed, and how capabilities are negotiated between server and
> > > > client in design document for "smart" HTTP server, for example in
> > > >   Subject: Re: More on git over HTTP POST
> > > >   Message-ID: <20080803025602.GB27465@spearce.org>
> > > >   URL: http://thread.gmane.org/gmane.comp.version-control.git/91104/focus=91196
[...]
> GMane is dead right now, otherwise I'd try to find the link you
> were more likely talking about.  I think you were right, there may
> have been a much better post in that particular thread.

I believe this is the most recent version:

	http://article.gmane.org/gmane.comp.version-control.git/94313


* Re: Request for detailed documentation of git pack protocol
  2009-05-15  0:58       ` A Large Angry SCM
@ 2009-05-15 19:05         ` Ealdwulf Wuffinga
  0 siblings, 0 replies; 66+ messages in thread
From: Ealdwulf Wuffinga @ 2009-05-15 19:05 UTC (permalink / raw)
  To: gitzilla; +Cc: Jakub Narebski, Scott Chacon, Shawn O. Pearce, git

On Fri, May 15, 2009 at 1:58 AM, A Large Angry SCM <gitzilla@gmail.com> wrote:

> Standardese, the peculiar dialect and formalism employed by RFC authors, is
> not difficult to master. The difficult part is writing the prose that's an
> _unambiguous_ description of the protocol you're attempting to document.
> There's even a tool, xml2rfc, that will do the formatting for you.

There are also tools for writing unambiguous prose:
http://en.wikipedia.org/wiki/Controlled_natural_language.
They look potentially tedious to use, but I haven't actually tried.

Ealdwulf


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 15:02       ` Andreas Ericsson
@ 2009-05-15 20:29         ` Linus Torvalds
  0 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2009-05-15 20:29 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Shawn O. Pearce, Jakub Narebski, git



On Thu, 14 May 2009, Andreas Ericsson wrote:

> Shawn O. Pearce wrote:
> > 
> > In loopback mode for local file URIs, it may become an issue.  C Git
> > is just getting lucky by the pipe size I think.  Though I thought I
> > read somewhere yesterday pipe FIFOs in Linux were being allocated
> > at 512 bytes, not one system page.  Of course other systems could
> > allocate whatever size they want too, and may allocate something
> > below the 2952 minimum, and we'd most likely see a deadlock on them.
> > 
> 
> Linux allocates one page 4096 bytes for a FIFO. 512 is the maximum
> size guaranteed by POSIX to result in an atomic write.

Actually, modern Linux will allocate up to 16 pages (PIPE_BUFFERS), but 
they may not all be filled - we coalesce small writes only if the end 
result fits entirely into a page. So the maximum buffer is 16*PAGE_SIZE, 
and the minimum buffer space (assuming regular "write()" system calls) is 
something like 16*(PAGE_SIZE/2+1).

But yeah, POSIX allows for much smaller buffers.
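
To put rough numbers on those two bounds, here is a small sketch of the arithmetic in Python. It assumes a 4096-byte page (PAGE_SIZE is architecture dependent) and takes the 16 slots from the description above; the function name is made up for illustration.

```python
# Illustrative arithmetic only. PAGE_SIZE = 4096 is an assumption
# (it depends on the architecture); PIPE_BUFFERS = 16 is from the text above.
PIPE_BUFFERS = 16


def pipe_capacity_bounds(page_size=4096):
    """Return (minimum, maximum) usable pipe buffer space in bytes."""
    # Worst case: each slot holds a write just over half a page, so the
    # next write cannot be coalesced into the same page.
    minimum = PIPE_BUFFERS * (page_size // 2 + 1)
    # Best case: every slot is a completely filled page.
    maximum = PIPE_BUFFERS * page_size
    return minimum, maximum


low, high = pipe_capacity_bounds()
print(low, high)  # 32784 65536
```

Under that assumption even the lower bound is far above the 2952-byte figure quoted earlier.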

			Linus


* Re: Request for detailed documentation of git pack protocol
  2009-05-14 13:55   ` Scott Chacon
  2009-05-14 14:44     ` Shawn O. Pearce
  2009-05-14 15:01     ` Jakub Narebski
@ 2009-06-02 21:39     ` Jakub Narebski
  2009-06-02 23:27       ` Shawn O. Pearce
                         ` (3 more replies)
  2 siblings, 4 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-02 21:39 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, git

On Thu, 14 May 2009, Scott Chacon wrote:
> On Tue, May 12, 2009 at 4:34 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Jakub Narebski <jnareb@gmail.com> wrote:

>>> We have now proliferation of different (re)implementations of git:
>>> JGit in Java, Dulwich in Python, Grit in Ruby; and there are other
>>> planned: git# / managed git in C# (GSoC Mono project), ObjectiveGit
>>> in Objective-C (for iPhone IIRC).  At some time they would reach
>>> the point (or reached it already) of implementing git-daemon...
>>> but currently the documentation of git protocol is lacking.
[...]

> It seems like if anyone would do what you're asking, it's probably me.
> [...]  I'm also working with Shawn
> on the Apress book, where I was going to try to document much of this
> information, perhaps I could try writing an RFC as an appendix or
> something - then that will force him to spend time correcting
> everything I got wrong :)  At least that might be a good starting
> point - I'm unfamiliar with the actual RFC process, so I'll research
> that a bit today.  I don't mind writing it, I think it would be really
> really useful to have, I just am unfamiliar with the process.

[...]
>>> It would be really nice, I think, to have RFC for git pack protocol.
>>> And it would help avoid incompatibilities between different clients
>>> and servers.  If the document would contain expected behaviour of
>>> client and server and Best Current Practices it would help avoid
>>> pitfals when implementing git-daemon in other implementation.
>>
>> Yea, it would be nice.  But find me someone who knows the protocol
>> and who has the time to document the #!@* thing.  Maybe I'll try
>> to work on this myself, but I'm strapped for time, especially over
>> the next two-to-three months.

I see that there is (at least beginnings of) description of git pack
protocol in section "Transfer Protocols"[1][2] of chapter "7. Internals
and Plumbing" of "Git Community Book".

 [1] http://book.git-scm.com/7_transfer_protocols.html
 [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown

Let me quote here the relevant part of this chapter, with some comments
whose validity I am not sure of... and therefore I'd like to ask for
comments here, rather than sending a patch or pull request already.

> ### Fetching Data with Upload Pack ###
>
> For the smarter protocols, fetching objects is much more efficient. 
> A socket is opened, either over ssh or over port 9418 (in the case of
> the git:// protocol), and the linkgit:git-fetch-pack[1] command on
> the client begins communicating with a forked
> linkgit:git-upload-pack[1] process on the server.

Is fetching over SSH exactly the same as fetching over git:// protocol?

>
> Then the server will tell the client which SHAs it has for each ref,
> and the client figures out what it needs and responds with a list of
> SHAs it wants and already has.
>
> At this point, the server will generate a packfile with all the
> objects that the client needs and begin streaming it down to the
> client.

We would probably want here an overview of client-server communication,
as described in Documentation/technical/pack-protocol.txt

>
> Let's look at an example.
>
> The client connects and sends the request header. The clone command
>
> 	$ git clone git://myserver.com/project.git
>
> produces the following request:
>
> 	0033git-upload-pack /project.git\\000host=myserver.com\\000
>
> The first four bytes contain the hex length of the line (including 4
> byte line length and trailing newline if present). Following are the
> command and arguments. This is followed by a null byte and then the
> host information. The request is terminated by a null byte.

There is a question how to organize this information. Should we describe
pkt-line format upfront, e.g. using ABNF notation from RFC 5234 used in
RFC documents:

  <pkt-line>   = ( <pkt-length> <pkt-payload> [ LF ] ) / <pkt-flush>
  <pkt-length> = 4HEXDIGIT                  ; length of <pkt-line>
  <pkt-flush>  = "0000"

or something like that?

Sidenote: wouldn't it be better to use \0 (\\0 in source) for NUL
character rather than \000 (\\000 in source) octal representation?

>
> The request is processed and turned into a call to git-upload-pack:
>
>  	$ git-upload-pack /path/to/repos/project.git

Is it "git-upload-pack" or "git upload-pack" nowadays?

Additionally currently this chapter does not explain how request for
"/project.git" is turned into /path/to/repos/project.git path to
repository both in case of git-daemon (git:// protocol) and SSH.

>
> This immediately returns information of the repo:

To be more exact this is information about references (I guess this
is information about heads only, is it?), with information about
server capabilities stuffed in.

>
> 	007c74730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress
>       003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug
>       003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist
>       003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local
>       003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master
>       0000 
>
> Each line starts with a four byte line length declaration in hex. The
> section is terminated by a line length declaration of 0000.

Should we describe here, or in appendix, or in sidenote, or in footnote
all currently supported client capabilities and server capabilities?

 * multi_ack (why not multi-ack?)
 * thin-pack 
 * side-band 
 * side-band-64k 
 * ofs-delta 
 * shallow 
 * no-progress

Is each line terminated by "\n" or "\0"? Is 'flush' a line? This is not
clear from the above description. From simply playing with nc (netcat) it
looks like each line, with the exception of '0000', is terminated with "\n".

>
> This is sent back to the client verbatim. The client responds with
> another request:
>
> 	0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta 
> 	0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe
> 	0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a
> 	0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01
> 	0032want 74730d410fcb6603ace96f1dc55ea6196122532d

The semantics (meaning) of those 'want' lines is not described here,
although one can easily guess that those are commits that the client does
not have, and which it wants. In the case of "git clone" those are all
the unique sha1s that the client got (what happens if the server has a
detached HEAD?)

It is not clear, but one can guess that the set of capabilities that the
client sends (without stuffing behind a NUL character this time?) is the
subset of server capabilities that the client supports and wants.

> 	00000009done

At first I thought that this was an error... but no, the 'flush' ("0000")
is not LF terminated.

>
> The is sent to the open git-upload-pack process which then streams
> out the final response:

Hmmm... here a different notation is used than above; everything is
within quotes, and the end-of-line character is explicitly stated this time.

>
> 	"0008NAK\n"

What does this server response mean? That the server doesn't need more
info? Having an overview of client-server communication upfront would help
here (there would be a point to refer to).

> 	"0023\\002Counting objects: 2797, done.\n"
> 	"002b\\002Compressing objects:   0% (1/1177)   \r"
> 	"002c\\002Compressing objects:   1% (12/1177)   \r"
> 	"002c\\002Compressing objects:   2% (24/1177)   \r"
> 	"002c\\002Compressing objects:   3% (36/1177)   \r"
> 	"002c\\002Compressing objects:   4% (48/1177)   \r"
> 	"002c\\002Compressing objects:   5% (59/1177)   \r"
> 	"002c\\002Compressing objects:   6% (71/1177)   \r"
> 	"0053\\002Compressing objects:   7% (83/1177)   \rCompressing objects:   8% (95/1177)   \r" ...
> 	"005b\\002Compressing objects: 100% (1177/1177)   \rCompressing objects: 100% (1177/1177), done.\n"

I guess that this is sideband support: after pkt-length there is the
number of the stream (multiplexing), where 2 = \002 means stderr.

I wonder why sometimes there is one line per update, and sometimes there
is more than one update stuffed into a single line.

>       "2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225\\017x\\234\\235\\216K\n\\302"...
>       "2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...
> 	...
> 	"0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"
> 	...

I can guess that this is example of multiplexing at work. Here again
some kind of ABNF notation would be IMHO useful, e.g.

  <pkt-line-sideband> = <pkt-length> <sideband-channel> <pkt-payload> [ LF / CR ]
  <pkt-length-sideband> = 4HEXDIGIT   ; length of <pkt-line-sideband>
  <sideband-channel> = %d01-%d02

(Or something like that; I am not sure about ABNF details here).

> 	"<\\276\\255L\\273s\\005\\001w0006\\001[0000"

Hmmm... strange, this is not in pkt-line format...

>
> See the Packfile chapter previously for the actual format of the
> packfile data in the response.

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-02 21:39     ` Jakub Narebski
@ 2009-06-02 23:27       ` Shawn O. Pearce
  2009-06-03  0:50         ` Jakub Narebski
  2009-06-03 12:29       ` Jakub Narebski
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-02 23:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> wrote:
> I see that there is (at least beginnings of) description of git pack
> protocol in section "Transfer Protocols"[1][2] of chapter "7. Internals
> and Plumbing" of "Git Community Book".
> 
>  [1] http://book.git-scm.com/7_transfer_protocols.html
>  [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
> 
> > ### Fetching Data with Upload Pack ###
> >
> > For the smarter protocols, fetching objects is much more efficient. 
> > A socket is opened, either over ssh or over port 9418 (in the case of
> > the git:// protocol), and the linkgit:git-fetch-pack[1] command on
> > the client begins communicating with a forked
> > linkgit:git-upload-pack[1] process on the server.
> 
> Is fetching over SSH exactly the same as fetching over git:// protocol?

Yes.  Except git:// starts off by sending "git-upload-pack
'repo.git'" on the wire using a pkt-line format, while ssh:// sends
that by way of the remote exec support built into the SSH protocol.
IOW, the only way that git:// differs from SSH is by providing the
smallest shim possible to replace that remote exec feature.
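
As a rough sketch of that shim, the initial git:// request could be framed like this in Python. The helper names are made up for illustration and are not taken from any git implementation; the request layout follows the clone example quoted below.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 hex digits of total length, then data."""
    # The length counts the 4 length digits themselves plus the payload.
    return b"%04x" % (len(payload) + 4) + payload


def git_daemon_request(command: str, path: str, host: str) -> bytes:
    """Build the first packet a git:// client sends (hypothetical helper)."""
    # Command and path are followed by NUL, then host=<name>, then NUL again.
    payload = f"{command} {path}\0host={host}\0".encode()
    return pkt_line(payload)


request = git_daemon_request("git-upload-pack", "/project.git", "myserver.com")
# 4 (length) + 28 (command and path) + 1 (NUL) + 17 (host=...) + 1 (NUL) = 51
print(request)
```

Over SSH there is no such packet; the command and path travel as the remote command instead.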

> > Let's look at an example.
> >
> > The client connects and sends the request header. The clone command
> >
> > 	$ git clone git://myserver.com/project.git
> >
> > produces the following request:
> >
> > 	0033git-upload-pack /project.git\\000host=myserver.com\\000
> >
> > The first four bytes contain the hex length of the line (including 4
> > byte line length and trailing newline if present). Following are the
> > command and arguments. This is followed by a null byte and then the
> > host information. The request is terminated by a null byte.
> 
> There is a question how to organize this information. Should we describe
> pkt-line format upfront, e.g. using ABNF notation from RFC 5234 used in
> RFC documents:
> 
>   <pkt-line>   = ( <pkt-length> <pkt-payload> [ LF ] ) / <pkt-flush>
>   <pkt-length> = 4HEXDIGIT                  ; length of <pkt-line>
>   <pkt-flush>  = "0000"
> 
> or something like that?

Yes.

> Sidenote: wouldn't it be better to use \0 (\\0 in source) for NUL
> character rather than \000 (\\000 in source) octal representation?

Most languages today honor '\0' or "\0" as a means of embedding a
NUL into a char type.  So \0 seems correct to me.

> > The request is processed and turned into a call to git-upload-pack:
> >
> >  	$ git-upload-pack /path/to/repos/project.git
> 
> Is it "git-upload-pack" or "git upload-pack" nowadays?

Sadly, we still invoke "git-upload-pack" IIRC.

> > This immediately returns information of the repo:
> 
> To be more exact this is information about references (I guess this
> is information about heads only, is it?)

No, it's *all* refs.  `git for-each-ref` plus HEAD.

> , with information about
> server capabilities stuffed in.
> 
> >
> > 	007c74730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress
> >       003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug
> >       003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist
> >       003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local
> >       003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master
> >       0000 
> >
> > Each line starts with a four byte line length declaration in hex. The
> > section is terminated by a line length declaration of 0000.
> 
> Should we describe here, or in appendix, or in sidenote, or in footnote
> all currently supported client capabilities and server capabilities?

Yes.

>  * multi_ack (why not multi-ack?)

Hysterical raisins.  ;-)

>  * thin-pack 
>  * side-band 
>  * side-band-64k 
>  * ofs-delta 
>  * shallow 
>  * no-progress
> 
> Is each line terminated by "\n" or "\0"?

Actually, it's weird...  Each line is terminated by a "\n" by
convention only, which is included in the 4 byte length declaration.
After reading a line the client slaps a NUL onto the end at the
position indicated by the length declaration, and processes the
line, skipping the "\n" at the end if it is present, and silently
accepting the line just fine if the "\n" is missing.

This is why the "\0capability" hack works, the client didn't care
that that first ref doesn't end in an LF.  But it stopped where that
"\0" was because it was using a C string style operator.
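
A minimal sketch of how a non-C client might handle this, assuming it already has the pkt-line payload (length prefix stripped). The function name is hypothetical; the sample lines are the ones from the advertisement quoted above.

```python
def parse_advertisement(payload: bytes):
    """Split one ref advertisement payload into (sha, refname, capabilities)."""
    line = payload.rstrip(b"\n")                # trailing LF is optional
    ref_part, _, caps = line.partition(b"\0")   # capabilities hide behind NUL
    sha, _, name = ref_part.partition(b" ")
    # Lines without a NUL simply yield an empty capability list.
    return sha.decode(), name.decode(), caps.decode().split()


first = (b"74730d410fcb6603ace96f1dc55ea6196122532d HEAD\0"
         b"multi_ack thin-pack side-band side-band-64k ofs-delta "
         b"shallow no-progress\n")
other = b"7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n"

print(parse_advertisement(first))
print(parse_advertisement(other))
```

A C client gets the same effect for free, because its string functions stop at the NUL.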

> Is 'flush' line? This is not
> clear from above description. From simple playing with nc (netcat) it
> looks like each line with exception of '0000' is terminated with "\n".

The only reason we end with "\n" is to make playing with netcat
easier.  There isn't a real practical reason in terms of the protocol
for why you need that "\n" in there.

But.  That flush line is magical.  A length of "0000" means it's a
flush packet, which has no data payload.  A "\n" after the "0000"
would break the protocol as the server would read that "\n" in a
context where it is expecting another pkt-line length declaration.
"\n" is not a hex digit, so "0000\n" is horribly horribly broken.
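
To make that concrete, here is a hedged sketch of a pkt-line reader in Python. The names are invented for illustration, and accepting uppercase hex digits is an assumption on my part, not something settled in this thread.

```python
import io

FLUSH = None  # sentinel returned for the "0000" flush packet
HEX_DIGITS = b"0123456789abcdefABCDEF"  # accepting A-F is an assumption


def read_pkt_line(stream):
    """Read one pkt-line from a binary stream; return payload or FLUSH."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended inside a length declaration")
    # Check strictly: int() alone would tolerate whitespace such as "\n".
    if any(byte not in HEX_DIGITS for byte in header):
        raise ValueError(f"bad pkt-line length {header!r}")
    length = int(header, 16)
    if length == 0:
        return FLUSH            # flush packet: no payload, no LF
    if length < 4:
        raise ValueError(f"impossible pkt-line length {length}")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended inside a payload")
    return payload              # may or may not end in "\n"


good = io.BytesIO(b"0009done\n0000")
print(read_pkt_line(good), read_pkt_line(good))

broken = io.BytesIO(b"0000\n0009done\n")
read_pkt_line(broken)           # the flush itself parses fine
try:
    read_pkt_line(broken)       # next "length" is b"\n000"
except ValueError as exc:
    print("broken:", exc)
```

The second stream shows the failure: the stray LF is consumed as the first byte of the next length declaration.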

> > This is sent back to the client verbatim. The client responds with
> > another request:
> >
> > 	0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta 
> > 	0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe
> > 	0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a
> > 	0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01
> > 	0032want 74730d410fcb6603ace96f1dc55ea6196122532d
> 
> The semantics (meaning) of those 'want' lines is not described here,
> although one can easily guess that those are commits that client does
> not have, and which do want. In the case of "git clone" those are all
> unique sha1 that client got (what happend if server has detached HEAD?)

IIRC, HEAD isn't fetched if it's detached.

The client pattern matches the advertisements against the fetch
refspec, which is "refs/heads/*:refs/remotes/origin/*" by default.
HEAD doesn't match the LHS, so it doesn't get wanted by the client.
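
Roughly, and only for the simple case where both sides of the refspec end in "*", the matching could be sketched like this (the function name is hypothetical, and real refspec handling has more cases than this):

```python
def map_ref(refname, refspec="refs/heads/*:refs/remotes/origin/*"):
    """Map an advertised ref to its local name, or None if it doesn't match.

    Simplified sketch: only handles refspecs of the form 'prefix*:prefix*'.
    """
    src, dst = refspec.split(":", 1)
    src_prefix, dst_prefix = src[:-1], dst[:-1]   # drop the trailing '*'
    if not refname.startswith(src_prefix):
        return None
    return dst_prefix + refname[len(src_prefix):]


advertised = {
    "HEAD": "74730d410fcb6603ace96f1dc55ea6196122532d",
    "refs/heads/debug": "7d1665144a3a975c05f1f43902ddaf084e784dbe",
    "refs/heads/dist": "5a3f6be755bbb7deae50065988cbfa1ffa9ab68a",
    "refs/heads/local": "7e47fe2bd8d01d481f44d7af0531bd93d3b21c01",
    "refs/heads/master": "74730d410fcb6603ace96f1dc55ea6196122532d",
}

# Keep only the refs that match the left-hand side of the refspec.
wanted = {}
for name, sha in advertised.items():
    local_name = map_ref(name)
    if local_name is not None:
        wanted[local_name] = sha

print(sorted(wanted))
```

HEAD drops out here because it does not start with "refs/heads/".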

> It is not clear, but one can guess that set of capabilities that client
> sends (without stuffing behind NUL character this time?) is a supported
> by client and wanted subset of server capabilities.

Yes.  Another oddity.  Why the heck we didn't also use the NUL hack
here is a good question.  Basically, the NUL hack wasn't necessary
in the server at the time that capabilities were added, because the
server was parsing the line with a fixed position parser.  It only
looked at the first 45 characters ("want " plus the 40 hex digits of
the SHA-1).  Anything after that was assumed to be garbage... like that
unnecessary LF.
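
For illustration, a sketch of how a client might assemble the request for the clone case (no "have" lines), with the capabilities appended after a space on the first want line only. The helper names are made up; the SHA-1s and capability names come from the example above.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def build_clone_request(shas, capabilities):
    """Build want lines, a flush and 'done' (hypothetical helper)."""
    packets = []
    for index, sha in enumerate(shas):
        line = "want " + sha
        if index == 0 and capabilities:
            # Capabilities follow a plain space here, not a NUL.
            line += " " + " ".join(capabilities)
        packets.append(pkt_line((line + "\n").encode()))
    packets.append(b"0000")              # flush: no payload, no LF
    packets.append(pkt_line(b"done\n"))
    return b"".join(packets)


request = build_clone_request(
    ["74730d410fcb6603ace96f1dc55ea6196122532d",
     "7d1665144a3a975c05f1f43902ddaf084e784dbe"],
    ["multi_ack", "side-band-64k", "ofs-delta"],
)
print(request.decode())
```

The tail of the output is the "00000009done" sequence from the example.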

> > 	00000009done
> 
> First I thought that this is an error... but not, the 'flush' ("0000")
> is not LF terminated.

Correct.  Again, server only cares that it's "done" in a packet.
I think "donedammitsendmeapacknow" is also going to make the current
servers spit back a pack.  :-)

> >
> > 	"0008NAK\n"
> 
> What does this server response mean? That served doesn't need more
> info?

It means the server is answering a prior flush from the client,
and is saying "I still can't serve you, keep telling me more haves".

> > 	"0023\\002Counting objects: 2797, done.\n"
> > 	"002b\\002Compressing objects:   0% (1/1177)   \r"
> > 	"002c\\002Compressing objects:   1% (12/1177)   \r"
> > 	"002c\\002Compressing objects:   2% (24/1177)   \r"
> > 	"002c\\002Compressing objects:   3% (36/1177)   \r"
> > 	"002c\\002Compressing objects:   4% (48/1177)   \r"
> > 	"002c\\002Compressing objects:   5% (59/1177)   \r"
> > 	"002c\\002Compressing objects:   6% (71/1177)   \r"
> > 	"0053\\002Compressing objects:   7% (83/1177)   \rCompressing objects:   8% (95/1177)   \r" ...
> > 	"005b\\002Compressing objects: 100% (1177/1177)   \rCompressing objects: 100% (1177/1177), done.\n"
> 
> I guess that it is sideband support: after pkt-length there is number
> of stream (multiplexing), where 2 = \002 means stderr.

Yes.  Actually, 2 means "progress messages, most likely suitable
for stderr".  1 means "pack data".  3 means "fatal error message,
and we're dead now".
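
A small sketch of the client side of this, assuming the pkt-line length prefixes have already been stripped so that each payload starts with the channel byte. The function name and the exception choice are mine, for illustration only.

```python
def demux_sideband(payloads):
    """Split sideband payloads into (pack_bytes, progress_messages)."""
    pack = bytearray()
    progress = []
    for payload in payloads:
        band, data = payload[0], payload[1:]
        if band == 1:
            pack += data                         # pack data
        elif band == 2:
            # progress text, most likely suitable for stderr
            progress.append(data.decode(errors="replace"))
        elif band == 3:
            # fatal error message; the server is done after this
            raise RuntimeError(data.decode(errors="replace"))
        else:
            raise ValueError(f"unknown sideband channel {band}")
    return bytes(pack), progress


pack, messages = demux_sideband([
    b"\x02Counting objects: 2797, done.\n",
    b"\x01PACK\x00\x00\x00\x02",
    b"\x02Total 2797 (delta 1799), reused 2360 (delta 1529)\n",
])
print(pack, messages)
```

Only the channel 1 bytes end up in the packfile; everything else is for the user.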

> I wonder why sometimes it is one line per update, and sometimes there
> is more than one update info stuffed in single line.

Buffering.  There are two processes running on the server side,
git-pack-objects is producing these messages on its stderr,
and the pack data on stdout.  Both are actually a pipe read by
git-upload-pack in a select loop.  If pack-objects can write two
messages into the pipe buffer before upload-pack is woken to read
them out, upload-pack might find two (or more) messages ready to
read without blocking.  These get bundled into a single packet,
because, why not, it's easier to code it that way.

It's most common at the end like that, where we dump 100%, and
then immediately add the ", done" and start a new progress meter.
It's less likely in the middle, where we try to space out the progress
updates to around 1 per second, or 1 per percentage unit.
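
The effect is easy to reproduce outside git. In this sketch a plain pipe stands in for pack-objects' stderr, and the wrapper function is a made-up stand-in for what upload-pack does with whatever a single read returns.

```python
import os


def sideband_packet(band: int, data: bytes) -> bytes:
    """Wrap data in one pkt-line on the given sideband channel."""
    payload = bytes([band]) + data
    return b"%04x" % (len(payload) + 4) + payload


read_end, write_end = os.pipe()

# The "writer" manages two progress updates before the reader wakes up.
os.write(write_end, b"Compressing objects:   7% (83/1177)   \r")
os.write(write_end, b"Compressing objects:   8% (95/1177)   \r")

# The "reader" takes whatever is available in a single read ...
chunk = os.read(read_end, 4096)
# ... and ships it as one packet carrying both updates.
packet = sideband_packet(2, chunk)

os.close(read_end)
os.close(write_end)
print(packet)
```

Nothing in the pipe marks where one message ended and the next began, so the reader has no reason to split them.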

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-02 23:27       ` Shawn O. Pearce
@ 2009-06-03  0:50         ` Jakub Narebski
  2009-06-03  1:29           ` Shawn O. Pearce
                             ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03  0:50 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git

Thank you very much for your comments!

On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

>> I see that there is (at least beginnings of) description of git pack
>> protocol in section "Transfer Protocols"[1][2] of chapter "7. Internals
>> and Plumbing" of "Git Community Book".
>> 
>>  [1] http://book.git-scm.com/7_transfer_protocols.html
>>  [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
>> 
>>> ### Fetching Data with Upload Pack ###
>>>
>>> For the smarter protocols, fetching objects is much more efficient. 
>>> A socket is opened, either over ssh or over port 9418 (in the case of
>>> the git:// protocol), and the linkgit:git-fetch-pack[1] command on
>>> the client begins communicating with a forked
>>> linkgit:git-upload-pack[1] process on the server.
>> 
>> Is fetching over SSH exactly the same as fetching over git:// protocol?
> 
> Yes.  Except git:// starts off by sending "git-upload-pack
> 'repo.git'" on the wire using a pkt-line format, while ssh:// sends
> that by way of the remote exec support built into the SSH protocol.
> IOW, the only way that git:// differs from SSH is by providing the
> smallest shim possible to replace that remote exec feature.
>  
>>> Let's look at an example.
>>>
>>> The client connects and sends the request header. The clone command
>>>
>>> 	$ git clone git://myserver.com/project.git
>>>
>>> produces the following request:
>>>
>>> 	0033git-upload-pack /project.git\\000host=myserver.com\\000
[...]

So this means that when cloning via SSH 

  $ git clone ssh://myserver.com/project.git

instead of this first request git would simply invoke [something like]:

  # ssh myserver.com git-upload-pack project.git

doesn't it? (I am not sure if it uses "project.git" or "/project.git", 
and how it generates the full pathname for the repository).


BTW I wonder why we use stuffing here using "\0" / NUL as separator
trick, and whether line has to be terminated with "\0", or can it be
terminated with "\n".

>> 
>> There is a question how to organize this information. Should we describe
>> pkt-line format upfront, e.g. using ABNF notation from RFC 5234 used in
>> RFC documents:
>> 
>>   <pkt-line>   = ( <pkt-length> <pkt-payload> [ LF ] ) / <pkt-flush>
>>   <pkt-length> = 4HEXDIGIT                  ; length of <pkt-line>
>>   <pkt-flush>  = "0000"
>> 
>> or something like that?
> 
> Yes.

where

     HEXDIGIT = 0-9 / a-f

Well, it should probably be spelled in full. Probably, because I have
no experience with using ABNF... and didn't do my research :-)

     HEXDIGIT = DIGIT / "a" / "b" / "c" / "d" / "e" / "f"
     DIGIT    = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

(should HEXDIGIT use lowercase a-f, or can it use uppercase A-F?)

>> Sidenote: wouldn't it be better to use \0 (\\0 in source) for NUL
>> character rather than \000 (\\000 in source) octal representation?
> 
> Most languages today honor '\0' or "\0" as a means of embedding a
> NUL into a char type.  So \0 seems correct to me.

That was more a question to Scott Chacon, sorry.

Does Ruby understand "\0", or do you need to spell it "\000"?

> 
>>> The request is processed and turned into a call to git-upload-pack:
>>>
>>>  	$ git-upload-pack /path/to/repos/project.git
>> 
>> Is it "git-upload-pack" or "git upload-pack" nowadays?
> 
> Sadly, we still invoke "git-upload-pack" IIRC.

So that is why git-upload-pack has to be in $PATH, or is it only because
new server can be used with old clients (before git-cmd moving outside
$PATH)?

>  
>>> This immediately returns information of the repo:
>> 
>> To be more exact this is information about references (I guess this
>> is information about heads only, is it?)
> 
> No, its *all* refs.  `git for-each-ref` plus HEAD.

You probably meant `git show-ref` plus HEAD, didn't you? 
`git for-each-ref` has different default output...

Still, example should IMHO include at least one tag...

> 
>> , with information about
>> server capabilities stuffed in.
>> 
>>>
>>> 	007c74730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress
>>> 	003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug
>>> 	003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist
>>> 	003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local
>>> 	003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master
>>> 	0000
>>>
>>> Each line starts with a four byte line length declaration in hex. The
>>> section is terminated by a line length declaration of 0000.
>> 
>> Should we describe here, or in appendix, or in sidenote, or in footnote
>> all currently supported client capabilities and server capabilities?
> 
> Yes.
> 
>>  * multi_ack (why not multi-ack?)
> 
> Hysterical rasins.  ;-)

What does multi_ack capability mean?
 
>>  * thin-pack

Server can send thin packs, i.e. packs which do not contain base
elements, if those base elements are available on the client's side.
Client has the thin-pack capability when it understands how to "thicken"
them by adding the required delta bases, making them independent.

Of course it doesn't make sense for a client to use (request) this
capability for git-clone.
  
>>  * side-band 
>>  * side-band-64k 

This probably means that the server can send, and the client understands,
multiplexed (muxed) progress reports and error info interleaved
with the packfile itself.

But I don't know what the difference is, whether the server can provide
side-band-64k without the other (side-band), and whether the client has
to request only one of those two capabilities.

>>  * ofs-delta 

Server can send, and client understands, PACKv2 with deltas referring to
their base by position in the pack rather than by SHA-1... do I understand
this correctly?

>>  * shallow 

Server can send shallow clone (git clone --depth ...).

>>  * no-progress

What does that mean?

>> 
>> Is each line terminated by "\n" or "\0"?
> 
> Actually, its weird...  Each line is terminated by a "\n" by
> convention only, which is included in the 4 byte length declaration.
> After reading a line the client slaps a NUL onto the end at the
> position indicated by the length declaration, and processes the
> line, skipping the "\n" at the end if it is present, and silently
> accepting the line just fine if the "\n" is missing.

This probably should be described... 

Does git require that each line is terminated by something (e.g. "\n"),
or does it not?

> 
> This is why the "\0capability" hack works, the client didn't care
> that that first ref doesn't end in an LF.  But it stopped where that
> "\0" was because it was using a C string style operator.

It is a bit of a pity that the git protocol was not created with
extensibility (like capabilities) in mind...

> 
>> Is 'flush' line? This is not
>> clear from above description. From simple playing with nc (netcat) it
>> looks like each line with exception of '0000' is terminated with "\n".
> 
> The only reason we end with "\n" is to make playing with netcat
> easier.  There isn't a real practical reason in terms of the protocol
> for why you need that "\n" in there.
> 
> But.  That flush line is magical.  A length of "0000" means its a
> flush packet, which has no data payload.  An "\n" after the "0000"
> would break the protocol as the server would read that "\n" in a
> context where it is expecting another pkt-line length declaration.
> "\n" is not a hex digit, so "0000\n" is horribly horribly broken.

O.K.

It would probably be more clean to explicitly include terminators in
the output (like for final response: sending packfile) and put magical
'flush' line in separate row (separate line of example output).

BTW, are "0001" - "0003" pkt-lines reserved, or just invalid?

>  
>>> This is sent back to the client verbatim. The client responds with
>>> another request:
>>>
>>> 	0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta 
>>> 	0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe
>>> 	0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a
>>> 	0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01
>>> 	0032want 74730d410fcb6603ace96f1dc55ea6196122532d
>> 
>> The semantics (meaning) of those 'want' lines is not described here,
>> although one can easily guess that those are commits that the client does
>> not have, and which it wants. In the case of "git clone" those are all
>> unique sha1 that client got (what happens if server has detached HEAD?)
> 
> IIRC, HEAD isn't fetched if its detached.
> 
> The client pattern matches the advertisements against the fetch
> refspec, which is "refs/heads/*:refs/remotes/origin/*" by default.
> HEAD doesn't match the LHS, so it doesn't get wanted by the client.

Well, unless client requests 'mirror' clone, where from what 
I understand "refs/*:refs/*" is implied... but that would also
not match HEAD.  Hmmm... I would think that "git clone --mirror"
would also mirror HEAD if it is detached.

>  
>> It is not clear, but one can guess that set of capabilities that client
>> sends (without stuffing behind NUL character this time?) is a supported
>> by client and wanted subset of server capabilities.
> 
> Yes.  Another oddity.  Why the heck we didn't also use the NUL hack
> here is a good question.  Basically, the NUL hack wasn't necessary
> in the server at the time that capabilities were added, because the
> server was parsing the line with a fixed position parser.  It only
> looked at the first 45 characters ("want 0x40").  Anything after
> that was assumed to be garbage... like that unnecessary LF.

And it was necessary when sending server capabilities because that
response cannot be parsed using a fixed position parser: HEAD might not
exist, and refs have arbitrary (well, up to something less than
PATH_MAX for sure) length.

>  
>>> 	00000009done
>> 
>> First I thought that this is an error... but no, the 'flush' ("0000")
>> is not LF terminated.
> 
> Correct.  Again, server only cares that its "done" in a packet.
> I think "donedammitsendmeapacknow" is also going to make the current
> servers spit back a pack.  :-)
>  
>>>
>>> 	"0008NAK\n"
>> 
>> What does this server response mean? That served doesn't need more
>> info?
> 
> It means the server is answering a prior flush from the client,
> and is saying "I still can't serve you, keep tell me more have".

Hmmm... the communication between server and client is not entirely
clean. Do I understand correctly that this NAK is a response to the client's
flush after all those "want" lines? And that "0009done" from the client
tells the server that it should send everything it has?

> 
>>> 	"0023\\002Counting objects: 2797, done.\n"
>>> 	"002b\\002Compressing objects:   0% (1/1177)   \r"
>>> 	"002c\\002Compressing objects:   1% (12/1177)   \r"
>>> 	"002c\\002Compressing objects:   2% (24/1177)   \r"
>>> 	"002c\\002Compressing objects:   3% (36/1177)   \r"
>>> 	"002c\\002Compressing objects:   4% (48/1177)   \r"
>>> 	"002c\\002Compressing objects:   5% (59/1177)   \r"
>>> 	"002c\\002Compressing objects:   6% (71/1177)   \r"
>>> 	"0053\\002Compressing objects:   7% (83/1177)   \rCompressing objects:   8% (95/1177)   \r" ...
>>> 	"005b\\002Compressing objects: 100% (1177/1177)   \rCompressing objects: 100% (1177/1177), done.\n"
>> 
>> I guess that it is sideband support: after pkt-length there is number
>> of stream (multiplexing), where 2 = \002 means stderr.
> 
> Yes.  Actually, 2 means "progress messages, most likely suitable
> for stderr".  1 means "pack data".  3 means "fatal error message,
> and we're dead now".

But it is easily extendable, i.e. sideband > 3 would work, although
it would be ignored, wouldn't it?

By the way, how does the client know that the server started to send the
final data, i.e. packfile multiplexed / interleaved with progress reports,
and should expect <pkt-line-band> rather than <pkt-line> output?

>  
>> I wonder why sometimes it is one line per update, and sometimes there
>> is more than one update info stuffed in single line.
> 
> Buffering.  There are two processes running on the server side,
> git-pack-objects is producing these messages on its stderr,
> and the pack data on stdout.  Both are actually a pipe read by
> git-upload-pack in a select loop.  If pack-objects can write two
> messages into the pipe buffer before upload-pack is woken to read
> them out, upload-pack might find two (or more) messages ready to
> read without blocking.  These get bundled into a single packet,
> because, why not, its easier to code it that way.
> 
> Its most common on the end like that, where we dump 100%, and
> then immediately add the ", done" and start a new progress meter.
> Its less likely in the middle, where we try to space out the progress
> updates to around 1 per second, or 1 per percentage unit.

Ahhh... now I understand. Thanks.

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  0:50         ` Jakub Narebski
@ 2009-06-03  1:29           ` Shawn O. Pearce
  2009-06-03  2:11             ` Junio C Hamano
  2009-06-03  9:21             ` Jakub Narebski
  2009-06-03  2:18           ` Robin H. Johnson
  2009-06-03 20:56           ` Tony Finch
  2 siblings, 2 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03  1:29 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> wrote:
> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> >>>
> >>> The client connects and sends the request header. The clone command
> >>>
> >>> 	$ git clone git://myserver.com/project.git
> >>>
> >>> produces the following request:
> >>>
> >>> 	0032git-upload-pack /project.git\\000host=myserver.com\\000
> [...]
> 
> So this mean that when cloning via SSH 
> 
>   $ git clone ssh://myserver.com/project.git
> 
> instead of this first request git would simply invoke [something like]:
> 
>   # ssh myserver.com git-upload-pack project.git

Actually, 

    # ssh myserver.com git-upload-pack /project.git

> isn't it? (I am not sure if it uses "project.git" or "/project.git", 
> and how it does generate full pathname for repository).

In an ssh:// format URI, it's absolute in the URI, so the / after
the host name (or port number) is sent as an argument, which is then
read by the remote git-upload-pack exactly as is, so it's effectively
an absolute path in the remote filesystem.

In a "user@host:path" format URI, it's relative to the user's home
directory, because we run:

    # ssh user@host git-upload-pack path
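
The two cases above can be sketched as a small Python function. This is only an illustration under stated assumptions: ssh_upload_pack_argv is a made-up name, the "-p" handling for an explicit port is my addition, and the shell quoting that a real client would apply to the path is left out.

```python
from urllib.parse import urlsplit


def ssh_upload_pack_argv(remote):
    """Return the argv a client might run to reach git-upload-pack over SSH."""
    if remote.startswith("ssh://"):
        parts = urlsplit(remote)
        host = parts.hostname
        if parts.username:
            host = parts.username + "@" + host
        argv = ["ssh"]
        if parts.port:
            argv += ["-p", str(parts.port)]
        # The path keeps its leading "/", so the remote side sees an absolute path.
        return argv + [host, "git-upload-pack", parts.path]
    # scp-like "user@host:path": everything after the first ":" is passed as is,
    # so a path without a leading "/" is relative to the login directory.
    host, _, path = remote.partition(":")
    return ["ssh", host, "git-upload-pack", path]


if __name__ == "__main__":
    print(ssh_upload_pack_argv("ssh://myserver.com/project.git"))
    print(ssh_upload_pack_argv("user@host:path"))
```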

> BTW I wonder why we use stuffing here using "\0" / NUL as separator
> trick, and whether line has to be terminated with "\0", or can it be
> terminated with "\n".

Stuffing here?  What are we talking about again?

> >>> The request is processed and turned into a call to git-upload-pack:
> >>>
> >>>  	$ git-upload-pack /path/to/repos/project.git
> >> 
> >> Is it "git-upload-pack" or "git upload-pack" nowadays?
> > 
> > Sadly, we still invoke "git-upload-pack" IIRC.
> 
> So that is why git-upload-pack has to be in $PATH, or is it only because
> new server can be used with old clients (before git-cmd moving outside
> $PATH)?

Both.  :-)

Clients (old and new) ask for git-upload-pack and git-receive-pack,
by SSH as we saw above... so it needs to be in the remote $PATH.

> >>> This immediately returns information of the repo:
> >> 
> >> To be more exact this is information about references (I guess this
> >> is information about heads only, is it?)
> > 
> > No, its *all* refs.  `git for-each-ref` plus HEAD.
> 
> You meant probably `git show-ref` plus HEAD, isn't it? 
> `git for-each-ref` has different default output...

Whatever.  I was talking about what we enumerate, not the output
format.  The server code is actually not using either of those
programs, but is instead just making the direct calls to refs.c
functions (for_each_ref function) and formatting the result specially
for the protocol.

> > Hysterical rasins.  ;-)
> 
> What does multi_ack capability mean?

It allows the server to return "ACK $SHA1 continue" as soon as
it finds a commit that it can use as a common base, between the
client's wants and the client's have set.

By sending this early, the server can potentially head off the
client from walking any further down that particular branch of the
client's repository history.  The client may still need to walk down
other branches, sending have lines for those, until the server has
a complete cut across the DAG, or the client has said "done".

IIRC, without multi_ack, a client sends have lines in --date-order
until the server has found a common base.  That means the client
will send have lines that are already known by the server to be
common, because they overlap in time with another branch that the
server hasn't found a common base on yet.

E.g. the client has things in caps that the server doesn't; server
has things in lower case:

     +---- u ---------------------- x
    /             +----- y
   /             /
  a -- b -- c -- d -- E -- F
   \
    +--- Q -- R -- S

If the client wants x,y and starts out by saying have F,S, the
server doesn't know what F,S is.  Eventually the client says "have
d" and the server sends "ACK d continue" to let the client know to
stop walking down that line (so don't send c-b-a), but it's not done
yet, it needs a base for x.  The client keeps going with S-R-Q,
until a gets reached, at which point the server has a clear base
and it all ends.

Without multi_ack the client would have sent that c-b-a chain anyway,
interleaved with S-R-Q.
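
A toy Python model of the example above, to show which "have" lines go out in each case. It models only the description in this mail, not fetch-pack's actual negotiation code: the commit dates are invented, the function name haves_sent is made up, "ACK ... continue" is modelled as the client marking all ancestors of the acknowledged commit as common, and the walk is not cut short once the server is satisfied.

```python
import heapq

# child -> parents, following the diagram (client-only commits in caps)
PARENTS = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["c"], "E": ["d"], "F": ["E"],
    "Q": ["a"], "R": ["Q"], "S": ["R"], "u": ["a"], "x": ["u"], "y": ["d"],
}
# invented commit dates for the commits the client has (bigger = newer)
DATE = {"a": 1, "b": 2, "Q": 3, "c": 4, "R": 5, "d": 6, "S": 7, "E": 8, "F": 9}
SERVER_HAS = {"a", "b", "c", "d", "u", "x", "y"}


def ancestors(commit):
    """All commits reachable from `commit`, excluding the commit itself."""
    seen, stack = set(), list(PARENTS[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def haves_sent(tips, multi_ack):
    """Return the 'have' lines the client sends, newest commit first."""
    queue = [(-DATE[tip], tip) for tip in tips]
    heapq.heapify(queue)
    queued = set(tips)
    known_common = set()
    sent = []
    while queue:
        _, commit = heapq.heappop(queue)
        if commit in known_common:
            continue  # already covered by an earlier "ACK <sha1> continue"
        sent.append(commit)
        if multi_ack and commit in SERVER_HAS:
            known_common |= ancestors(commit)  # stop walking down this line
        for parent in PARENTS[commit]:
            if parent not in queued:
                queued.add(parent)
                heapq.heappush(queue, (-DATE[parent], parent))
    return sent


if __name__ == "__main__":
    print("multi_ack:   ", " ".join(haves_sent(["F", "S"], True)))
    print("no multi_ack:", " ".join(haves_sent(["F", "S"], False)))
```

With these invented dates the multi_ack run sends F E S d R Q, while the run without it also sends c, b and a interleaved with R and Q.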

Junio, am I right?  I think I am, but I've had to reverse engineer
most of this.  And the above is my understanding of it.

> >>  * thin-pack
> 
> Server can send thin packs, i.e. packs which do not contain base 
> elements, if those base elements are available on clients side.
> Client has thin-pack capability when it understand how to "thicken"
> them adding required delta bases making them independent.

Yes.

> Of course it doesn't make sense for client to use (request) this
> capability for git-clone.

No, no it doesn't.  But if the client does request it (and I think
modern clients actually do request it, even on initial clone case)
the server won't produce a thin pack. Why?  There is no common base,
so there is no uninteresting set to omit from the pack.  :-)

> >>  * side-band 
> >>  * side-band-64k 
> 
> This probably means that server can send, and client understand 
> multiplexed (muxed) progress reports and error info interleaved
> with the packfile itself.
> 
> But I don't know what is the difference, whether server can provide
> side-band-64k without the other (side-band), and whether client has
> to request only one of those two capabilities.

Yes.  These two options are mutually exclusive. A client should
ask for only one of them, and a modern client always favors
side-band-64k.

Long ago, we only had side-band, which allowed up to 1000 bytes
per packet.  But the packet length field is 4 bytes, in hex, so 16
bits worth of information space.  Limiting it to only 1000 bytes
for a large 800 MiB binary pack file on initial clone is really
quite poor usage of the data stream space.

We couldn't just up the limit the server sends to the full 2^16
because older clients literally had a char[1000] allocated on the
stack, and we'd overflow it.  So "side-band-64k" came about as
another way for the client to request side-band, but to also say
it can handle the much larger packets, packets that are actually
crammed nearly full (65520 bytes).
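
As an illustration of the two limits, here is a sketch of how a sender might cut pack data into side-band packets. The names are made up, and the sketch assumes that the limit covers the whole packet, i.e. the 4-byte length field and the 1-byte band number are counted against the 1000 or 65520 bytes.

```python
SIDE_BAND_MAX = 1000       # whole-packet limit with plain side-band
SIDE_BAND_64K_MAX = 65520  # whole-packet limit with side-band-64k


def side_band_packets(data, band, max_packet):
    """Yield pkt-lines that carry `data` on the given side-band channel."""
    chunk = max_packet - 5  # room left after 4 length bytes + 1 band byte
    for start in range(0, len(data), chunk):
        piece = data[start:start + chunk]
        yield b"%04x" % (len(piece) + 5) + bytes([band]) + piece


if __name__ == "__main__":
    pack = b"x" * 2000
    small = list(side_band_packets(pack, 1, SIDE_BAND_MAX))
    large = list(side_band_packets(pack, 1, SIDE_BAND_64K_MAX))
    print(len(small), "packets with side-band,", len(large), "with side-band-64k")
```

The per-packet overhead is the same five bytes either way, so the larger limit mostly saves on the number of packets and the reads and writes that go with them.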

> >>  * ofs-delta 
> 
> Server can send, and client understand PACKv2 with delta refering to
> its base by position in pack rather than by SHA-1... do I understand
> this correctly?

Yes.  It's that they can send/read OBJ_OFS_DELTA, aka type 6 in
a pack file.

> >>  * shallow 
> 
> Server can send shallow clone (git clone --depth ...).
> 
> >>  * no-progress
> 
> What that does mean?

The client was started with "git clone -q" or something, and doesn't
want that side band 2.  Basically the client just says "I do not
wish to receive stream 2 on sideband, so do not send it to me,
and if you did, I will drop it on the floor anyway".

> >> Is each line terminated by "\n" or "\0"?
> > 
> > Actually, its weird...  Each line is terminated by a "\n" by
> > convention only, which is included in the 4 byte length declaration.
> > After reading a line the client slaps a NUL onto the end at the
> > position indicated by the length declaration, and processes the
> > line, skipping the "\n" at the end if it is present, and silently
> > accepting the line just fine if the "\n" is missing.
> 
> This probably should be described... 
> 
> Does git require that each line is terminated by something (e.g. "\n"),
> or does it not?

It doesn't, but convention says to include the "\n" to be nice to
a human.  Junio may argue that it does require it, I don't know,
but if you read through any modern implementation (e.g. C Git or
JGit) the "\n" is entirely optional when parsing the lines.

Actually, it has to be, because that "\n" isn't there on the first
line when the capability data appears either wedged behind the "\0"
or after the " " at the end of the line.
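
Both spots where capabilities get wedged in can be sketched in a few lines of Python. The function names are made up; the sketch takes the pkt-line payload (without the 4-byte length), treats the trailing "\n" as optional, splits the server's advertisement at the NUL, and reads a "want" line by fixed position (5 bytes of "want " plus 40 hex digits).

```python
def strip_lf(payload):
    """Drop one trailing LF if present; a missing LF is accepted silently."""
    return payload[:-1] if payload.endswith(b"\n") else payload


def parse_advertised_ref(payload):
    """Split an advertisement line into (sha1, refname, capabilities)."""
    ref_part, _, caps = strip_lf(payload).partition(b"\0")
    sha1, _, refname = ref_part.partition(b" ")
    return sha1.decode(), refname.decode(), caps.decode().split()


def parse_want(payload):
    """Split a want line into (sha1, capabilities) by fixed position."""
    line = strip_lf(payload)
    if not line.startswith(b"want ") or len(line) < 45:
        raise ValueError("not a want line")
    return line[5:45].decode(), line[45:].decode().split()


if __name__ == "__main__":
    sha = b"74730d410fcb6603ace96f1dc55ea6196122532d"
    print(parse_advertised_ref(sha + b" HEAD\0multi_ack thin-pack\n"))
    print(parse_want(b"want " + sha + b" multi_ack side-band-64k ofs-delta \n"))
```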

> > This is why the "\0capability" hack works, the client didn't care
> > that that first ref doesn't end in an LF.  But it stopped where that
> > "\0" was because it was using a C string style operator.
> 
> It is a bit pity that git protocol was not created with extendability
> (like capabilities) in mind...

Yes, no doubt.  There are many things I would have done differently,
given that I now have 20/20 hindsight vision into the past's future.

:-)

The protocol (mostly) works fine as-is.  It's widely distributed in
terms of clients using it on a daily basis.  It's likely to continue
to serve our needs well into the future.  So, it is what it is.

> BTW. do "0001" - "0003" pkt-lines are reserved, or just invalid?

Invalid.  No clue if they are considered "reserved for future use".
I don't think they are, I think they are just flat out not something
any client can ever sanely produce.

But, hey, look, another back door we can use in the future to
wedge something else into this protocol, after introducing a new
capability for it.  :-)

> >>>
> >>> 	"0008NAK\n"
> >> 
> >> What does this server response mean? That served doesn't need more
> >> info?
> > 
> > It means the server is answering a prior flush from the client,
> > and is saying "I still can't serve you, keep tell me more have".
> 
> Hmmm... the communication between server and client is not entirely
> clean. Do I understand correctly that this NAK is response to clients
> flush after all those "want" lines?

Yes.

> And that "0009done" from client
> tells server that it should send everything it has?

Yes.  It means the client will not issue any more "have" lines,
as it has nothing further in its history, so the server just has
to give up and start generating a pack based on what it knows.

> > Yes.  Actually, 2 means "progress messages, most likely suitable
> > for stderr".  1 means "pack data".  3 means "fatal error message,
> > and we're dead now".
> 
> But it is easily extendable, i.e. sideband > 3 would work, although
> be ignored, isn't it?

Correct.
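
A sketch of the client side of this, in Python: read side-band packets until the flush, route channel 1 to the pack and channel 2 to progress, treat channel 3 as fatal, and skip anything else. The function name is made up, and ignoring channels above 3 follows the answer just given; a stricter client might reject them instead.

```python
import io


def demux_side_band(stream):
    """Read side-band pkt-lines up to the flush; return (pack, progress)."""
    pack, progress = bytearray(), bytearray()
    while True:
        header = stream.read(4)
        if len(header) < 4:
            raise EOFError("stream ended inside a pkt-line header")
        length = int(header.decode("ascii"), 16)
        if length == 0:  # "0000" flush: end of the multiplexed stream
            break
        if length < 5:
            raise ValueError("side-band packet without a band byte")
        body = stream.read(length - 4)
        if len(body) != length - 4:
            raise EOFError("stream ended inside a pkt-line payload")
        band, data = body[0], body[1:]
        if band == 1:
            pack += data          # pack data
        elif band == 2:
            progress += data      # progress messages, suitable for stderr
        elif band == 3:
            raise RuntimeError(data.decode(errors="replace"))  # fatal error
        # any other band is silently skipped in this sketch
    return bytes(pack), bytes(progress)


if __name__ == "__main__":
    wire = io.BytesIO(b"0009\x01PACK" + b"0008\x02hi\n" + b"0006\x07x" + b"0000")
    print(demux_side_band(wire))
```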

> By the way, how client does know that server started to send final
> data, i.e. packfile multiplexed / interleaved with progress reports,
> and should expect <pkt-line-band> rather than <pkt-line> output?

After the client receives an "ACK" or "NAK" for the number of
outstanding flushes it still has, *after* it has sent "done".
This also varies based on whether or not multi_ack was enabled.

It's ugly.  But basically you keep a running counter of each "flush"
sent, and then you send a "done" out, and then you wait until
you have the right number of ACK/NAK answers back, and then the
stream changes format.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  1:29           ` Shawn O. Pearce
@ 2009-06-03  2:11             ` Junio C Hamano
  2009-06-03  2:15               ` Shawn O. Pearce
  2009-06-03  9:21             ` Jakub Narebski
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2009-06-03  2:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, Scott Chacon, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> IIRC, without multi_ack, a client sends have lines in --date-order
> until the server has found a common base.  That means the client
> will send have lines that are already known by the server to be
> common, because they overlap in time with another branch that the
> server hasn't found a common base on yet.
>
> E.g. the client has things in caps that the server doesn't; server
> has things in lower case:
>
>      +---- u ---------------------- x
>     /             +----- y
>    /             /
>   a -- b -- c -- d -- E -- F
>    \
>     +--- Q -- R -- S
>
> If the client wants x,y and starts out by saying have F,S, the
> server doesn't know what F,S is.  Eventually the client says "have
> d" and the server sends "ACK d continue" to let the client know to
> stop walking down that line (so don't send c-b-a), but it's not done
> yet, it needs a base for x.  The client keeps going with S-R-Q,
> until a gets reached, at which point the server has a clear base
> and it all ends.
>
> Without multi_ack the client would have sent that c-b-a chain anyway,
> interleaved with S-R-Q.
>
> Junio, am I right?

Correct.

>> >>  * thin-pack
>> 
>> Server can send thin packs, i.e. packs which do not contain base 
>> elements, if those base elements are available on clients side.
>> Client has thin-pack capability when it understand how to "thicken"
>> them adding required delta bases making them independent.
>
> Yes.
>  
>> Of course it doesn't make sense for client to use (request) this
>> capability for git-clone.
>
> No, no it doesn't.  But if the client does request it (and I think
> modern clients actually do request it, even on initial clone case)
> the server won't produce a thin pack. Why?  There is no common base,
> so there is no uninteresting set to omit from the pack.  :-)

There also is "clone --reference".

> Actually, it has to be, because that "\n" isn't there on the first
> line when the capability data appears either wedged behind the "\0"
> or after the " " at the end of the line.

Correct.

> Its ugly.  But basically you keep a running counter of each "flush"
> sent, and then you send a "done" out, and then you wait until
> you have the right number of ACK/NAK answers back, and then the
> stream changes format.

One thing that I did not see mentioned in this thread is that the
implementation is allowed to buffer non-flush packets and send multiple of
them out with a single write(2).  In other words, packet_write() could
buffer instead of directly calling safe_write(), while packet_flush() must
do safe_write() and make sure it drains.
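
In Python terms the allowed buffering could look like the sketch below. The class is made up and only borrows the packet_write / packet_flush names from the paragraph above; it assumes the underlying stream's write() takes the whole buffer (true for io.BytesIO and buffered file objects, not for a raw non-blocking socket).

```python
import io


class PktWriter:
    """Buffer ordinary pkt-lines; only a flush packet forces them out."""

    def __init__(self, raw):
        self.raw = raw
        self.buf = bytearray()

    def packet_write(self, payload):
        # Queue the framed packet; nothing needs to reach the wire yet.
        self.buf += b"%04x" % (len(payload) + 4) + payload

    def drain(self):
        # Push whatever is buffered to the peer without adding a "0000".
        self.raw.write(bytes(self.buf))
        self.raw.flush()
        self.buf.clear()

    def packet_flush(self):
        # A flush packet must actually reach the peer, so drain right after it.
        self.buf += b"0000"
        self.drain()


if __name__ == "__main__":
    wire = io.BytesIO()
    out = PktWriter(wire)
    out.packet_write(b"want 74730d410fcb6603ace96f1dc55ea6196122532d\n")
    print(len(wire.getvalue()))  # nothing written before the flush
    out.packet_flush()
    print(wire.getvalue()[-4:])
```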


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  2:11             ` Junio C Hamano
@ 2009-06-03  2:15               ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03  2:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, Scott Chacon, git

Junio C Hamano <gitster@pobox.com> wrote:
> One thing that I did not see mentioned in this thread is that the
> implementation is allowed to buffer non-flush packets and send multiple of
> them out with a single write(2).  In other words, packet_write() could
> buffer instead of directly calling safe_write(), while packet_flush() must
> do safe_write() and make sure it drains.

Good point.

That's one reason why in JGit I call the flush packet of "0000"
end(), and flush() triggers the drain.  JGit buffers everything
it's writing, but only by one standard "have" window IIRC.

JGit server code triggers a flush() after a side-band channel 2 packet
ends, but not an end(), because we only want to drain to the network,
not inject a bad "0000" packet in the stream.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  0:50         ` Jakub Narebski
  2009-06-03  1:29           ` Shawn O. Pearce
@ 2009-06-03  2:18           ` Robin H. Johnson
  2009-06-03 10:47             ` Jakub Narebski
  2009-06-03 20:56           ` Tony Finch
  2 siblings, 1 reply; 66+ messages in thread
From: Robin H. Johnson @ 2009-06-03  2:18 UTC (permalink / raw)
  To: Git Mailing List


> >>> 	"0008NAK\n"
> >> What does this server response mean? That served doesn't need more
> >> info?
> > It means the server is answering a prior flush from the client,
> > and is saying "I still can't serve you, keep tell me more have".
> Hmmm... the communication between server and client is not entirely
> clean. Do I understand correctly that this NAK is response to clients
> flush after all those "want" lines? And that "0009done" from client
> tells server that it should send everything it has?
Relatedly with the "done" message, I'm in the process of writing a hook
that allows the server to deny the client at this point, instead of
building and sending a pack.

Suggestions are welcome on other modifications that might be needed to integrate this.
The hook:
- takes all want/have lines as input (maybe capabilities too?)
- returns 0/1
- on error, should also send a message to stderr, to be passed over the
  wire.

My intended use is to block initial clones while still allowing updates
(as long as you've got a tree at least commit X recent, I'll talk to
you). Initial and too-old clients get a message to go and download a
bundle instead.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85



* Re: Request for detailed documentation of git pack protocol
  2009-06-03  1:29           ` Shawn O. Pearce
  2009-06-03  2:11             ` Junio C Hamano
@ 2009-06-03  9:21             ` Jakub Narebski
  2009-06-03 14:48               ` Shawn O. Pearce
  1 sibling, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03  9:21 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git, Junio C Hamano

On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
>>>>>
>>>>> The client connects and sends the request header. The clone command
>>>>>
>>>>> 	$ git clone git://myserver.com/project.git
>>>>>
>>>>> produces the following request:
>>>>>
>>>>> 	0032git-upload-pack /project.git\\000host=myserver.com\\000
>> [...]
>> 
>> So this mean that when cloning via SSH 
>> 
>>   $ git clone ssh://myserver.com/project.git
>> 
>> instead of this first request git would simply invoke [something like]:
>> 
>>   # ssh myserver.com git-upload-pack project.git
> 
> Actually, 
> 
>     # ssh myserver.com git-upload-pack /project.git
>  
>> isn't it? (I am not sure if it uses "project.git" or "/project.git", 
>> and how it does generate full pathname for repository).
> 
> In an ssh:// format URI, its absolute in the URI, so the / after
> the host name (or port number) is sent as an argument, which is then
> read by the remote git-upload-pack exactly as is, so its effectively
> an absolute path in the remote filesystem.
> 
> In a "user@host:path" format URI, its relative to the user's home
> directory, because we run:
> 
>     # ssh user@host git-upload-pack path

By the way, this accidentally shows why one might want to prefer 
scp-like / ssh-like "URL" for SSH fetch / push, i.e.

  [user@]myserver.com:/path/to/repo.git/

rather than ssh:// URL version

  ssh://[user@]myserver.com/path/to/repo.git/

On the other hand I think only the URL version allows specifying a
nonstandard port (well, that and ~/.ssh/config).

>> BTW I wonder why we use stuffing here using "\0" / NUL as separator
>> trick, and whether line has to be terminated with "\0", or can it be
>> terminated with "\n".
> 
> Stuffing here?  What are we talking about again?

I'm sorry, I was too cryptic here.

I meant that in the request line for fetching via git:// protocol

	0032git-upload-pack /project.git\\000host=myserver.com\\000

you separate the path to the repository from extra options using "\0" / NUL
as a separator. Well, this is the only sane separator, as it is the path
terminator, the only character which cannot appear in a pathname
(although I do wonder whether project names with e.g. control
characters or UTF-8 characters would work correctly).

Is the final terminating character required to be NUL ("\0"), or can
it be LF ("\n"), i.e.

	0032git-upload-pack /project.git\\000host=myserver.com\\n

What options besides (required?) "host=<server>[:<port>]" are supported?
Do I understand correctly that "host=<host>" information is required
for core.gitProxy to work?
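
For reference, the request line discussed here can be built with a few lines of Python. The function name is made up, the sketch only handles the single "host=" parameter shown in the example, and it assumes the length prefix counts its own four bytes plus both NULs.

```python
def git_proto_request(service, path, host):
    """Build the initial git:// request: '<service> <path>' NUL 'host=<host>' NUL."""
    payload = (service.encode() + b" " + path.encode()
               + b"\0host=" + host.encode() + b"\0")
    return b"%04x" % (len(payload) + 4) + payload


if __name__ == "__main__":
    print(git_proto_request("git-upload-pack", "/project.git", "myserver.com"))
```

For the example values this sketch produces a prefix of 0033 (47 payload bytes plus the 4 length bytes), so the 0032 shown in the quoted request may be off by one.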

>>>>  * no-progress
>> 
>> What that does mean?
> 
> The client was started with "git clone -q" or something, and doesn't
> want that side band 2.  Basically the client just says "I do not
> wish to receive stream 2 on sideband, so do not send it to me,
> and if you did, I will drop it on the floor anyway".

Does this mean that if the server does not support the "no-progress" capability
then the client is required to drop the diagnostics by itself? Can the client
request to not use sideband (multiplexing) if it is asking for "no-progress";
or is multiplexing required for possible signaling of an error condition
on channel 3?

>> It is a bit pity that git protocol was not created with extendability
>> (like capabilities) in mind...
> 
> Yes, no doubt.  There are many things I would have done differently,
> given that I now have 20/20 hindsight vision into the past's future.
> 
> :-)
> 
> The protocol (mostly) works fine as-is.  Its widely distributed in
> terms of clients using it on a daily basis.  Its likely to continue
> to serve our needs well into the future.  So, it is what it is.

I do wonder if existing Internet Standard (in the meaning of RFC) 
protocols also have such kludges and hacks...

>> By the way, how client does know that server started to send final
>> data, i.e. packfile multiplexed / interleaved with progress reports,
>> and should expect <pkt-line-band> rather than <pkt-line> output?
> 
> After the client receives a "ACK" or "NAK" for the number of
> outstanding flushes it still has, *after* it has sent "done".
> This also varies based on whether or not multi_ack was enabled.
> 
> Its ugly.  But basically you keep a running counter of each "flush"
> sent, and then you send a "done" out, and then you wait until
> you have the right number of ACK/NAK answers back, and then the
> stream changes format.

Hmmm... perhaps it would be better if pkt-line-sideband had some
distinguishing characteristics from ordinary pkt-line, or that sending
multiplexed (with sideband) output was preceded by some signal like
"0001" or "0004" or "0005\n", or "000dsideband\n".  But as you said
hindsight is 20/20.

P.S. By the way, is the pkt-line format an original invention, or was it
'borrowed' from some other standard or protocol?

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  2:18           ` Robin H. Johnson
@ 2009-06-03 10:47             ` Jakub Narebski
  2009-06-03 14:17               ` Shawn O. Pearce
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 10:47 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List, Shawn O. Pearce, Scott Chacon

Please try to not cull the CC list (if possible), and provide
attribution for quoting.

"Robin H. Johnson" <robbat2@gentoo.org> writes:

>>>>> 	"0008NAK\n"
>>>>>
>>>> What does this server response mean? That the server doesn't need more
>>>> info?
>>>>
>>> It means the server is answering a prior flush from the client,
>>> and is saying "I still can't serve you, keep telling me more haves".
>>>
>> Hmmm... the communication between server and client is not entirely
>> clean. Do I understand correctly that this NAK is a response to the client's
>> flush after all those "want" lines? And that "0009done" from the client
>> tells the server that it should send everything it has?

> Relatedly with the "done" message, I'm in the process of writing a hook
> that allows the server to deny the client at this point, instead of
> building and sending a pack.
> 
> Suggestions on other modifications that might be needed to integrate. 
> The hook:
> - takes all want/have lines as input (maybe capabilities too?)
> - returns 0/1
> - on error, should also send a message to stderr, to be passed over the
>   wire.

I am not sure if it would be possible to fit a hook there, but perhaps
it would be possible to add such a `pre-upload` hook... Note that it
would have to somehow work for both git:// and ssh:// protocols, and
perhaps also for "dumb" protocols such as http:// (and other
curl-based) and deprecated rsync://

> 
> My intended use is to block initial clones while still allowing updates
> (as long as you've got a tree at least commit X recent, I'll talk to
> you). Initial and too-old clients get a message to go and download a
> bundle instead.

Wouldn't it be better to make use of mirror-sync (which sadly is in
planning stages only; see SoC2009Ideas page on git wiki) to redirect
to some other repository to be used for cloning requests?

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Request for detailed documentation of git pack protocol
  2009-06-02 21:39     ` Jakub Narebski
  2009-06-02 23:27       ` Shawn O. Pearce
@ 2009-06-03 12:29       ` Jakub Narebski
  2009-06-03 14:19         ` Shawn O. Pearce
  2009-06-04 20:55       ` Jakub Narebski
  2009-06-06 21:38       ` Comments pack protocol description in "Git Community Book" (second round) Jakub Narebski
  3 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 12:29 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, git

On Tue, 2 Jun 2009, Jakub Narebski wrote:
> Should we describe here, or in appendix, or in sidenote, or
> in a footnote, all currently supported client capabilities
> and server capabilities? 
> 
>  * multi_ack (why not multi-ack?)
>  * thin-pack 
>  * side-band 
>  * side-band-64k 
>  * ofs-delta 
>  * shallow 
>  * no-progress

There is also another capability

   * include-tag

What does it mean? Is it about sending tags if we are sending objects 
they point to, or is it about sending all tags?


P.S. Is hexdigit length case sensitive i.e. 0-9a-f, or is it not
     case sensitive i.e. 0-9a-fA-F?
-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 10:47             ` Jakub Narebski
@ 2009-06-03 14:17               ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 14:17 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Robin H. Johnson, Git Mailing List, Scott Chacon

Jakub Narebski <jnareb@gmail.com> wrote:
> "Robin H. Johnson" <robbat2@gentoo.org> writes:
> 
> > Relatedly with the "done" message, I'm in the process of writing a hook
> > that allows the server to deny the client at this point, instead of
> > building and sending a pack.
> 
> I am not sure if it would be possible to fit a hook there, but perhaps
> it would be possible to add such a `pre-upload` hook... Note that it
> would have to somehow work for both git:// and ssh:// protocols, and
> perhaps also for "dumb" protocols such as http:// (and other
> curl-based) and deprecated rsync://

Uh, that hook can't be used on HTTP or rsync.  How do you expect the
HTTP or rsync client to execute a process on the server?  It can't,
the server isn't git aware.  It's only valid on smart protocols.
 
> > My intended use is to block initial clones while still allowing updates
> > (as long as you've got a tree at least commit X recent, I'll talk to
> > you). Initial and too-old clients get a message to go and download a
> > bundle instead.
> 
> Wouldn't it be better to make use of mirror-sync (which sadly is in
> planning stages only; see SoC2009Ideas page on git wiki) to redirect
> to some other repository to be used for cloning requests?

Yes.  But mirror-sync isn't here yet, and is a lot more work to
create than hacking upload-pack.c to invoke a hook.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 12:29       ` Jakub Narebski
@ 2009-06-03 14:19         ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 14:19 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> wrote:
> On Tue, 2 Jun 2009, Jakub Narebski wrote:
> > Should we describe here, or in appendix, or in sidenote, or
> > in a footnote, all currently supported client capabilities
> > and server capabilities? 
> 
>    * include-tag
> 
> What does it mean? Is it about sending tags if we are sending objects 
> they point to,

Yes, this.

> or is it about sending all tags?

No, not this.

If we pack an object to the client, and a tag points exactly at
that object, we pack the tag too.  In general this allows a client
to get all new tags when it fetches a branch, in a single network
connection.
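
The include-tag rule described above can be sketched roughly as follows.
This is only an illustration: the function name and the plain set/dict
inputs are assumptions made for the example, not how any real
implementation stores its objects.

```python
def tags_to_include(packed_objects, tags):
    """Pick the tag objects that should ride along with a pack.

    packed_objects: set of object ids already chosen for the pack
                    (hypothetical in-memory stand-in for the real pack list).
    tags:           mapping of tag object id -> id of the object the tag
                    points at directly.
    """
    return {
        tag_id
        for tag_id, target in tags.items()
        # Only tags pointing exactly at something we are sending anyway,
        # and only if the tag itself is not already in the pack.
        if target in packed_objects and tag_id not in packed_objects
    }
```

A tag whose target is not being sent is left out, which is the difference
between "tags for what we pack" and "all tags".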
 
> P.S. Is hexdigit length case sensitive i.e. 0-9a-f, or is it not
>      case sensitive i.e. 0-9a-fA-F?

Git parses both a-f/A-F, but prefers to create a-f.
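
A minimal sketch of that behaviour for the pkt-line length prefix, in
Python: the reader accepts both cases, the writer emits lowercase.  The
function names are invented for this example.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"

def decode_pkt_len(prefix: bytes) -> int:
    """Decode the 4-hex-digit pkt-line length; accepts a-f and A-F."""
    if len(prefix) != 4:
        raise ValueError("pkt-line length prefix must be exactly 4 bytes")
    # int() alone would also tolerate signs, spaces or "0x", so check first.
    if any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("non-hex digit in pkt-line length")
    return int(prefix.decode("ascii"), 16)

def encode_pkt_line(payload: bytes) -> bytes:
    """Frame a payload; the length includes the 4 prefix bytes, lowercase hex."""
    return b"%04x" % (len(payload) + 4) + payload
```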

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03  9:21             ` Jakub Narebski
@ 2009-06-03 14:48               ` Shawn O. Pearce
  2009-06-03 15:07                 ` Shawn O. Pearce
                                   ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 14:48 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

Jakub Narebski <jnareb@gmail.com> wrote:
> I'm sorry, I was too cryptic here.
> 
> I meant that in the request line for fetching via git:// protocol
> 
> 	0032git-upload-pack /project.git\\000host=myserver.com\\000
> 
> you separate path to repository from extra options using "\0" / NUL
> as a separator. Well, this is only sane separator, as it is path 
> terminator, the only character which cannot appear in pathname 
> (although I do wonder whether project names with e.g. control 
> characters or UTF-8 characters would work correctly).

No, that isn't the reason '\0' is used here.  But yea, that is true.

The reason \0 is used is, git-daemon reads the 4 byte length, decodes
that, then reads that many bytes.  Finally it writes a '\0' at the
end of what it read, so that the entire "line" is NUL terminated.
Then it reads the "command path" part from the resulting C string.

The host=myserver.com part came later, after many daemons were
already running all over the world.  By hiding it behind the '\0'
an old daemon would never see it (strlen() returned a value that
was less than the length read, but the old daemons didn't care).
Newer daemons look for where strlen() < length, and assume that
the host header follows.

The host header ends with '\0' in case additional headers would
also appear here in the future.  IOW, like HTTP allows new headers
to be added before the "\r\n\r\n" terminator at the body, we allow
them between "\0".
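
The strlen() < length trick can be illustrated with a small Python sketch.
The function names are made up for the example, and the parsing is a
simplified model of what a daemon does, not a copy of daemon.c.  For the
path and host used in the example quoted earlier the payload is 47 bytes
(each NUL counts as one byte), so this sketch emits the prefix 0033.

```python
def build_request(command: str, path: str, host: str) -> bytes:
    """Build the initial git:// request line with a NUL-hidden host header."""
    payload = (f"{command} {path}".encode("ascii") + b"\0"
               + f"host={host}".encode("ascii") + b"\0")
    return b"%04x" % (len(payload) + 4) + payload

def split_request(pkt: bytes):
    """Return (what an old daemon sees, list of extra headers)."""
    length = int(pkt[:4].decode("ascii"), 16)
    body = pkt[4:length]
    # An old daemon treats the buffer as a C string and stops at the first NUL.
    visible, sep, hidden = body.partition(b"\0")
    # A newer daemon notices len(visible) < len(body) and reads what follows
    # as NUL-terminated "key=value" headers.
    extras = [h for h in hidden.split(b"\0") if h] if sep else []
    return visible, extras
```

An old-style request without any NUL simply yields an empty header list,
which is why both kinds of client can talk to a newer daemon.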

Why '\0'?  The only real Git implementation that matters is C Git,
and it's written in C, and that's easy to work with in C.

As far as UTF-8 or other characters... that path is scanned to check
for nasty cases like '../../../../etc/passwd', but is otherwise
handed off to the system's stat() and chdir() functions as-is.  So
like any other path in Git, it had damn well better match what the
host will recognize.

If the host is using SHIFT-JIS on its filesystem, then a client must
request the path in SHIFT-JIS.  And there is no way to specify that
to the client in advance.

In practice, I think most people stick to a latin1-style character
set here, maybe even the commonly acceptable printable characters
for US-ASCII, so it winds up being not that much of an issue.

> Is the final terminating character required to be NUL ("\0"), or can
> it be for LF ("\n"), i.e.
> 
> 	0032git-upload-pack /project.git\\000host=myserver.com\\n

The LF thing is like I said before, for a human, not the machine.
Hell, if the LF is present I think it would have to be *after* the
'\0' in the line, otherwise git daemon would assume that the host
name includes an LF at the end of it.

The NUL at the end of the host name is not strictly required, but
must be present if the client were to ever pass additional options
to the server.

See above about why... the daemon reads the line, sticks a NUL at the end,
so if the host header doesn't end in NUL on the wire, it does now
in memory.

> What options besides (required?) "host=<server>[:<port>]" are supported?

Currently only host is supported.  And yea, it takes the :<port> if
the client included the port number in the URL (git://foo:8813/path).

Actually, I just realized JGit isn't compliant here.  It doesn't
send the :<port> like C Git would.

> Do I understand correctly that "host=<host>" information is required
> for core.gitProxy to work, isn't it?

No.  It's for git-daemon's name-based virtual hosting.
See --interpolated-path option to git daemon, with the %H/%CH
format characters.

> >>>>  * no-progress
> >> 
> >> What that does mean?
> > 
> > The client was started with "git clone -q" or something, and doesn't
> > want that side band 2.  Basically the client just says "I do not
> > wish to receive stream 2 on sideband, so do not send it to me,
> > and if you did, I will drop it on the floor anyway".
> 
> Does this mean that if server does not support "no-progress" capability
> then client is required to drop diagnostic by itself?

Yes.

> Can client request
> to not use sideband (multiplexing) if it is asking for "no-progress";
> or is multiplexing required for possible signaling of error condition 
> on channel 3?

We still want it for the error condition on channel 3.  But if the
client didn't care about errors, and wanted no-progress, and the
server didn't support no-progress, then yes, the client could just
avoid asking for the side-band capability.
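
For reference, a rough Python sketch of a client reading the multiplexed
stream.  It assumes the usual side-band layout in which the first payload
byte of each pkt-line is the channel number (1 = pack data, 2 = progress,
3 = error); the function name and the want_progress flag are invented for
this example.

```python
import io

def demux_sideband(stream, want_progress=True):
    """Read side-band pkt-lines until a flush; return (pack_bytes, progress)."""
    pack, progress = bytearray(), []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            raise EOFError("stream ended before the flush packet")
        length = int(header.decode("ascii"), 16)
        if length == 0:              # "0000" flush ends the multiplexed data
            break
        if length < 5:
            raise ValueError("side-band packet has no channel byte")
        data = stream.read(length - 4)
        if len(data) != length - 4:
            raise EOFError("truncated side-band packet")
        band, body = data[0], data[1:]
        if band == 1:                # pack data
            pack += body
        elif band == 2:              # progress; a quiet client drops it here
            if want_progress:
                progress.append(body.decode("utf-8", "replace"))
        elif band == 3:              # error reported by the server
            raise RuntimeError(body.decode("utf-8", "replace"))
        else:
            raise ValueError(f"unknown side-band channel {band}")
    return bytes(pack), "".join(progress)

# One pack packet, one progress packet, then a flush.
demo = io.BytesIO(b"0009\x01PACK" + b"000a\x02hello" + b"0000")
pack_data, progress_text = demux_sideband(demo)
```

A client that asked for no-progress from a server that lacks the
capability would call this with want_progress=False and still see channel 3.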

> >> It is a bit of a pity that the git protocol was not created with extendability
> >> (like capabilities) in mind...
> > 
> > Yes, no doubt.  There are many things I would have done differently,
> > given that I now have 20/20 hindsight vision into the past's future.
> > 
> > :-)
> > 
> > > The protocol (mostly) works fine as-is.  It's widely distributed in
> > > terms of clients using it on a daily basis.  It's likely to continue
> > to serve our needs well into the future.  So, it is what it is.
> 
> I do wonder if existing Internet Standard (in the meaning of RFC) 
> protocols also have such kludges and hacks...

I'm sure they have some... oddities.  But perhaps not as bad as git.

We have a history of not leaving ourselves room for future expansion,
and then needing to find a backdoor in the canonical implementation
parser in order to make it work.

In the protocol suite, it's been the strlen() < pktlen trick which
has generally worked.  Oh, and also sticking stuff after a fixed
length record, where we didn't care.

Oh, and send-pack/receive-pack protocol now has ".have" refs, which
work for C Git because the send-pack client was always calling
check_ref_format() on each thing sent by the server, and ".have"
isn't a valid ref name.  Why the hell the send-pack client was doing
that, I have no idea.  But, when the ref failed it was a silent
failure, so we were able to use ".have" for some new capability.

It also broke JGit, which wasn't doing this seemingly pointless
check_ref_format() and silently fail business.  Oh, and IIRC,
GitHub may have been burned around the same time somehow.

In packed-refs, Junio had a hard time adding the "peeled-refs"
support, because the first version of the parser was so strict.
But again, somehow he managed to find a backdoor in the old parser,
and that backdoor is why that file looks the way it does today.

In the loose object format, when we added new-style loose objects
we found a backdoor in the way libz deflate formats the first 2
bytes of the file... and encoded something that shouldn't appear
there to signal it was a new "pack style" loose object.

Pack index v2 uses a hole where old clients would barf on the
'\377tOc' followed by the version '2' not being monotonically
increasing.
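
To make the pack index trick concrete, here is a small Python sketch.  It
assumes, as far as I understand the format, that a v1 index begins directly
with a fan-out table of big-endian 32-bit counts that never decrease, and
that the signature bytes are \377tOc (0xff744f63).  The function names are
invented for the example.

```python
import struct

PACK_IDX_SIGNATURE = b"\377tOc"   # 0xff744f63

def pack_index_version(header: bytes) -> int:
    """Guess the pack index version from the first 8 bytes of the file."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes")
    if header[:4] == PACK_IDX_SIGNATURE:
        (version,) = struct.unpack(">I", header[4:8])
        return version
    # No signature: a v1 index starts with the fan-out table itself.
    return 1

def fanout_is_monotonic(header: bytes) -> bool:
    """The sanity check an old v1-only reader applies to the first two slots."""
    first, second = struct.unpack(">II", header[:8])
    return first <= second
```

Read as fan-out slots, the v2 header is 0xff744f63 followed by 2, which
decreases, so a v1-only reader rejects the file instead of misreading it.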

I think there's something like that in DIRC too, but that change
(to introduce the current DIRC format) may predate my involvement
with Git, so my memory isn't very good there.

> P.S. By the way, is pkt-line format original invention, or was it 
> 'borrowed' from some other standard or protocol?

No clue.  I find it f'king odd that the length is in hex.  There
isn't much value to the protocol being human readable.  The PACK
part of the stream sure as hell ain't.  You aren't going to type
out a sequence of "have" lines against the remote, like you could
with say an HTTP GET.  *shrug*

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 14:48               ` Shawn O. Pearce
@ 2009-06-03 15:07                 ` Shawn O. Pearce
  2009-06-03 15:39                   ` Jakub Narebski
  2009-06-03 16:51                 ` Jakub Narebski
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 15:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> > What options besides (required?) "host=<server>[:<port>]" are supported?
> 
> Currently only host is supported.  And yea, it takes the :<port> if
> the client included the port number in the URL (git://foo:8813/path).

Ok, I'm wrong.  It *doesn't* send the port.  The reason is obtuse,
but git_tcp_connect() clobbers the port number out of the host
name string, so that later when git_connect() sends this "host=%s",
only the host name is transmitted.

> Actually, I just realized JGit isn't compliant here.  It doesn't
> send the :<port> like C Git would.

So, actually JGit is compliant here.

> > Do I understand correctly that "host=<host>" information is required
> > for core.gitProxy to work, isn't it?

If core.gitProxy or GIT_PROXY_COMMAND are set, you can lie to the
remote git daemon about the host.  E.g.:

  $ cat proxy.sh
  #!/bin/sh
  exec nc git.kernel.org 9418

  GIT_PROXY_COMMAND=proxy.sh git ls-remote git://github.com/foo.git

During that kernel.org receives "\0host=github.com\0" host header,
which is not the name you connected to it as.  :-)

In practice I doubt anyone would do that, but, you can confuse
yourself.  I guess about equally as well as url.insteadof.  :-)

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 15:07                 ` Shawn O. Pearce
@ 2009-06-03 15:39                   ` Jakub Narebski
  2009-06-03 15:50                     ` Shawn O. Pearce
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 15:39 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git, Junio C Hamano

On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> "Shawn O. Pearce" <spearce@spearce.org> wrote:
> > Jakub Narebski <jnareb@gmail.com> wrote:

> > > What options besides (required?) "host=<server>[:<port>]" are supported?
> > 
> > Currently only host is supported.  And yea, it takes the :<port> if
> > the client included the port number in the URL (git://foo:8813/path).
> 
> Ok, I'm wrong.  It *doesn't* send the port.  The reason is obtuse,
> but git_tcp_connect() clobbers the port number out of the host

What about git_proxy_connect()? Does it clobber the port number too?

> name string, so that later when git_connect() sends this "host=%s",
> only the host name is transmitted.

Hmmm... so does that mean that in the following fragment of daemon.c
one branch is dead in practice?

  if (strncasecmp("host=", extra_args, 5) == 0) {
    val = extra_args + 5;
    vallen = strlen(val) + 1;
    if (*val) {
      /* Split <host>:<port> at colon. */
      char *host = val;
      char *port = strrchr(host, ':');
      if (port) {
        *port = 0;
        port++;
        free(tcp_port);
        tcp_port = xstrdup(port);
      }
      free(hostname);
      hostname = xstrdup_tolower(host);
    }

    /* On to the next one */
    extra_args = val + vallen;
  }


> > Actually, I just realized JGit isn't compliant here.  It doesn't
> > send the :<port> like C Git would.
> 
> So, actually JGit is compliant here.

Well, we can take the stance that C Git isn't compliant either ;-)

>  
> > > Do I understand correctly that "host=<host>" information is required
> > > for core.gitProxy to work, isn't it?
> 
> If core.gitProxy or GIT_PROXY_COMMAND are set, you can lie to the
> remote git daemon about the host.  E.g.:
> 
>   $ cat proxy.sh
>   #!/bin/sh
>   exec nc git.kernel.org 9418
> 
>   GIT_PROXY_COMMAND=proxy.sh git ls-remote git://github.com/foo.git
> 
> During that kernel.org receives "\0host=github.com\0" host header,
> which is not the name you connected to it as.  :-)
> 
> In practice I doubt anyone would do that, but, you can confuse
> yourself.  I guess about equally as well as url.insteadof.  :-)


A question: MUST a compliant implementation not fail on receiving
arguments it doesn't understand, e.g.:

   003bgit-upload-pack /project.git\0host=myserver.com\0user=me\0

or can it go hang the client, or silently fail?

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 15:39                   ` Jakub Narebski
@ 2009-06-03 15:50                     ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 15:50 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

Jakub Narebski <jnareb@gmail.com> wrote:
> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> > "Shawn O. Pearce" <spearce@spearce.org> wrote:
> > 
> > Ok, I'm wrong.  It *doesn't* send the port.  The reason is obtuse,
> > but git_tcp_connect() clobbers the port number out of the host
> 
> What about git_proxy_connect()? Does it clobber the port number too?

Dammit, not enough coffee.

We copy the string before the clobbering happens.  It *DOES* send
the port.  Either way, TCP or proxy.

JGit isn't compliant.  I'll send a patch soon.
 
> > name string, so that later when git_connect() sends this "host=%s",
> > only the host name is transmitted.
> 
> Hmmm... so does that mean that in the following fragment of daemon.c
> one branch is dead in practice?

No, it's valid... I misread the client code.

> A question: MUST a compliant implementation not fail on receiving
> arguments it doesn't understand, e.g.:
> 
>    003bgit-upload-pack /project.git\0host=myserver.com\0user=me\0
> 
> or can it go hang the client, or silently fail?

My understanding is that a compliant server MUST accept and ignore
anything the client sends if it doesn't recognize it.
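
A server-side sketch of that "accept and ignore" rule, loosely modelled on
the daemon.c fragment quoted earlier in the thread (case-insensitive "host="
key, split at the last colon, lowercased host name).  It is an illustration
in Python with an invented function name, not a port of the C code.

```python
def parse_extra_args(extra: bytes):
    """Parse NUL-separated "key=value" arguments that follow the path.

    Returns (host, port, ignored); host and port are None when absent.
    """
    host, port, ignored = None, None, []
    for arg in extra.split(b"\0"):
        if not arg:
            continue
        key, _, value = arg.partition(b"=")
        if key.lower() == b"host":
            if value:
                name, colon, maybe_port = value.rpartition(b":")
                if colon:
                    host, port = name.lower(), maybe_port
                else:
                    host = value.lower()
        else:
            # Unknown argument: remember it for logging, but never fail.
            ignored.append(arg)
    return host, port, ignored
```

With the example request above, "user=me" ends up in the ignored list and
the connection proceeds as if it had not been sent.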

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 14:48               ` Shawn O. Pearce
  2009-06-03 15:07                 ` Shawn O. Pearce
@ 2009-06-03 16:51                 ` Jakub Narebski
  2009-06-03 16:56                   ` Shawn O. Pearce
  2009-06-03 21:38                   ` Tony Finch
  2009-06-03 17:11                 ` Junio C Hamano
  2009-06-03 19:05                 ` Johannes Sixt
  3 siblings, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 16:51 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git, Junio C Hamano

On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

[..]
>>>> It is a bit of a pity that the git protocol was not created with
>>>> extendability (like capabilities) in mind...
>>> 
>>> Yes, no doubt.  There are many things I would have done differently,
>>> given that I now have 20/20 hindsight vision into the past's future.
>>> 
>>> :-)
>>> 
>>> The protocol (mostly) works fine as-is.  It's widely distributed in
>>> terms of clients using it on a daily basis.  It's likely to continue
>>> to serve our needs well into the future.  So, it is what it is.
>> 
>> I do wonder if existing Internet Standard (in the meaning of RFC) 
>> protocols also have such kludges and hacks...

I wonder if there are some BCP (Best Current Practice) RFCs for designing
protocols (and BCP documents for designing file formats). And which one
of RFC 2360, RFC 2424,... are applicable here.

> 
> I'm sure they have some... oddities.  But perhaps not as bad as git.
> 
> We have a history of not leaving ourselves room for future expansion,
> and then needing to find a backdoor in the canonical implementation
> parser in order to make it work.
> 
> In the protocol suite, it's been the strlen() < pktlen trick which
> has generally worked.  Oh, and also sticking stuff after a fixed
> length record, where we didn't care.

Magic number (magic sequence) identifying protocol / format plus
version number.  But it is good that we have capabilities now
(which is better than version number in this case, IMHO).

> 
> Oh, and send-pack/receive-pack protocol now has ".have" refs, which
> work for C Git because the send-pack client was always calling
> check_ref_format() on each thing sent by the server, and ".have"
> isn't a valid ref name.  Why the hell the send-pack client was doing
> that, I have no idea.  But, when the ref failed it was a silent
> failure, so we were able to use ".have" for some new capability.
> 
> It also broke JGit, which wasn't doing this seemingly pointless
> check_ref_format() and silently fail business.  Oh, and IIRC,
> GitHub may have been burned around the same time somehow.

What are those ".have" refs? They are not described in current version
of "Transfer Protocols" (sub)section in "The Community Book". I remember
that they were discussed on git mailing list, but I don't remember what
they were about...

> 
> In packed-refs, Junio had a hard time adding the "peeled-refs"
> support, because the first version of the parser was so strict.
> But again, somehow he managed to find a backdoor in the old parser,
> and that backdoor is why that file looks the way it does today.

I don't remember what that was about... Nevertheless now we have
kind of 'capabilities' section in .git/packed-refs

> 
> In the loose object format, when we added new-style loose objects
> we found a backdoor in the way libz deflate formats the first 2
> bytes of the file... and encoded something that shouldn't appear
> there to signal it was a new "pack style" loose object.
> 
> Pack index v2 uses a hole where old clients would barf on the
> '\377tOc' followed by the version '2' not being monotonically
> increasing.

Interesting... even more so that this problem of designing without
extendability in mind (magic number + version) is so persistent :-(

>> P.S. By the way, is pkt-line format original invention, or was it 
>> 'borrowed' from some other standard or protocol?
> 
> No clue.  I find it f'king odd that the length is in hex.  There
> isn't much value to the protocol being human readable.  The PACK
> part of the stream sure as hell ain't.  You aren't going to type
> out a sequence of "have" lines against the remote, like you could
> with say an HTTP GET.  *shrug*

Well... in theory you could... ;-)

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 16:51                 ` Jakub Narebski
@ 2009-06-03 16:56                   ` Shawn O. Pearce
  2009-06-03 20:19                     ` Jakub Narebski
  2009-06-06 16:33                     ` Scott Chacon
  2009-06-03 21:38                   ` Tony Finch
  1 sibling, 2 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 16:56 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

Jakub Narebski <jnareb@gmail.com> wrote:
> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> > Oh, and send-pack/receive-pack protocol now has ".have" refs [...]
> 
> What are those ".have" refs? They are not described in current version
> of "Transfer Protocols" (sub)section in "The Community Book". I remember
> that they were discussed on git mailing list, but I don't remember what
> they were about...

If the remote receiving repository has alternates, the ".have" refs are
the refs of the alternate repositories.  This signals to the client that
the server has these objects reachable, but the client isn't permitted
to send commands to alter these refs.

It's good for a site like GitHub or repo.or.cz where cheap forks are
implemented by creating a repository that points to a common shared
base via alternates.  The ".have" refs say that the server already
has everything in that common shared base, so the client doesn't
have to re-upload the entire project if the fork started out empty,
or had all refs deleted from it.
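
On the pushing client, handling ".have" might look roughly like this.  The
function name and the list-of-pairs input are assumptions for the sake of
the example.

```python
def split_advertisement(refs):
    """Split (object_id, name) pairs advertised by receive-pack.

    Returns (updatable, known_objects): real refs the client may send
    commands for, and every object id the server says it already has.
    """
    updatable, known_objects = {}, set()
    for object_id, name in refs:
        known_objects.add(object_id)       # server has it either way
        if name != ".have":
            updatable[name] = object_id    # only real refs can be updated
    return updatable, known_objects
```

The client then computes what to upload against known_objects, so objects
reachable only through the alternate are not sent again.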

> > In packed-refs, Junio had a hard time adding the "peeled-refs"
> > support, because the first version of the parser was so strict.
> > But again, somehow he managed to find a backdoor in the old parser,
> > and that backdoor is why that file looks the way it does today.
> 
> I don't remember what that was about... Nevertheless now we have
> kind of 'capabilities' section in .git/packed-refs

Sort of.  In a file format it's worse than a network protocol,
because the file can't alter its contents based on what the
reader can understand.

> Interesting... even more so that this problem of designing without
> extendability in mind (magic number + version) is so persistent :-(

I know.  I think we maybe have learned the lesson.  I don't know.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 14:48               ` Shawn O. Pearce
  2009-06-03 15:07                 ` Shawn O. Pearce
  2009-06-03 16:51                 ` Jakub Narebski
@ 2009-06-03 17:11                 ` Junio C Hamano
  2009-06-03 19:05                 ` Johannes Sixt
  3 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2009-06-03 17:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, Scott Chacon, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> I think there's something like that in DIRC too, but that change
> (to introduce the current DIRC format) may predate my involvement
> with Git, so my memory isn't very good there.

The index_extension part where the cache-tree is stored is another
example.  The length of the index is known from the mmap, the file
checksum is defined to appear at the end of the file, and the number of
entries is recorded in the file header, so there was a hole after that
many index entries where I could add a new section.

>> P.S. By the way, is pkt-line format original invention, or was it 
>> 'borrowed' from some other standard or protocol?
>
> No clue.  I find it f'king odd that the length is in hex.  There
> isn't much value to the protocol being human readable.  The PACK
> part of the stream sure as hell ain't.  You aren't going to type
> out a sequence of "have" lines against the remote, like you could
> with say an HTTP GET.  *shrug*

The text-ness made it easier to debug while I was developing the sideband
support.  I literally typed the pkt-line from the terminal ;-).


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 14:48               ` Shawn O. Pearce
                                   ` (2 preceding siblings ...)
  2009-06-03 17:11                 ` Junio C Hamano
@ 2009-06-03 19:05                 ` Johannes Sixt
  3 siblings, 0 replies; 66+ messages in thread
From: Johannes Sixt @ 2009-06-03 19:05 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, Scott Chacon, git, Junio C Hamano

On Wednesday, 3 June 2009, Shawn O. Pearce wrote:
> We have a history of not leaving ourselves room for future expansion,
> and then needing to find a backdoor in the canonical implementation
> parser in order to make it work.
>
> In the protocol suite, it's been the strlen() < pktlen trick which
> has generally worked.  Oh, and also sticking stuff after a fixed
> length record, where we didn't care.

This reminds me of one thing: upload-pack (of C git) sends a complete pack if
and only if there were no errors, so that fetch-pack sees an error if
upload-pack dies or if there is no side-band where upload-pack could signal
an error (at least I think those are the reasons). There is a comment in
upload-pack that explains a bit of it:

/* Data ready; we keep the last byte to ourselves
 * in case we detect broken rev-list, so that we
 * can leave the stream corrupted.  This is
 * unfortunate -- unpack-objects would happily
 * accept a valid packdata with trailing garbage,
 * so appending garbage after we pass all the
 * pack data is not good enough to signal
 * breakage to downstream.
 */
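
The idea in that comment can be modelled in a few lines of Python.  This is
a simplified sketch with invented names (relay_pack, producer_ok), not the
actual upload-pack logic: all data is forwarded except the most recent byte,
and that byte is released only once the producer is known to have succeeded.

```python
def relay_pack(chunks, producer_ok):
    """Yield pack data, withholding the final byte until success is known.

    chunks:      iterable of bytes produced by the pack generator.
    producer_ok: callable returning True if the generator finished cleanly.
    """
    held = b""
    for chunk in chunks:
        data = held + chunk
        if not data:
            continue
        held = data[-1:]          # always keep the last byte to ourselves
        if data[:-1]:
            yield data[:-1]
    if held and producer_ok():
        yield held                # only now is the stream made complete
```

On failure the receiver gets a pack that is one byte short, so its checksum
cannot match and the breakage is detected even without a side-band.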

-- Hannes


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 16:56                   ` Shawn O. Pearce
@ 2009-06-03 20:19                     ` Jakub Narebski
  2009-06-03 20:24                       ` Shawn O. Pearce
  2009-06-06 16:33                     ` Scott Chacon
  1 sibling, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 20:19 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git, Junio C Hamano

Shawn O. Pearce <spearce@spearce.org> wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:

>>> Oh, and send-pack/receive-pack protocol now has ".have" refs [...]
>> 
>> What are those ".have" refs? They are not described in current version
>> of "Transfer Protocols" (sub)section in "The Community Book". I remember
>> that they were discussed on git mailing list, but I don't remember what
>> they were about...
> 
> If the remote receiving repository has alternates, the ".have" refs are
> the refs of the alternate repositories.  This signals to the client that
> the server has these objects reachable, but the client isn't permitted
> to send commands to alter these refs.
> 
> It's good for a site like GitHub or repo.or.cz where cheap forks are
> implemented by creating a repository that points to a common shared
> base via alternates.  The ".have" refs say that the server already
> has everything in that common shared base, so the client doesn't
> have to re-upload the entire project if the fork started out empty,
> or had all refs deleted from it.

So the output (for fetch or clone) would look like this for repository
with alternates (shared object database):

  00887b68fcd777f94534f0b794c5dc2e109c49938395 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag\n
  0048a6afbb5c9618395ed28a299f0913e9e6df2058ef refs/heads/adaptive-filter\n
  004bbc60ab88b6899573fb545c6b4961f8ff3ce20695 refs/heads/filtered-to-window\n
  003f7b68fcd777f94534f0b794c5dc2e109c49938395 refs/heads/master\n
  0044226d09c3b5e16b5c1bd377aae9459cae3f778847 refs/heads/save-config\n
  0050dab192738152e1fa7233e06d941f9ada865c6e65 refs/tags/jnareb@gmail.com-gpg-pub\n
  00535812582f41a234828c8a2ec38047462979dc5dd8 refs/tags/jnareb@gmail.com-gpg-pub^{}\n
  003c50808bc27817eac090683e44fce4368fff39f9b2 refs/tags/v1.2\n
  0033b11cf09043f18b368ec0d988f064ea21247c843d .have\n
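
A minimal sketch of how the four-digit prefixes in the listing above can be
derived, assuming the pkt-line rule that the hex length counts the prefix
itself as well as the payload. The helper name pkt_line is hypothetical and
is not a git API.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 lowercase hex digits, then the payload.

    The length covers the 4-byte prefix as well as the payload.
    """
    length = len(payload) + 4
    if length > 0xFFFF:
        raise ValueError("payload too long for a 4-digit hex length")
    return b"%04x" % length + payload


# Ref line for refs/heads/master, copied from the listing above.
master = pkt_line(b"7b68fcd777f94534f0b794c5dc2e109c49938395 refs/heads/master\n")
print(master[:4].decode())
```

Run against the refs/heads/master line, this yields the same prefix as in
the listing.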

Does it matter for fetch, or is it important only for pushing?


BTW, is it that the "include-tag" capability MUST NOT (REQUIRED) be sent
if there are no tags (tag objects?), or just SHOULD NOT (RECOMMENDED), or
even MAY NOT (OPTIONAL)?  The GitHub server doesn't send it if there are
no tags...

-- 
Jakub Narebski
Poland

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 20:19                     ` Jakub Narebski
@ 2009-06-03 20:24                       ` Shawn O. Pearce
  2009-06-03 22:04                         ` Jakub Narebski
  2009-06-04  7:17                         ` Andreas Ericsson
  0 siblings, 2 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 20:24 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

Jakub Narebski <jnareb@gmail.com> wrote:
> 
> >>> Oh, and send-pack/receive-pack protocol now has ".have" refs [...]
> 
> So the output (for fetch or clone) would look like this for repository
> with alternates (shared object database):

No.  fetch/clone (aka fetch-pack/upload-pack protocol) does not use
the .have feature.

> Does it matter for fetch, or is it important only for pushing?

Because yea, it only matters for pushing.  Actually, in the case of
fetch, we shouldn't advertise what our alternate has, the client
should just fetch from the alternate.

In push it matters because the client wants to know what the remote
has, so it can trim the pack down to only the new objects, to reduce
transfer time.
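
A rough sketch of how a pushing client might treat ".have" entries, assuming
the advertisement has already been decoded into (sha1, name) pairs. The
function name and data layout are illustrative assumptions, not taken from
send-pack.

```python
def split_advertisement(lines):
    """Separate updatable refs from ".have" entries.

    lines: iterable of (sha1, name) pairs decoded from the advertisement.
    Returns (refs, known): refs maps ref name -> sha1 for refs the client
    may update; known is the set of every object id the server reports
    having, including those behind ".have" entries.
    """
    refs = {}
    known = set()
    for sha1, name in lines:
        known.add(sha1)
        if name != ".have":
            refs[name] = sha1
    return refs, known


advert = [
    ("7b68fcd777f94534f0b794c5dc2e109c49938395", "refs/heads/master"),
    ("b11cf09043f18b368ec0d988f064ea21247c843d", ".have"),
]
refs, known = split_advertisement(advert)
```

The client can then exclude everything reachable from `known` when building
the pack, while only offering update commands for the names in `refs`.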

> BTW, is it that the "include-tag" capability MUST NOT (REQUIRED) be sent
> if there are no tags (tag objects?), or just SHOULD NOT (RECOMMENDED), or
> even MAY NOT (OPTIONAL)?  The GitHub server doesn't send it if there are
> no tags...

Clients MAY always send include-tag, hardcoding it into a request.
The decision for a client to request include-tag only has to do
with the client's desires for tag data, whether or not a server
had advertised objects in the refs/tags/* namespace.

Clients SHOULD NOT send include-tag if remote.name.tagopt was set
to --no-tags, as the client doesn't want tag data.

Servers MUST accept include-tag without error or warning, even if the
server does not understand or support the option.

Servers SHOULD pack the tags if their referent is packed and the
client has requested include-tag.

Clients MUST be prepared for the case where a server has ignored
include-tag and has not actually sent tags in the pack.  In such
cases the client SHOULD issue a subsequent fetch to acquire the
tags that include-tag would have otherwise given the client.
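
One possible way for a client to decide which tags to ask for in that
follow-up fetch, sketched under two assumptions: the ref advertisement is
available as a name -> sha1 mapping that includes the peeled "^{}" entries,
and a have_object callback reports whether an object exists locally. Both
names are hypothetical, and reachability from refs is not checked here.

```python
def tag_objects_to_request(advertised, have_object):
    """Return sha1s of annotated tag objects worth a follow-up "want".

    A tag is requested when the object it points at is already present
    locally but the tag object itself is not.
    """
    wants = []
    for name, sha1 in advertised.items():
        if not name.startswith("refs/tags/") or name.endswith("^{}"):
            continue
        target = advertised.get(name + "^{}")
        if target is None:
            continue  # lightweight tag: there is no separate tag object
        if have_object(target) and not have_object(sha1):
            wants.append(sha1)
    return wants


advertised = {
    "refs/tags/jnareb@gmail.com-gpg-pub": "dab192738152e1fa7233e06d941f9ada865c6e65",
    "refs/tags/jnareb@gmail.com-gpg-pub^{}": "5812582f41a234828c8a2ec38047462979dc5dd8",
    "refs/tags/v1.2": "50808bc27817eac090683e44fce4368fff39f9b2",
}
local = {"5812582f41a234828c8a2ec38047462979dc5dd8"}
wants = tag_objects_to_request(advertised, local.__contains__)
```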

-- 
Shawn.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03  0:50         ` Jakub Narebski
  2009-06-03  1:29           ` Shawn O. Pearce
  2009-06-03  2:18           ` Robin H. Johnson
@ 2009-06-03 20:56           ` Tony Finch
  2009-06-03 21:20             ` Jakub Narebski
  2 siblings, 1 reply; 66+ messages in thread
From: Tony Finch @ 2009-06-03 20:56 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, Scott Chacon, git

On Wed, 3 Jun 2009, Jakub Narebski wrote:
>
>      HEXDIGIT = 0-9 / a-f
>
> Well, it should probably be spelled in full. Probably, because I have
> no experience with using ABNF... and didn't do my research :-)

The ABNF core rules include a definition for HEXDIG. See appendix B of
RFC 5234.

> (should HEXDIGIT use lowercase a-f, or can it use uppercase A-F?)

Double-quoted strings in ABNF are case-insensitive ASCII, so the HEXDIG
rule accepts both. You need to use %x61 if you want a but not A.

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 20:56           ` Tony Finch
@ 2009-06-03 21:20             ` Jakub Narebski
  2009-06-03 21:53               ` Tony Finch
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 21:20 UTC (permalink / raw)
  To: Tony Finch; +Cc: Shawn O. Pearce, Scott Chacon, git

On Wed, 3 Jun 2009, Tony Finch wrote:
> On Wed, 3 Jun 2009, Jakub Narebski wrote:
> >
> >      HEXDIGIT = 0-9 / a-f
> >
> > Well, it should probably be spelled in full. Probably, because I have
> > no experience with using ABNF... and didn't do my research :-)
> 
> The ABNF core rules include a definition for HEXDIG. See appendix B of
> RFC 5234.

I found it in the Wikipedia article (which is shorter than RFC 5234)
_after_ sending this post, but thanks anyway.

> > (should HEXDIGIT use lowercase a-f, or can it use uppercase A-F?)
> 
> Double-quoted strings in ABNF are case-insensitive ASCII, so the HEXDIG
> rule accepts both. You need to use %x61 if you want a but not A.

        HEXDIG-LC = DIGIT / %x61-66  ; 0-9a-f, case sensitive

Actually git accepts both lowercase and uppercase in HEXDIG (at least
for pkt-length), but it prefers lowercase.
-- 
Jakub Narebski
Poland

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 16:51                 ` Jakub Narebski
  2009-06-03 16:56                   ` Shawn O. Pearce
@ 2009-06-03 21:38                   ` Tony Finch
  1 sibling, 0 replies; 66+ messages in thread
From: Tony Finch @ 2009-06-03 21:38 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, Scott Chacon, git, Junio C Hamano

On Wed, 3 Jun 2009, Jakub Narebski wrote:
> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> > Jakub Narebski <jnareb@gmail.com> wrote:
> >>
> >> I do wonder if existing Internet Standard (in the meaning of RFC)
> >> protocols also have such kludges and hacks...
> >
> > I'm sure they have some... oddities.  But perhaps not as bad as git.

One small example is the way DNS was extended to support extra flags and
response codes (and other features). EDNS is signalled using an OPT
pseudo-RR, which is basically the same technique as git's .have refs.

There are a couple of examples in RFC 822 / MIME headers: RFC 2047 (for
encoding character set information in subject and address headers) and RFC
2231 (the same job but for attachment filenames etc.). In practice common
software uses 2047 syntax for both purposes :-/

Mostly Internet protocols have grown generic extension frameworks fairly
early in their lives, so syntactic hacks are rare.

> I wonder if there are some BCP (Best Current Practice) RFCs for designing
> protocols (and BCP documents for designing file formats). And which of
> RFC 2360, RFC 2424,... are applicable here.

As far as I know protocol design is pretty much folklore.

> Magic number (magic sequence) identifying protocol / format plus
> version number.  But it is good that we have capabilities now
> (which is better than version number in this case, IMHO).

Yes, capabilities are a good design pattern.

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 21:20             ` Jakub Narebski
@ 2009-06-03 21:53               ` Tony Finch
  2009-06-04  8:45                 ` Jakub Narebski
  0 siblings, 1 reply; 66+ messages in thread
From: Tony Finch @ 2009-06-03 21:53 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, Scott Chacon, git

On Wed, 3 Jun 2009, Jakub Narebski wrote:
>
> Actually git accepts both lowercase and uppercase in HEXDIG (at least
> for pkt-length), but it prefers lowercase.

You should ensure that all hex digit strings follow the same rule.
Are SHA-1 object names case insensitive too?

Case insensitivity has a history of being awkward. SMTP has always had
case-insensitive commands, though the RFCs have always written them in
upper case and implementations have pretty much all emitted them in upper
case. See http://tools.ietf.org/html/rfc5321#section-2.4 especially the
caveat about broken case-sensitive implementations.

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 20:24                       ` Shawn O. Pearce
@ 2009-06-03 22:04                         ` Jakub Narebski
  2009-06-03 22:04                           ` Shawn O. Pearce
  2009-06-03 22:16                           ` Junio C Hamano
  2009-06-04  7:17                         ` Andreas Ericsson
  1 sibling, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 22:04 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git, Junio C Hamano

On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

> > BTW, is it that the "include-tag" capability MUST NOT (REQUIRED) be sent
> > if there are no tags (tag objects?), or just SHOULD NOT (RECOMMENDED), or
> > even MAY NOT (OPTIONAL)?  The GitHub server doesn't send it if there are
> > no tags...
> 
> Clients MAY always send include-tag, hardcoding it into a request.
> The decision for a client to request include-tag only has to do
> with the client's desires for tag data, whether or not a server
> had advertised objects in the refs/tags/* namespace.
> 
> Clients SHOULD NOT send include-tag if remote.name.tagopt was set
> to --no-tags, as the client doesn't want tag data.
> 
> Servers MUST accept include-tag without error or warning, even if the
> server does not understand or support the option.
> 
> Servers SHOULD pack the tags if their referent is packed and the
> client has requested include-tag.
> 
> Clients MUST be prepared for the case where a server has ignored
> include-tag and has not actually sent tags in the pack.  In such
> cases the client SHOULD issue a subsequent fetch to acquire the
> tags that include-tag would have otherwise given the client.

So do I understand correctly that capabilities are governed by the
following generic rules:

1. Server sends a space-separated list of capabilities it supports. It
   MUST NOT send capabilities it *does not* support. It MAY NOT send
   "include-tag" if there are no tag objects (or is it SHOULD NOT?).
2. Client sends a space-separated list of capabilities it wants. It SHOULD
   (or perhaps it is MAY?) send a subset of server capabilities, i.e. not
   send capabilities the server does not advertise.
3. Server MUST ignore capabilities it does not understand. Server MUST
   NOT ignore capabilities (or SHOULD NOT only?) that client requested
   and server advertised.

I know that the client MUST send at most one of "side-band" and
"side-band-64k", but how should the server react if the client sends both?
Should it use "side-band-64k"?

-- 
Jakub Narebski
Poland

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 22:04                         ` Jakub Narebski
@ 2009-06-03 22:04                           ` Shawn O. Pearce
  2009-06-03 22:16                           ` Junio C Hamano
  1 sibling, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-03 22:04 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Junio C Hamano

Jakub Narebski <jnareb@gmail.com> wrote:
> So do I understand correctly that capabilities are governed by the
> following generic rules:
> 
> 1. Server sends a space-separated list of capabilities it supports. It
>    MUST NOT send capabilities it *does not* support. It MAY NOT send
>    "include-tag" if there are no tag objects (or is it SHOULD NOT?).

The server SHOULD send include-tag, if it supports it, regardless
of whether or not there are tags available.  It's just easier to
code the server to send the R@!* line up front based on the software
version, and not the repository content.

> 2. Client sends a space-separated list of capabilities it wants. It SHOULD
>    (or perhaps it is MAY?) send a subset of server capabilities, i.e. not
>    send capabilities the server does not advertise.

It SHOULD send a subset of server capabilities.

> 3. Server MUST ignore capabilities it does not understand.

True.

> Server MUST NOT ignore capabilities (or SHOULD NOT only?) that
> client requested and server advertised.

True, MUST NOT.  Otherwise you will have protocol errors.

However, include-tag can be SHOULD NOT... since the client must be
able to recover from it anyway.

> I know that the client MUST send at most one of "side-band" and
> "side-band-64k", but how should the server react if the client sends both?
> Should it use "side-band-64k"?

MUST favor side-band-64k if client requests both.
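
The server-side rules above could be sketched roughly as follows. The
SUPPORTED set and the function name are assumptions made for illustration;
this is not how upload-pack is actually structured.

```python
SUPPORTED = frozenset({
    "multi_ack", "thin-pack", "side-band", "side-band-64k",
    "ofs-delta", "shallow", "no-progress", "include-tag",
})


def effective_capabilities(requested: str, supported=SUPPORTED) -> set:
    """Pick the capabilities a server honours from a client's request."""
    wanted = set(requested.split())
    # Capabilities the server does not understand are silently ignored.
    active = wanted & supported
    # If the client asked for both side-band variants, favour side-band-64k.
    if {"side-band", "side-band-64k"} <= active:
        active.discard("side-band")
    return active


print(sorted(effective_capabilities("side-band side-band-64k bogus ofs-delta")))
```

Capability names are compared exactly as sent, with no case folding.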

-- 
Shawn.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 22:04                         ` Jakub Narebski
  2009-06-03 22:04                           ` Shawn O. Pearce
@ 2009-06-03 22:16                           ` Junio C Hamano
  2009-06-03 22:46                             ` Jakub Narebski
  1 sibling, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2009-06-03 22:16 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> writes:

> 1. Server sends a space-separated list of capabilities it supports. It
>    MUST NOT send capabilities it *does not* support. It MAY NOT send
>    "include-tag" if there are no tag objects (or is it SHOULD NOT?).

I doubt RFC 2119 lingo would include MAY NOT, as it is ambiguous
especially to non-native speakers (like me).  You meant to say "MAY omit
sending", perhaps, but in general capabilities are what you _can_ do at the
protocol level, and in my opinion, you shouldn't have to check if a
particular repository you (as a program with given set of features
implemented) happen to be looking at has tags in order to decide what
capabilities to advertise.

> 2. Client sends a space-separated list of capabilities it wants. It SHOULD
>    (or perhaps it is MAY?) send a subset of server capabilities, i.e. not
>    send capabilities the server does not advertise.

I'd say "the client SHOULD NOT ask for capabilities the server did not say
it supports".

> 3. Server MUST ignore capabilities it does not understand. Server MUST
>    NOT ignore capabilities (or SHOULD NOT only?) that client requested
>    and server advertised.

I know unrecognized capability requests are silently ignored, but I
consider that as a sloppy/practical programming, and not a specification.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 22:16                           ` Junio C Hamano
@ 2009-06-03 22:46                             ` Jakub Narebski
  0 siblings, 0 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-03 22:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, Scott Chacon, git

Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > 1. Server sends a space-separated list of capabilities it supports. It
> >    MUST NOT send capabilities it *does not* support. It MAY NOT send
> >    "include-tag" if there are no tag objects (or is it SHOULD NOT?).
> 
> I doubt RFC 2119 lingo would include MAY NOT, as it is ambiguous
> especially to non-native speakers (like me).  

You are right, RFC 2119 does not include MAY NOT.

> You meant to say "MAY omit 
> sending", perhaps, but in general capabilities are what you _can_ do at the
> protocol level, and in my opinion, you shouldn't have to check if a
> particular repository you (as a program with given set of features
> implemented) happen to be looking at has tags in order to decide what
> capabilities to advertise.

I wonder why in http://book.git-scm.com/7_transfer_protocols.html 
("Git Community Book", chapter "7. Internals and Plumbing", section
 "Transfer Protocols", subsection "Fetching Data with Upload Pack")
"include-tag" is not included ;) in advertised server capabilities.
Because github's git-daemon advertises it even if there are no
tags present:

  $ echo -e -n "0039git-upload-pack /schacon/gitbook.git\0host=github.com\0" | 
    nc -v github.com 9418
  Connection to github.com 9418 port [tcp/*] succeeded!
  00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag
  00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration
  003f7217a7c7e582c46cec22a130adf4b9d7d950fba0 refs/heads/master
  003edc9d991bc43cb04e692efc793f885eb4ff7fda98 refs/heads/pt_BR
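
As a small illustration of the first advertised line above, here is one way
to split it into the object name, the ref name, and the capability list
hidden behind the NUL. The function name is hypothetical, and the payload is
assumed to have had its 4-byte length prefix removed already.

```python
def parse_first_ref_line(payload: bytes):
    """Split the first advertised ref line into (sha1, refname, capabilities)."""
    ref_part, _, cap_part = payload.rstrip(b"\n").partition(b"\0")
    sha1, _, name = ref_part.partition(b" ")
    caps = cap_part.decode("ascii").split()
    return sha1.decode("ascii"), name.decode("ascii"), caps


line = (b"7217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0"
        b"multi_ack thin-pack side-band side-band-64k ofs-delta "
        b"shallow no-progress include-tag\n")
sha1, name, caps = parse_first_ref_line(line)
print(name, "include-tag" in caps)
```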

> > 2. Client sends a space-separated list of capabilities it wants. It SHOULD
> >    (or perhaps it is MAY?) send a subset of server capabilities, i.e. not
> >    send capabilities the server does not advertise.
> 
> I'd say "the client SHOULD NOT ask for capabilities the server did not say
> it supports".

I agree that it is better formulation (phrasing).

> 
> > 3. Server MUST ignore capabilities it does not understand. Server MUST
> >    NOT ignore capabilities (or SHOULD NOT only?) that client requested
> >    and server advertised.
> 
> I know unrecognized capability requests are silently ignored, but I
> consider that as a sloppy/practical programming, and not a specification.

Well, the whole 'be strict in what you send, and liberal in what you
accept' principle unfortunately leads to accepting sloppy programming
and coding.

Nevertheless it is I guess better to silently ignore unknown
capabilities requested by client than fail, isn't it?

-- 
Jakub Narebski
Poland

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 20:24                       ` Shawn O. Pearce
  2009-06-03 22:04                         ` Jakub Narebski
@ 2009-06-04  7:17                         ` Andreas Ericsson
  2009-06-04  7:26                           ` Junio C Hamano
  1 sibling, 1 reply; 66+ messages in thread
From: Andreas Ericsson @ 2009-06-04  7:17 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, Scott Chacon, git, Junio C Hamano

Shawn O. Pearce wrote:
> 
> Servers MUST accept include-tag without error or warning, even if the
> server does not understand or support the option.
> 
> Servers SHOULD pack the tags if their referent is packed and the
> client has requested include-tag.
> 
> Clients MUST be prepared for the case where a server has ignored
> include-tag and has not actually sent tags in the pack.  In such
> cases the client SHOULD issue a subsequent fetch to acquire the
> tags that include-tag would have otherwise given the client.
> 

How is "no tags present" signalled? Without such a signal, the client
must always issue a subsequent request every time there are no tags
embedded in the received pack, as it can't know if the server ignored
the option silently or if there just aren't any new tags.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

* Re: Request for detailed documentation of git pack protocol
  2009-06-04  7:17                         ` Andreas Ericsson
@ 2009-06-04  7:26                           ` Junio C Hamano
  0 siblings, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2009-06-04  7:26 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Shawn O. Pearce, Jakub Narebski, Scott Chacon, git,
	Junio C Hamano

Andreas Ericsson <ae@op5.se> writes:

> How is "no tags present" signalled? Without such a signal, the client
> must always issue a subsequent request every time there are no tags
> embedded in the received pack, as it can't know if the server ignored
> the option silently or if there just aren't any new tags.

The fetcher first learns the set of tags and what objects they point at.
That's in the first part of the upload-pack protocol.

Of course, if there is no tag, you won't see them advertised, so you can
know.

After finishing the main part of the object transfer, if some of the
objects that are pointed at by tags are present (and reachable from refs),
but the tags that point at them do not exist yet, that is a sign that the
uploader didn't give you these tag objects.  Then you can turn around to
request those tags by initiating another exchange, and ask for them with
"want" lines.

* Re: Request for detailed documentation of git pack protocol
  2009-06-03 21:53               ` Tony Finch
@ 2009-06-04  8:45                 ` Jakub Narebski
  2009-06-04 11:41                   ` Tony Finch
  2009-06-04 18:41                   ` Shawn O. Pearce
  0 siblings, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-04  8:45 UTC (permalink / raw)
  To: Tony Finch; +Cc: Shawn O. Pearce, Scott Chacon, git

On Wednesday, 3 June 2009 23:53, Tony Finch wrote:
> On Wed, 3 Jun 2009, Jakub Narebski wrote:
> >
> > Actually git accepts both lowercase and uppercase in HEXDIG (at least
> > for pkt-length), but it prefers lowercase.
> 
> You should ensure that all hex digit strings follow the same rule.
> Are SHA-1 object names case insensitive too?
> 
> Case insensitivity has a history of being awkward. SMTP has always had
> case-insensitive commands, though the RFCs have always written them in
> upper case and implementations have pretty much all emitted them in upper
> case. See http://tools.ietf.org/html/rfc5321#section-2.4 especially the
> caveat about broken case-sensitive implementations.

There should be no problem with pkt-length being case insensitive, as
standard conversion routines (strtol, sprintf) accept 0-9a-fA-F for 
base 16 / hexadecimal conversion.  The requirement here is that client
and server SHOULD use lowercase, but MUST accept mixed case (do case
insensitive parsing of hex4).
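
A minimal sketch of that requirement: write the length in lowercase, accept
either case when reading. The function names are illustrative only. The
explicit digit check is there because Python's int() would otherwise also
accept forms such as "0x3f" or " 3f ".

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def parse_pkt_length(prefix: bytes) -> int:
    """Decode a 4-byte pkt-line length, accepting a-f as well as A-F."""
    if len(prefix) != 4 or any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("malformed pkt-line length: %r" % prefix)
    return int(prefix.decode("ascii"), 16)


def format_pkt_length(length: int) -> bytes:
    """Encode a pkt-line length as 4 lowercase hex digits."""
    if not 0 <= length <= 0xFFFF:
        raise ValueError("length out of range: %d" % length)
    return b"%04x" % length


print(parse_pkt_length(b"003F"), format_pkt_length(63).decode())
```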

I think SHA-1 is lowercased, so mixed case should work there. Well, at
least "git show 6096D7" (note the uppercase 'D' in shortened SHA-1 name)
works as expected.

But I do not know what the protocol requirements are, or what they should be.
Should SHA-1 use lowercase, or be case insensitive? Should commands such
as "have", "want", "done" use lower case or be case insensitive? Should
status indicators "ACK" and "NAK" be upper case, or should be case
insensitive? Should capabilities be case sensitive, and should they be
compared case sensitive or not?

-- 
Jakub Narebski
Poland

* Re: Request for detailed documentation of git pack protocol
  2009-06-04  8:45                 ` Jakub Narebski
@ 2009-06-04 11:41                   ` Tony Finch
  2009-06-04 18:41                   ` Shawn O. Pearce
  1 sibling, 0 replies; 66+ messages in thread
From: Tony Finch @ 2009-06-04 11:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Shawn O. Pearce, Scott Chacon, git

On Thu, 4 Jun 2009, Jakub Narebski wrote:
>
> But I do not know what the protocol requirements are, or what they should be.
> Should SHA-1 use lowercase, or be case insensitive? Should commands such
> as "have", "want", "done" use lower case or be case insensitive? Should
> status indicators "ACK" and "NAK" be upper case, or should be case
> insensitive? Should capabilities be case sensitive, and should they be
> compared case sensitive or not?

I think the current (rough) consensus is that case insensitivity causes
pain unless its scope is carefully controlled. I18n causes a lot of the
difficulties.

One way in which you can control the scope is by limiting
case-insensitivity to protocol elements that must be ASCII (commands,
replies, SHA-1 hashes). But I'm not sure there's any benefit to making
the protocol case insensitive, especially when it isn't possible to
type it manually.

I've already given one example of interoperability problems in SMTP
arising from case insensitivity. In the opposite direction, Unix and
XML are good examples of case sensitivity working well in practice.

I have to say I spend all my time working with old-school case insensitive
protocols, and they have clearly been extremely successful, so it's
tempting to copy them. But I think that will lead to ugliness - have a
look through the HTTP spec for its mish-mash of sensitive and insensitive
protocol elements.

In the specific instance of the pkt-length, if all current implementations
write the length in lower case, you can say in the spec it MUST be lower
case. If you do that then the same requirement can apply to both the
client and the server which makes the spec shorter and simpler. Postel's
robustness principle suggests that it doesn't really matter if the parser
treats it case-insensitively.

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

* Re: Request for detailed documentation of git pack protocol
  2009-06-04  8:45                 ` Jakub Narebski
  2009-06-04 11:41                   ` Tony Finch
@ 2009-06-04 18:41                   ` Shawn O. Pearce
  1 sibling, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-04 18:41 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Tony Finch, Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> wrote:
> On Wednesday, 3 June 2009 23:53, Tony Finch wrote:
> > On Wed, 3 Jun 2009, Jakub Narebski wrote:
> > >
> > > Actually git accepts both lowercase and uppercase in HEXDIG (at least
> > > for pkt-length), but it prefers lowercase.
> > 
> > You should ensure that all hex digit strings follow the same rule.
> > Are SHA-1 object names case insensitive too?
> > 
> > Case insensitivity has a history of being awkward. SMTP has always had
> > case-insensitive commands, though the RFCs have always written them in
> > upper case and implementations have pretty much all emitted them in upper
> > case. See http://tools.ietf.org/html/rfc5321#section-2.4 especially the
> > caveat about broken case-sensitive implementations.
> 
> There should be no problem with pkt-length being case insensitive, as
> standard conversion routines (strtol, sprintf) accept 0-9a-fA-F for 
> base 16 / hexadecimal conversion.  The requirement here is that client
> and server SHOULD use lowercase, but MUST accept mixed case (do case
> insensitive parsing of hex4).

ACK.  This is what C Git does today.  JGit sends lower case, but
is wrong by only accepting lowercase.  I will patch it today to
accept mixed case.
 
> I think SHA-1 is lowercased, so mixed case should work there. Well, at
> least "git show 6096D7" (note the uppercase 'D' in shortened SHA-1 name)
> works as expected.

ACK.  Mixed case SHA-1 MUST be accepted, but lower case SHOULD
be output.
 
> But I do not know what the protocol requirements are, or what they should be.
> Should SHA-1 use lowercase, or be case insensitive?

SHA-1 SHOULD be lowercase (a-f), MUST accept a-f or A-F.

> Should commands such as "have", "want", "done" use lower case or
> be case insensitive?

These MUST be lowercase.

> Should status indicators "ACK" and "NAK" be upper case,

These MUST be uppercase.  Though "ACK %s continue" MUST be mixed
case, as I just wrote it.

> Should capabilities be case sensitive, and should they be
> compared case sensitive or not?

No, they are case sensitive.  

Why?  All of the above is the current C code implementation.
We have to follow what the code does today, and it does case
sensitive compares almost everywhere... except in the SHA-1 parsing.
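
The case rules listed above might be applied by a parser along these lines.
This is only a sketch of the stated rules (lowercase commands, mixed-case
SHA-1 accepted and lowercased on output); parse_command is a hypothetical
helper, not a function from the C code.

```python
import re

# Mixed-case hex is accepted on input.
SHA1_RE = re.compile(r"[0-9a-fA-F]{40}")


def parse_command(line: str):
    """Parse 'want <sha1>', 'have <sha1>' or 'done' from a decoded pkt-line."""
    line = line.rstrip("\n")
    if line == "done":
        return ("done", None)
    verb, _, rest = line.partition(" ")
    if verb not in ("want", "have"):      # case-sensitive compare
        raise ValueError("unknown command: %r" % verb)
    sha1 = rest.split(" ", 1)[0]          # anything after the sha1 is ignored here
    if not SHA1_RE.fullmatch(sha1):
        raise ValueError("bad object name: %r" % sha1)
    return (verb, sha1.lower())           # normalise to lowercase for output


print(parse_command("want 74730D410FCB6603ACE96F1DC55EA6196122532D\n"))
```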

-- 
Shawn.

* Re: Request for detailed documentation of git pack protocol
  2009-06-02 21:39     ` Jakub Narebski
  2009-06-02 23:27       ` Shawn O. Pearce
  2009-06-03 12:29       ` Jakub Narebski
@ 2009-06-04 20:55       ` Jakub Narebski
  2009-06-04 21:57         ` Shawn O. Pearce
  2009-06-05  0:45         ` Shawn O. Pearce
  2009-06-06 21:38       ` Comments pack protocol description in "Git Community Book" (second round) Jakub Narebski
  3 siblings, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-04 20:55 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, git, Linus Torvalds

This is a combined response to various messages in this thread, following
my discoveries made using a simple Perl script (using IO::Socket) which
assumes the role of a git client, tested against github.com (IIRC it uses
a Ruby implementation) and git.kernel.org (C Git), and "nc -l 9418".

By the way, is there some publicly accessible JGit (Java) and Dulwich
(Python) git-daemon one can test against?

  sp = Shawn O. Pearce
  jn = Jakub Narebski
  gb = Git Community Book (http://book.git-scm.com)


jn>> I meant that in the request line for fetching via git:// protocol
jn>>
jn>>       0033git-upload-pack /project.git\\000host=myserver.com\\000
jn>>
jn>> you separate path to repository from extra options using "\0" / NUL
jn>> as a separator. Well, this is only sane separator, as it is path
jn>> terminator, the only character which cannot appear in pathname
jn>> (although I do wonder whether project names with e.g. control
jn>> characters or UTF-8 characters would work correctly).
sp>
sp> No, that isn't the reason '\0' is used here.  But yea, that is true.
sp>
sp> The reason \0 is used is, git-daemon reads the 4 byte length, decodes
sp> that, then reads that many bytes.  Finally it writes a '\0' at the
sp> end of what it read, so that the entire "line" is NUL terminated.
sp> Then it reads the "command path" part from the resulting C string.
sp>
sp> The host=myserver.com part came later, after many daemons were
sp> already running all over the world.  By hiding it behind the '\0'
sp> an old daemon would never see it (but strlen() returned a value that
sp> was less than the length read, but the old daemons didn't care).
sp> Newer daemons look for where strlen() < length, and assume that
sp> the host header follows.
sp>
sp> The host header ends with '\0' in case additional headers would
sp> also appear here in the future.  IOW, like HTTP allows new headers
sp> to be added before the "\r\n\r\n" terminator at the body, we allow
sp> them between "\0".
[...]

sp> The NUL at the end of the host name is not strictly required, but
sp> must be present if the client were to ever pass additional options
sp> to the server.
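
To make the layout of the request line concrete, here is a sketch that
builds and parses it: command and path first, then NUL-terminated key=value
items. build_request and parse_request are hypothetical names, and treating
every item after the first NUL as key=value is an assumption drawn from the
description above, not from git-daemon's source.

```python
def build_request(command: str, path: str, **params) -> bytes:
    """Build a git:// request pkt-line with NUL-terminated extra parameters."""
    body = ("%s %s" % (command, path)).encode("ascii") + b"\0"
    for key, value in params.items():
        body += ("%s=%s" % (key, value)).encode("ascii") + b"\0"
    return b"%04x" % (len(body) + 4) + body


def parse_request(packet: bytes):
    """Return (command, path, params) from a request pkt-line."""
    length = int(packet[:4].decode("ascii"), 16)
    first, *extra = packet[4:length].split(b"\0")
    command, _, path = first.partition(b" ")
    params = {}
    for item in extra:
        if b"=" in item:                  # skip the empty tail after the last NUL
            key, value = item.split(b"=", 1)
            params[key.decode("ascii")] = value.decode("ascii")
    return command.decode("ascii"), path.decode("ascii"), params


request = build_request("git-upload-pack", "/project.git", host="myserver.com")
print(parse_request(request))
```

An old-style daemon that stops at the first NUL would see only the command
and path part of such a packet.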

Actually both git.kernel.org and github.com failed (deadlocked / hung)
when I tried to add extra key=value parameter at the end of request:

  003bgit-upload-pack /project.git\0host=myserver.com\0user=me\0

Hmmmm...


jn>> Hmmm... the communication between server and client is not entirely
jn>> clean. Do I understand correctly that this NAK is response to
jn>> clients flush after all those "want" lines?
sp>
sp> Yes.
sp>
jn>> And that "0009done" from client
jn>> tells server that it should send everything it has?
sp>
sp> Yes.  It means the client will not issue any more "have" lines,
sp> as it has nothing further in its history, so the server just has
sp> to give up and start generating a pack based on what it knows.

Here we were talking about the following part of exchange: 
(I have added "C:" prefix to signal that this is what client, 
git-clone here, sends; I have added also explicit "\n" to mark LF
characters terminating lines, and put each pkt-line on separate line)

gb>  C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\n
gb>  C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
gb>  C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
gb>  C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
gb>  C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n
gb>  C: 0000
gb>  C: 0009done\n

and where server response is (again the quote from "Git Community Book"
was modified, removing here doublequotes and doubling of backslashes):

gb>  S: 0008NAK\n
gb>  S: 0023\002Counting objects: 2797, done.\n
gb>  [...]
gb>  S: 2004\001PACK\000\000\000\002 [...]

I thought that after sending the "0000" flush line the client could wait
for a NAK or ACK response from the server... but that is not the case.
When I tried to read from the server after the "0000" flush and before
"0009done\n", my client (or netcat instance) hung waiting for the
server's response.  Either I made a mistake in my fake client, or I don't
understand the git pack protocol correctly.  Should the client wait for NAK
or ACK from the server _only_ after sending the maximum number of want/have
lines (256, if I remember correctly)?

When I stopped sending the "0000" flush line, my fake client again hung
(deadlocked?) waiting for the server.


jn>> P.S. By the way, is pkt-line format original invention, or was it 
jn>> 'borrowed' from some other standard or protocol?
sp>
sp> No clue.  I find it f'king odd that the length is in hex.  There
sp> isn't much value to the protocol being human readable.  The PACK
sp> part of the stream sure as hell ain't.  You aren't going to type
sp> out a sequence of "have" lines against the remote, like you could
sp> with say an HTTP GET.  *shrug*

"git gui blame pkt-line.c" shows that the pkt-line format is Linus's invention.

It looks quite a bit like the 'chunked' transfer encoding[1] in HTTP; there
each non-empty chunk starts with the number of octets of data it
embeds (size written in hexadecimal) followed by a CRLF (carriage return
and linefeed), and then the data itself. The chunk is then closed with a CRLF.
In some implementations, space characters (0x20) are padded between the
chunk-size and the CRLF.  In the pkt-line format the number of octets has a
fixed width (4 hexadecimal digits, 0-padded), and we do not use CRLF as the
terminator of the chunk/packet length or of the chunk/packet itself.

In the HTTP 'chunked' transfer encoding the last chunk is a single line,
simply made of the chunk-size (0).  In the pkt-line format we use the special
size of "0000" for a flush packet.
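
To make the comparison concrete, here is a small Python sketch that
frames the same payload both ways.  The function names http_chunk and
pkt_line are just labels chosen for this illustration.  One difference
worth noticing is that the chunk size counts only the data, while the
pkt-line length also counts its own four header bytes.

```python
def http_chunk(data: bytes) -> bytes:
    """HTTP chunked framing: variable-width hex size, CRLF, data, CRLF."""
    return b"%x\r\n" % len(data) + data + b"\r\n"


def pkt_line(data: bytes) -> bytes:
    """pkt-line framing: fixed 4 hex digits counting header plus data, no CRLF."""
    return b"%04x" % (len(data) + 4) + data


HTTP_LAST_CHUNK_LINE = b"0\r\n"  # real HTTP follows this with optional trailers and a CRLF
PKT_FLUSH = b"0000"              # the special flush packet

print(http_chunk(b"done\n"))
print(pkt_line(b"done\n"))
```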

[1] http://en.wikipedia.org/wiki/Chunked_transfer_encoding

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-04 20:55       ` Jakub Narebski
@ 2009-06-04 21:57         ` Shawn O. Pearce
  2009-06-05  0:45         ` Shawn O. Pearce
  1 sibling, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-04 21:57 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Linus Torvalds

Jakub Narebski <jnareb@gmail.com> wrote:
> 
> By the way, is there some publicly accessible JGit (Java) and Dulwich
> (Python) git-daemon one can test against?

There is a public JGit SSH based daemon running at
review.source.android.com, on port 29418.  But no
public git-daemon that I know of.

You can easily set up a JGit daemon yourself, assuming you have a
'javac' available:

  git clone git://repo.or.cz/egit.git jgit
  cd jgit
  ./make_jgit.sh
  ./jgit daemon --export-all . &
  git ls-remote git://localhost/.git

> Actually both git.kernel.org and github.com failed (deadlocked / hung)
> when I tried to add extra key=value parameter at the end of request:
> 
>   003bgit-upload-pack /project.git\0host=myserver.com\0user=me\0

JGit does it fine.  I retested locally with this:

 perl -e '$s="git-upload-pack $ARGV[0]\0host=myserver.com\0user=me\0";
   printf "%4.4x%s",4+length $s,$s
   ' /.git | nc localhost 9418

But yea, repo.or.cz hung.  I see the bug in git daemon, I'll post
a patch in a second.  Don't do that test anymore, anywhere.

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-04 20:55       ` Jakub Narebski
  2009-06-04 21:57         ` Shawn O. Pearce
@ 2009-06-05  0:45         ` Shawn O. Pearce
  2009-06-05  7:24           ` Jakub Narebski
  1 sibling, 1 reply; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-05  0:45 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Scott Chacon, git, Linus Torvalds

Jakub Narebski <jnareb@gmail.com> wrote:
> 
> Here we were talking about the following part of exchange: 
> (I have added "C:" prefix to signal that this is what client, 
> git-clone here, sends; I have added also explicit "\n" to mark LF
> characters terminating lines, and put each pkt-line on separate line)
> 
> gb>  C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\n
> gb>  C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
> gb>  C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
> gb>  C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
> gb>  C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n
> gb>  C: 0000
> gb>  C: 0009done\n
> 
> and where server response is (again the quote from "Git Community Book"
> was modified, removing here doublequotes and doubling of backslashes):

That should be fine.

Here's a dummy client that works against both jgit and C Git:

  perl -e '
	$h="fcfcfb1fd94829c1a1704f894fc111d14770d34e";
	$c="multi_ack side-band-64k ofs-delta";
    sub w{$_=shift;printf "%4.4x%s",4+length,$_;}
    w("git-upload-pack /.git");
    w("want $h $c\n");
    printf "0000";
    w("done")
  ' | nc localhost 9418

Are you sure you are flushing the IO buffers in the dummy client?

-- 
Shawn.


* Re: Request for detailed documentation of git pack protocol
  2009-06-05  0:45         ` Shawn O. Pearce
@ 2009-06-05  7:24           ` Jakub Narebski
  2009-06-05  8:45             ` Jakub Narebski
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-05  7:24 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git

On Fri, 5 Jun 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> > 
> > Here we were talking about the following part of exchange: 
> > (I have added "C:" prefix to signal that this is what client, 
> > git-clone here, sends; I have added also explicit "\n" to mark LF
> > characters terminating lines, and put each pkt-line on separate line)
> > 
> > gb>  C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\n
> > gb>  C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
> > gb>  C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
> > gb>  C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
> > gb>  C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n
> > gb>  C: 0000
> > gb>  C: 0009done\n
> > 
> > and where server response is (again the quote from "Git Community Book"
> > was modified, removing here doublequotes and doubling of backslashes):
> 
> That should be fine.
> 
> Here's a dummy client that works against both jgit and C Git:
> 
>   perl -e '
> 	$h="fcfcfb1fd94829c1a1704f894fc111d14770d34e";
> 	$c="multi_ack side-band-64k ofs-delta";
>     sub w{$_=shift;printf "%4.4x%s",4+length,$_;}
>     w("git-upload-pack /.git");
>     w("want $h $c\n");
>     printf "0000";
>     w("done")
>   ' | nc localhost 9418
> 
> Are you sure you are flushing the IO buffers in the dummy client?

That is not what I meant. Perhaps I didn't explain it clearly enough...

The above sequence works fine with dummy client in Perl; where it hangs
is when client tries to wait for server response (NAK or ACK) _before_
sending "done":

      $sock->print(pkt_line("want $h $c\n"));
      $sock->print("0000");
      $sock->flush();

      while (!$sock->eof()) {
        my $r = $sock->read($hex4, 4);  
        ...
      }

      $sock->print("0009done\n");
      $sock->flush();

But perhaps I did something wrong in my dummy client...


Also the flush "0000" seems to be required... but when I tried to repeat
it using the above example it actually does not hang, but doesn't get
PACK from git-daemon: there is something wrong in above snippet, as 
I get the same error on server whether I put "0000" flush line or not...

 c$  perl -e '
         my $h="c1e54552c9b35521f189db53db24cc82b5b75816";
         my $c="multi_ack side-band-64k ofs-delta";
         sub w{$_=shift;printf "%04x%s",4+length,$_;}
         w("git-upload-pack /git.git");
         w("want $h $c\n");
         ## printf "0000";    # <-- commented out!
         w("done");
     ' | nc localhost -v 9418
 
 s$  git daemon --export-all --verbose \
         --base-path=/home/local/scm/ /home/local/scm/
 [12791] Connection from 127.0.0.1:42484
 [12791] Request upload-pack for '/git.git'
 fatal: git upload-pack: not our ref c1e54552c9b35521f189db53db24cc82b5b75816 multi_ack side-band-64k ofs-delta

 [12692] [12791] Disconnected (with error)

-- 
Jakub Narebski
Poland


* Re: Request for detailed documentation of git pack protocol
  2009-06-05  7:24           ` Jakub Narebski
@ 2009-06-05  8:45             ` Jakub Narebski
  0 siblings, 0 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-05  8:45 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, git

Jakub Narebski <jnareb@gmail.com> writes:

> Also the flush "0000" seems to be required... but when I tried to repeat
> it using the above example it actually does not hang, but doesn't get
> PACK from git-daemon: there is something wrong in above snippet, as 
> I get the same error on server whether I put "0000" flush line or not...
> 
>  c$  perl -e '
>          my $h="c1e54552c9b35521f189db53db24cc82b5b75816";
>          my $c="multi_ack side-band-64k ofs-delta";
>          sub w{$_=shift;printf "%04x%s",4+length,$_;}
>          w("git-upload-pack /git.git");
>          w("want $h $c\n");
>          ## printf "0000";    # <-- commented out!
>          w("done");
>      ' | nc localhost -v 9418
>  
>  s$  git daemon --export-all --verbose \
>          --base-path=/home/local/scm/ /home/local/scm/
>  [12791] Connection from 127.0.0.1:42484
>  [12791] Request upload-pack for '/git.git'
>  fatal: git upload-pack: not our ref c1e54552c9b35521f189db53db24cc82b5b75816 multi_ack side-band-64k ofs-delta
> 
>  [12692] [12791] Disconnected (with error)

It does work with git-clone ("git clone git://localhost/git.git"), so
the problem is with the above snippet, not with the git-daemon invocation.

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Request for detailed documentation of git pack protocol
  2009-06-03 16:56                   ` Shawn O. Pearce
  2009-06-03 20:19                     ` Jakub Narebski
@ 2009-06-06 16:33                     ` Scott Chacon
  2009-06-06 17:24                       ` Junio C Hamano
  2009-06-06 17:41                       ` Jakub Narebski
  1 sibling, 2 replies; 66+ messages in thread
From: Scott Chacon @ 2009-06-06 16:33 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jakub Narebski, git, Junio C Hamano

Hey,

On Wed, Jun 3, 2009 at 9:56 AM, Shawn O. Pearce<spearce@spearce.org> wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
>> > Oh, and send-pack/receive-pack protocol now has ".have" refs [...]
>>
>> What are those ".have" refs? They are not described in current version
>> of "Transfer Protocols" (sub)section in "The Community Book". I remember
>> that they were discussed on git mailing list, but I don't remember what
>> they were about...
>
> If the remote receiving repository has alternates, the ".have" refs are
> the refs of the alternate repositories.  This signals to the client that
> the server has these objects reachable, but the client isn't permitted
> to send commands to alter these refs.

Can someone help me out with the '.have' refs?  What do they look
like?  Is this the same as the '.keep' ref Jakub mentioned earlier in
one of the example server responses?

I'm trying to take this whole thread and actually write an RFC style
document for all of this stuff, but I'm still unclear on the .have
portion of the conversation.  Pointing me to an earlier relevant
thread in the Git mailing list would be fine, too - it's difficult to
search for '.have' usefully.

Thanks!
Scott

>
> Its good for a site like GitHub or repo.or.cz where cheap forks are
> implemented by creating a repository that points to a common shared
> base via alternates.  The ".have" refs say that the server already
> has everything in that common shared base, so the client doesn't
> have to re-upload the entire project if the fork started out empty,
> or had all refs deleted from it.
>
>> > In packed-refs, Junio had a hard time adding the "peeled-refs"
>> > support, because the first version of the parser was so strict.
>> > But again, somehow he managed to find a backdoor in the old parser,
>> > and that backdoor is why that file looks the way it does today.
>>
>> I don't remember what that was about... Nevertheless now we have
>> kind of 'capabilities' section in .git/packed-refs
>
> Sort of.  In a file format its worse than a network protocol,
> because the file can't alter its contents based on what the
> reader can understand.
>
>> Interesting... even more so that this problem of designing without
>> extendability in mind (magic number + version) is so persistent :-(
>
> I know.  I think we maybe have learned the lesson.  I don't know.
>
> --
> Shawn.
>


* Re: Request for detailed documentation of git pack protocol
  2009-06-06 16:33                     ` Scott Chacon
@ 2009-06-06 17:24                       ` Junio C Hamano
  2009-06-06 17:41                       ` Jakub Narebski
  1 sibling, 0 replies; 66+ messages in thread
From: Junio C Hamano @ 2009-06-06 17:24 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, Jakub Narebski, git, Junio C Hamano

Scott Chacon <schacon@gmail.com> writes:

> I'm trying to take this whole thread and actually write an RFC style
> document for all of this stuff, but I'm still unclear on the .have
> portion of the conversation.  Pointing me to an earlier relevant
> thread in the Git mailing list would be fine, too - it's difficult to
> search for '.have' usefully.

The actual patch series is this.

    http://thread.gmane.org/gmane.comp.version-control.git/95351

The thread the patch series's cover letter refers as "the topic discussed
earlier" is this.

    http://thread.gmane.org/gmane.comp.version-control.git/95072/focus=95256

Here is how people can dig this, for people's reference.

 (1) Where in the code is this feature implemented?

     $ git grep -n -e '\.have' -- '*.c'
     builtin-receive-pack.c:647:             add_extra_ref(".have",...
     connect.c:87:               name_len == 5 && !memcmp(".have", ...

 (2) When was it added?

     $ git blame -L 645,650 builtin-receive-pack.c
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 645)              ex
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 646)              ex
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 647)                
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 648)         }
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 649)         transpo
     d79796bc (Junio C Hamano 2008-09-09 01:27:10 -0700 650)         free(ot
  
 (3) Go to http://news.gmane.org/gmane.comp.version-control.git/ and page
     back to the timeframe:


* Re: Request for detailed documentation of git pack protocol
  2009-06-06 16:33                     ` Scott Chacon
  2009-06-06 17:24                       ` Junio C Hamano
@ 2009-06-06 17:41                       ` Jakub Narebski
  1 sibling, 0 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-06 17:41 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Shawn O. Pearce, git, Junio C Hamano

On Sat, 6 Jun 2009, Scott Chacon wrote:
> On Wed, Jun 3, 2009 at 9:56 AM, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Jakub Narebski <jnareb@gmail.com> wrote:
>>> On Wed, 3 Jun 2009, Shawn O. Pearce wrote:
>>>>
>>>> Oh, and send-pack/receive-pack protocol now has ".have" refs [...]
>>>
>>> What are those ".have" refs? They are not described in current version
>>> of "Transfer Protocols" (sub)section in "The Community Book". I remember
>>> that they were discussed on git mailing list, but I don't remember what
>>> they were about...
>>
>> If the remote receiving repository has alternates, the ".have" refs are
>> the refs of the alternate repositories.  This signals to the client that
>> the server has these objects reachable, but the client isn't permitted
>> to send commands to alter these refs.
> 
> Can someone help me out with the '.have' refs?  What do they look
> like?  Is this the same as the '.keep' ref Jakub mentioned earlier in
> one of the example server responses?

This was my mistake, and a double mistake at that. It is 
'.have', not '.keep', and (as Shawn said) it is found in the response
from git-receive-pack during a _push_, not in the reply from
git-upload-pack during fetch / clone.

If a repository you want to push to uses alternates (e.g. was cloned
using --shared option, or using --reference=<repository path> option),
then refs from repository which serves as source of alternate 
(additional) object database are shown as '.have' refs.

Create some repository, add some objects to it so that it is not empty,
then clone it (locally) using e.g. "git clone --mirror --shared",
do some work on the clone (for example delete one of the branches), then
try to push.

I used "ssh localhost git-receive-pack /path/to/clone.git" as a dummy
client to see what response from git-receive-pack would be:

  0059c0a92eb6f58c25a4c00e5e754e6de83e103231a1 .have report-status delete-refs ofs-delta\n 
  0033efd990cb1a5f35b2b3e8b0ef0a85f43b118b8688 .have\n
  0033c0a92eb6f58c25a4c00e5e754e6de83e103231a1 .have\n
  003fefd990cb1a5f35b2b3e8b0ef0a85f43b118b8688 refs/heads/master\n
  0000

The '.have' entries are the refs of the repository from which the given
repository borrows objects, i.e. the one whose object database is listed
in the $GIT_DIR/objects/info/alternates file.
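
A client reading such an advertisement has to keep the '.have' entries
apart from the real refs.  A minimal Python sketch of that, working on
payloads whose pkt-line length prefix has already been stripped, could
look like the following.  The function name split_advertisement is made
up, and accepting either a NUL or a space in front of the capabilities
on the first line is an assumption made here so that the output as shown
above can be fed in directly.

```python
def split_advertisement(lines):
    """Separate real refs, borrowed '.have' object names and capabilities."""
    refs, borrowed, capabilities = {}, [], []
    for index, line in enumerate(lines):
        sha, rest = line.rstrip("\n").split(" ", 1)
        if index == 0:
            # Capabilities ride on the first line only.
            separator = "\0" if "\0" in rest else " "
            rest, _, cap_text = rest.partition(separator)
            capabilities = cap_text.split()
        if rest == ".have":
            borrowed.append(sha)   # object is available, but there is no ref to update
        else:
            refs[rest] = sha
    return refs, borrowed, capabilities


refs, borrowed, capabilities = split_advertisement([
    "c0a92eb6f58c25a4c00e5e754e6de83e103231a1 .have report-status delete-refs ofs-delta\n",
    "efd990cb1a5f35b2b3e8b0ef0a85f43b118b8688 .have\n",
    "c0a92eb6f58c25a4c00e5e754e6de83e103231a1 .have\n",
    "efd990cb1a5f35b2b3e8b0ef0a85f43b118b8688 refs/heads/master\n",
])
print(refs, borrowed, capabilities)
```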

Sidenote: here as far as I can see we do not use "\0" trick... 
which is a bit strange (at least for me).

> 
> I'm trying to take this whole thread and actually write an RFC style
> document for all of this stuff, but I'm still unclear on the .have
> portion of the conversation.  Pointing me to an earlier relevant
> thread in the Git mailing list would be fine, too - it's difficult to
> search for '.have' usefully.

Well, the "Transfer Protocols" section in "Git Community Book" is a good
start. I think that it would be better to have the pack protocol described
there first, not necessarily with the amount of detail required for 
formal technical reference documentation such as an RFC.

The "Pushing Data" subsection in "Transfer Protocols" currently consists
of two short paragraphs... and that is the section where the
description of fake '.have' refs should go. They do not matter and
are not used for fetching.

P.S. I'll try to send, as a summary of this thread (and my experiments),
second round of my comments about smart protocols section of "Transfer
Protocols" section soon, and most probably third round would be in the
form of a patch to Markdown sources for this section (chapter).

-- 
Jakub Narebski
Poland


* Comments pack protocol description in "Git Community Book" (second round)
  2009-06-02 21:39     ` Jakub Narebski
                         ` (2 preceding siblings ...)
  2009-06-04 20:55       ` Jakub Narebski
@ 2009-06-06 21:38       ` Jakub Narebski
  2009-06-06 21:58         ` Scott Chacon
  2009-06-07 20:06         ` Comments pack protocol description in "Git Community Book" (second round) Shawn O. Pearce
  3 siblings, 2 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-06 21:38 UTC (permalink / raw)
  To: Scott Chacon
  Cc: Shawn O. Pearce, git, Junio C Hamano, Andreas Ericsson,
	Tony Finch, Johannes Sixt, Linus Torvalds

There are beginnings of description of git pack protocol in section
"Transfer Protocols"[1][2] of chapter "7. Internals and Plumbing"
of "Git Community Book" (http://book.git-scm.com).

 [1] http://book.git-scm.com/7_transfer_protocols.html
 [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown

This is second round of my comments about this item. I'd like to have
some more comments about git pack protocol before trying to come up
with formulation which is good enough to send as patch against source
of mentioned section.

The relevant parts of the above source are quoted as if they were an email
I am replying to.

I have CC-ed everybody who participated in this subthread (originally
named "Re: Request for detailed documentation of git pack protocol").

....
> ### Fetching Data with Upload Pack ###
> 
> For the smarter protocols, fetching objects is much more efficient.  A
> socket is opened, either over ssh or over port 9418 (in the case of
> the git:// protocol), and the git-fetch-pack(1) command on the client
> begins communicating with a forked git-upload-pack(1) process on the
> server.
> 
> Then the server will tell the client which SHAs it has for each ref,
> and the client figures out what it needs and responds with a list of
> SHAs it wants and already has.

It would probably be clearer to state explicitly here that there
are two lists, i.e. "a list of SHAs it wants and a list of SHAs it
already has".

> 
> At this point, the server will generate a packfile with all the
> objects that the client needs and begin streaming it down to the
> client.

This is a bit of an oversimplification.  In the simplest case, such as a
client using git-clone to get all objects, it is true that the server can
generate a packfile and stream it to the client as soon as the client has
sent its list of wanted SHAs.  In a more complicated case, however, there
can be a series of exchanges between client and server, with the client
sending sets of commits it has, and the server responding whether that is
enough (or perhaps that this line of commits is uninteresting)... and only
then arriving at the list of objects to send in a packfile.

> 
> Let's look at an example.

I think that before example we should have short description (sketch)
of the whole exchange; for example the one taken from
'Documentation/technical/pack-protocol.txt':

upload-pack (S) | fetch/clone-pack (C) protocol:

  # Tell the puller what commits we have and what their names are
  S: SHA1 name
  S: ...
  S: SHA1 name
  S: # flush -- it's your turn
  # Tell the pusher what commits we want, and what we have
  C: want name
  C: ..
  C: want name
  C: have SHA1
  C: have SHA1
  C: ...
  C: # flush -- occasionally ask "had enough?"
  S: NAK
  C: have SHA1
  C: ...
  C: have SHA1
  S: ACK
  C: done
  S: XXXXXXX -- packfile contents.

> 
> The client connects and sends the request header. The clone command
> 
> 	$ git clone git://myserver.com/project.git
> 
> produces the following request:
> 
> 	0033git-upload-pack /project.git\\000host=myserver.com\\000

Although fetching via the SSH protocol is, I guess, much rarer than
fetching via the anonymous unauthenticated git:// protocol, it _might_ be
a good idea to say here how fetching via SSH differs from the above
sequence: instead of opening a TCP connection to port 9418,
sending the above packet, and later reading from and writing to the socket,
"git clone ssh://myserver.com/srv/git/project.git" calls

	ssh myserver.com git-upload-pack /srv/git/project.git

and then reads from the standard output of that command, and writes
to its standard input.

The rest of exchange is _identical_ for git:// and for ssh:// (and
I guess also for file:// pseudoprotocol).
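
The difference between the two transports can be summarised in a short
Python sketch.  The function plan_fetch and its return shapes are
invented for this illustration; it only decides what would be run or
sent and does not open any connection.  It also ignores details a real
client has to handle, such as a user name in the URL or shell quoting of
the path.

```python
from urllib.parse import urlsplit

GIT_PORT = 9418  # port used by the git:// protocol


def plan_fetch(url):
    """Describe how a fetch over ssh:// or git:// would start talking to upload-pack."""
    parts = urlsplit(url)
    if parts.scheme == "ssh":
        # Run the command remotely, then talk over its stdin/stdout.
        return ("exec", ["ssh", parts.hostname, "git-upload-pack", parts.path])
    if parts.scheme == "git":
        # Open a TCP socket and send one request packet first.
        body = f"git-upload-pack {parts.path}\0host={parts.hostname}\0".encode()
        packet = b"%04x" % (len(body) + 4) + body
        return ("tcp", (parts.hostname, parts.port or GIT_PORT), packet)
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")


print(plan_fetch("ssh://myserver.com/srv/git/project.git"))
print(plan_fetch("git://myserver.com/project.git"))
```

After this first step both variants carry the same stream of pkt-lines.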

> 
> The first four bytes contain the hex length of the line (including 4
> byte line length and trailing newline if present). Following are the
> command and arguments. This is followed by a null byte and then the
> host information. The request is terminated by a null byte.

I think it would be better to describe packet (chunk) format, called
pkt-line in git, separately from describing the contents of above
packet; either first pkt-line then command, or first command then
pkt-line.  Otherwise we would be left with describing pkt-line format
many times, as it is done in current version of this chapter.

In git the client communicates with the server using a packetized stream,
where each line (packet, chunk) is preceded by its length (including
the header) as a 4-byte hex number.  A length of 'zero', i.e. the packet
"0000", has a special meaning: it means end of stream / flush
connection.  The "# flush ..." in the description of the client--server
exchange above is done using exactly this "0000" packet. 

Footnote: this format somewhat resembles the 'chunked' transfer
encoding used in HTTP[1], although there are differences.
  [1] http://en.wikipedia.org/wiki/Chunked_transfer_encoding
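
As an illustration of the format, a reader for such a stream might be
sketched in Python as follows.  The name read_pkt_lines and the choice
of yielding None for a flush packet are conventions made up for this
example, not anything taken from git itself.

```python
import io


def read_pkt_lines(stream):
    """Yield each packet payload from a binary stream; yield None for a flush."""
    while True:
        header = stream.read(4)
        if not header:
            return                      # end of input
        if len(header) < 4:
            raise ValueError("truncated length header")
        length = int(header, 16)
        if length == 0:
            yield None                  # "0000" flush packet carries no payload
            continue
        if length < 4:
            raise ValueError("length must include the 4 header bytes")
        payload = stream.read(length - 4)
        if len(payload) != length - 4:
            raise ValueError("truncated payload")
        yield payload


wire = b"0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n" b"0000" b"0009done\n"
packets = list(read_pkt_lines(io.BytesIO(wire)))
print(packets)
```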

> 
> The request is processed and turned into a call to git-upload-pack:
> 
>  	$ git-upload-pack /path/to/repos/project.git

This is alternate place where we could tell about fetching via ssh://

We should probably say where the /path/to/repos prefix in front of
/project.git comes from: it is the --base-path=/path/to/repos
argument to git-daemon (a sort of "GIT root").

BTW. (this is just a very minor nit) shouldn't we use FHS compliant
path, i.e. "/srv/git" instead of "/path/to/repos" (and follow RFC in
using "example.com" in place of "myserver.com")?

> 
> This immediately returns information of the repo:
> 
> 	008874730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag\\n
> 	003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\\n
> 	003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist\\n
> 	003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local\\n
> 	003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\\n
> 	0000

I have added explicit LF terminators in the form of "\\n" (which would
render as "\n"), mainly because the "0000" flush packet _doesn't_ have one.
I have also added "include-tag", as modern git installations provide
this capability.

Here is a dilemma: currently example output is provided almost exactly
as-is, only indented and with some quoting/escaping (\\000 or \\0 for
NUL character, \\n for LF, later \\001 and \\002 for 0x01 and 0x02
bytes).  To know if given example output is what client sends or what
server outputs, you have to read the narrative.  Alternate solution
would be to use "C: " and "S: " prefixing (perhaps with some extra
format to make it more clear that it is not part of data), used in
pack-protocol.txt technical documentation, and proposed for describing
network protocols by some RFC (I don't remember which, unfortunately).
Which one to choose?

At some point we would want to describe that the first line of the first
response from the server carries, 'stuffed' behind a "\0" (NUL), a
space-separated list of the capabilities the server supports.  Those
capabilities would have to be described somewhere: as a sidebar, 
in a separate subsection, or in an appendix.
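
To show what "stuffed behind NUL" means for a client, here is a hedged
Python sketch that parses a ref advertisement and pulls the capability
list off the first line.  The helper names pkt_line and
parse_advertisement are made up, the sample is shortened, and the sample
bytes are framed by the helper rather than copied from a real server so
that the length prefixes are computed, not typed by hand.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload with its 4-digit hex length (header included)."""
    return b"%04x" % (len(payload) + 4) + payload


def parse_advertisement(data: bytes):
    """Return ([(name, sha), ...], [capability, ...]) from data ending in a flush packet."""
    refs, capabilities, pos = [], [], 0
    while True:
        length = int(data[pos:pos + 4], 16)   # raises ValueError if the flush is missing
        if length == 0:
            break                             # "0000" ends the advertisement
        payload = data[pos + 4:pos + length].rstrip(b"\n")
        pos += length
        if not refs:
            # Only the first line carries capabilities, hidden behind a NUL byte.
            payload, _, caps = payload.partition(b"\0")
            capabilities = caps.decode().split()
        sha, name = payload.decode().split(" ", 1)
        refs.append((name, sha))
    return refs, capabilities


sample = (
    pkt_line(b"74730d410fcb6603ace96f1dc55ea6196122532d HEAD\0"
             b"multi_ack thin-pack side-band side-band-64k ofs-delta\n")
    + pkt_line(b"7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n")
    + b"0000"
)
refs, capabilities = parse_advertisement(sample)
print(refs)
print(capabilities)
```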

Below there is (for completeness) list of git-upload-pack
capabilities, with short description of each:

 * multi_ack (for historical reasons not multi-ack)

   It allows the server to return "ACK $SHA1 continue" as soon as it
   finds a commit that it can use as a common base, between the
   client's wants and the client's have set.

   By sending this early, the server can potentially head off the
   client from walking any further down that particular branch of the
   client's repository history.

   See the thread for more details (posts by Shawn O. Pearce and by
   Junio C Hamano).

 * thin-pack

   Server can send thin packs, i.e. packs which do not contain base
   elements for some delta chains, if those base elements are
   available on client side.  Client has thin-pack capability when it
   understands how to "thicken" them by adding the required delta bases,
   making those packfiles independent.

   Of course it doesn't make sense for client to use (request) this
   capability for git-clone... But if the client does request it (and
   I think modern clients actually do request it, even on initial
   clone case) the server won't produce a thin pack. Why?  There is no
   common base, so there is no uninteresting set to omit from the
   pack.  :-)

 * side-band 
 * side-band-64k 

   This means that the server can send, and the client understands, multiplexed
   (muxed) progress reports and error info interleaved with the
   packfile itself.

   These two options are mutually exclusive. A client should ask for
   only one of them, and a modern client always favors side-band-64k.
   If the client asks for both, the server uses side-band-64k.

   Older side-band allows only up to 1000 bytes per packet.

 * ofs-delta 

   Server can send, and client understands, PACKv2 with deltas referring
   to their base by position in the pack rather than by SHA-1.  Both can
   send/read OBJ_OFS_DELTA, aka type 6 in a pack file.

 * shallow 

   Server can send shallow clone (git clone --depth ...).

 * no-progress

   Client should use it if it was started with "git clone -q" or
   something, and doesn't want that sideband 2.  We still want
   sideband 1 with actual data (packfile), and sideband 3 with error
   messages.

 * include-tag

   If we pack an object to the client, and a tag points exactly at
   that object, we pack the tag too.  In general this allows a client
   to get all new tags when it fetches a branch, in a single network
   connection, instead of two (separate connection for tags).

   This capability is not to be used when client was called with
   '--no-tags'.
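
The rules in the list above can be condensed into a small selection
routine.  The following Python sketch is only one possible reading of
them: the function choose_capabilities, its quiet and no_tags flags, and
the decision to always ask for multi_ack, thin-pack and ofs-delta when
offered are assumptions made for this illustration, not a description of
what git-fetch-pack really does.

```python
def choose_capabilities(server_caps, quiet=False, no_tags=False):
    """Pick the capabilities a client could request from those the server advertised."""
    offered = set(server_caps)
    chosen = [cap for cap in ("multi_ack", "thin-pack", "ofs-delta") if cap in offered]
    # The two side-band variants are mutually exclusive; prefer the 64k one.
    if "side-band-64k" in offered:
        chosen.append("side-band-64k")
    elif "side-band" in offered:
        chosen.append("side-band")
    # Suppress progress messages only when running quietly.
    if quiet and "no-progress" in offered:
        chosen.append("no-progress")
    # Do not ask for automatic tag following under --no-tags.
    if not no_tags and "include-tag" in offered:
        chosen.append("include-tag")
    return chosen


server = ("multi_ack thin-pack side-band side-band-64k ofs-delta "
          "shallow no-progress include-tag").split()
print(choose_capabilities(server))
print(choose_capabilities(server, quiet=True, no_tags=True))
```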

> 
> Each line starts with a four byte line length declaration in hex. The
> section is terminated by a line length declaration of 0000.

This repetition would not be necessary if pkt-line format had its own
description somewhere before.  We would probably still want to remind
the reader that "0000" line length declaration means 'flush'.

> 
> This is sent back to the client verbatim. 

Hmmm... "sent back ... verbatim"? I wonder what you wanted to say
here...

> The client responds with another request:
> 
> 	0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\\n
> 	0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\\n
> 	0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\\n
> 	0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\\n
> 	0032want 74730d410fcb6603ace96f1dc55ea6196122532d\\n
> 	0000
>       0009done\\n

Here again I added explicit LF terminator, and split off "0000" flush
packet in separate line, to make this request (well, two requests)
more clear.

The first line of this request contains capabilities client wants to
use.  It should be some subset of capabilities server supports.
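
As a cross-check of the length prefixes in the request above, the whole
client request for a clone can be generated by a few lines of Python.
The names pkt_line and build_wants are chosen for this sketch only; it
covers just the clone case, where there are no "have" lines to send.

```python
def pkt_line(text: str) -> bytes:
    """Frame one line with its 4-digit hex length (header included)."""
    data = text.encode()
    return b"%04x" % (len(data) + 4) + data


def build_wants(shas, capabilities):
    """Return the packets a cloning client sends: wants, flush, done."""
    packets = []
    for index, sha in enumerate(shas):
        line = f"want {sha}"
        if index == 0 and capabilities:
            # Requested capabilities ride on the first want line only.
            line += " " + " ".join(capabilities)
        packets.append(pkt_line(line + "\n"))
    packets.append(b"0000")             # flush
    packets.append(pkt_line("done\n"))  # nothing more to say, send the pack
    return packets


for packet in build_wants(
    ["74730d410fcb6603ace96f1dc55ea6196122532d",
     "7d1665144a3a975c05f1f43902ddaf084e784dbe"],
    ["multi_ack", "side-band-64k", "ofs-delta"],
):
    print(packet)
```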

> 
> The is sent to the open git-upload-pack process which then streams out
> the final response:

"_The_ is sent"?

I would remove quotes around lines of server response below, but would
leave explicit \n for LF, and \\001 and \\002 for bytes 0x01 and 0x02
denoting channel.

> 
> 	"0008NAK\n"

This NAK means that server did not find [closed] set of common
ancestors. It is response to "0000" flush line ("had enough?" line)
from client. As the example is about git-clone, and client doesn't
_have_ any commits to show server as candidates for common ancestors
(calculation), it replies with "done" to get pack.

> 	"0023\\002Counting objects: 2797, done.\n"

This is a bit untypical example, as for larger repositories like Linux
kernel or even git repository, usually you would have much more
objects, and actually object enumeration would take more time.  You
would see many

	"0020\\002Counting objects: 10662   \r"
	"0020\\002Counting objects: 22318   \r"
	"0020\\002Counting objects: 29506   \r"

packets before

 	"0023\\002Counting objects: 65058, done.\n"

> 	"002b\\002Compressing objects:   0% (1/1177)   \r"
> 	"002c\\002Compressing objects:   1% (12/1177)   \r"
> 	"002c\\002Compressing objects:   2% (24/1177)   \r"
> 	"002c\\002Compressing objects:   3% (36/1177)   \r"
> 	"002c\\002Compressing objects:   4% (48/1177)   \r"
> 	"002c\\002Compressing objects:   5% (59/1177)   \r"
> 	"002c\\002Compressing objects:   6% (71/1177)   \r"
> 	"0053\\002Compressing objects:   7% (83/1177)   \rCompressing objects:   8% (95/1177)   \r"
> 	...
> 	"005b\\002Compressing objects: 100% (1177/1177)   \rCompressing objects: 100% (1177/1177), done.\n"

Sidenote: the reason why there is sometimes more than one line sent in
a single packet / single pkt-line is buffering between git-pack-objects
which produces those messages to pipe, and git-upload-pack which reads
them and sends them to client.  If pack-objects can write two messages
into the pipe buffer before upload-pack is woken to read them out,
upload-pack might find two (or more) messages ready to read without
blocking.  These get bundled into a single packet, because, why not,
it's easier to code it that way.

Here or a little later we probably should explain (even though it is
fairly obvious), that final response from server is (here) in pkt-line
with sideband format, where first byte of data denotes channel
(stream) number: 1 for data, 2 for progress info, 3 for fatal errors.
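
As a rough illustration of that framing, the following Python sketch builds one sideband packet on the sending side.  `sideband_packet` is a hypothetical helper name; the only assumptions are the ones stated above (4 hex digits of length covering the whole packet, then one channel byte, then the data).

```python
def sideband_packet(channel: int, data: bytes) -> bytes:
    """Wrap data in a pkt-line whose first payload byte is the channel number."""
    if channel not in (1, 2, 3):
        # 1 = pack data, 2 = progress info, 3 = fatal error message
        raise ValueError("unknown sideband channel %d" % channel)
    length = 4 + 1 + len(data)  # length header + channel byte + data
    return b"%04x" % length + bytes([channel]) + data
```

Feeding it the progress message from the example, `sideband_packet(2, b"Counting objects: 2797, done.\n")`, gives a packet that starts with "0023" followed by byte 0x02, matching the quoted output.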

> 	"2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225\\017x\\234\\235\\216K\n\\302"...
> 	"2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...

Here I think it would be enough to show only the fragment which is
packfile signature...

> 	...
> 	"0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"
> 	...
> 	"<\\276\\255L\\273s\\005\\001w0006\\001[0000"

This line is, I think, broken in the wrong place.  It is the tail
end of some packet (each packet begins with 4 characters wide 0-padded
length of chunk as hex number; "<\\276\\255L" does not match 4HEXDIG),
followed by "0000" 'flush' packet (here it signals end of stream).

> 	
> See the Packfile chapter previously for the actual format of the
> packfile data in the response.
> 
> 
....
-- 
Jakub Narebski
Poland


* Re: Comments pack protocol description in "Git Community Book"  (second round)
  2009-06-06 21:38       ` Comments pack protocol description in "Git Community Book" (second round) Jakub Narebski
@ 2009-06-06 21:58         ` Scott Chacon
  2009-06-07  8:21           ` Jakub Narebski
                             ` (2 more replies)
  2009-06-07 20:06         ` Comments pack protocol description in "Git Community Book" (second round) Shawn O. Pearce
  1 sibling, 3 replies; 66+ messages in thread
From: Scott Chacon @ 2009-06-06 21:58 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Shawn O. Pearce, git, Junio C Hamano, Andreas Ericsson,
	Tony Finch, Johannes Sixt, Linus Torvalds

Hey,

On Sat, Jun 6, 2009 at 2:38 PM, Jakub Narebski<jnareb@gmail.com> wrote:
> There are beginnings of description of git pack protocol in section
> "Transfer Protocols"[1][2] of chapter "7. Internals and Plumbing"
> of "Git Community Book" (http://book.git-scm.com).
>
>  [1] http://book.git-scm.com/7_transfer_protocols.html
>  [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
>
> This is second round of my comments about this item. I'd like to have
> some more comments about git pack protocol before trying to come up
> with formulation which is good enough to send as patch against source
> of mentioned section.
>

I can certainly fix up this chapter with these comments - I understand
the protocol a bit better now than I did when I originally wrote this.

In addition to that, I started taking a shot at putting together an
RFC formatted documentation of this protocol as was requested.  I may
have _way_ missed the mark on what you were looking for originally,
it's hard to say, not having read a lot of RFC documents - I probably
ended up writing in a more bookish format rather than a technical
spec, but whatever - maybe you'll find it helpful or can fix it up to
more what you were expecting.  I'm not done with it - some of it is
still basically unformatted comments from this previous thread, but at
least it's laid out roughly how I thought it might be useful and I
have fleshed out a lot of it.  You can find the RFC text output
document here:

http://git-scm.com/gitserver.txt

And the xml doc I generated it from here:

http://github.com/schacon/gitserver-rfc

Perhaps if we're going to spend time getting this all correct, we
should get a standalone technical doc all agreed upon, then I can
relatively easily extract what's needed into that chapter of the
Community book.

Thoughts?

Scott

> The relevant parts of above source are quoted as if they were email
> I am replying to.
>
> I have CC-ed everybody who participated in this subthread (originally
> named "Re: Request for detailed documentation of git pack protocol").
>
> ....
>> ### Fetching Data with Upload Pack ###
>>
>> For the smarter protocols, fetching objects is much more efficient.  A
>> socket is opened, either over ssh or over port 9418 (in the case of
>> the git:// protocol), and the git-fetch-pack(1) command on the client
>> begins communicating with a forked git-upload-pack(1) process on the
>> server.
>>
>> Then the server will tell the client which SHAs it has for each ref,
>> and the client figures out what it needs and responds with a list of
>> SHAs it wants and already has.
>
> It would be probably more clear here to state explicitly that there
> are two lists, i.e. "a list of SHAs it wants and a list of SHAs it
> already has".
>
>>
>> At this point, the server will generate a packfile with all the
>> objects that the client needs and begin streaming it down to the
>> client.
>
> This is a bit of oversimplification.  In most simple case like client
> using git-clone to get all objects it is true that server can generate
> packfile and stream it to client after client tells a list of wanted
> SHAs.  In more complicated case however there can be series of
> exchanges between client and server, with client sending sets of
> commits it have, and server responding whether it is enough (or
> perhaps this line of commits is uninteresting)... and only then
> arriving at list of objects to send in a packfile.
>
>>
>> Let's look at an example.
>
> I think that before example we should have short description (sketch)
> of the whole exchange; for example the one taken from
> 'Documentation/technical/pack-protocol.txt':
>
> upload-pack (S) | fetch/clone-pack (C) protocol:
>
>  # Tell the puller what commits we have and what their names are
>  S: SHA1 name
>  S: ...
>  S: SHA1 name
>  S: # flush -- it's your turn
>  # Tell the pusher what commits we want, and what we have
>  C: want name
>  C: ..
>  C: want name
>  C: have SHA1
>  C: have SHA1
>  C: ...
>  C: # flush -- occasionally ask "had enough?"
>  S: NAK
>  C: have SHA1
>  C: ...
>  C: have SHA1
>  S: ACK
>  C: done
>  S: XXXXXXX -- packfile contents.
>
>
>>
>> The client connects and sends the request header. The clone command
>>
>>       $ git clone git://myserver.com/project.git
>>
>> produces the following request:
>>
>>       0032git-upload-pack /project.git\\000host=myserver.com\\000
>
> Although fetching via SSH protocol is, I guess, much more rare than
> fetching via anonymous unauthenticated git:// protocol, it _might_ be
> good idea to tell there that fetching via SSH differs from above
> sequence that instead of opening TCP connection to port 9418 and
> sending above packet, and later reading from and writing to socket,
> "git clone ssh://myserver.com/srv/git/project.git" calls
>
>        ssh myserver.com git-upload-pack /srv/git/project.git
>
> and later reads from standard output of the above command, and writes
> to standard input of above command.
>
> The rest of exchange is _identical_ for git:// and for ssh:// (and
> I guess also for file:// pseudoprotocol).
>
>>
>> The first four bytes contain the hex length of the line (including 4
>> byte line length and trailing newline if present). Following are the
>> command and arguments. This is followed by a null byte and then the
>> host information. The request is terminated by a null byte.
>
> I think it would be better to describe packet (chunk) format, called
> pkt-line in git, separately from describing the contents of above
> packet; either first pkt-line then command, or first command then
> pkt-line.  Otherwise we would be left with describing pkt-line format
> many times, as it is done in current version of this chapter.
>
>
> In git clients communicates with server using a packetized stream,
> where each line (packet, chunk) is preceded by its length (including
> the header) as a 4-byte hex number.  A length of 'zero', i.e. packet
> "0000" has a special meaning: it means end of stream / flush
> connection.  The "# flush ..." in description of client--server
> exchange above is done using exactly "0000" packet.
>
> Footnote: this format somewhat reminds / resembles 'chunked' transfer
> encoding used in HTTP[1], although there are differences.
>  http://en.wikipedia.org/wiki/Chunked_transfer_encoding
>
>>
>> The request is processed and turned into a call to git-upload-pack:
>>
>>       $ git-upload-pack /path/to/repos/project.git
>
> This is alternate place where we could tell about fetching via ssh://
>
> We probably should tell where /path/to/repos that /project.git is
> prefixed with comes from; it is from --base-path=/path/to/repos
> argument to git-daemon (a sort of "GIT root").
>
> BTW. (this is just a very minor nit) shouldn't we use FHS compliant
> path, i.e. "/srv/git" instead of "/path/to/repos" (and follow RFC in
> using "example.com" in place of "myserver.com")?
>
>>
>> This immediately returns information of the repo:
>>
>>       007c74730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag\\n
>>       003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\\n
>>       003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist\\n
>>       003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local\\n
>>       003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\\n
>>       0000
>
> I have added explicit LF terminators in the form of "\\n" (which would
> render as "\n"), mainly because "0000" flush packet _doesn't_ have it.
> Also I have added "include-tag", as modern git installations provide
> this capability.
>
> Here is a dilemma: currently example output is provided almost exactly
> as-is, only indented and with some quoting/escaping (\\000 or \\0 for
> NUL character, \\n for LF, later \\001 and \\002 for 0x01 and 0x02
> bytes).  To know if given example output is what client sends or what
> server outputs, you have to read the narrative.  Alternate solution
> would be to use "C: " and "S: " prefixing (perhaps with some extra
> format to make it more clear that it is not part of data), used in
> pack-protocol.txt technical documentation, and proposed for describing
> network protocols by some RFC (I don't remember which, unfortunately).
> Which one to choose?
>
>
> We would want, at some point, describe that first line of first
> response from server contains 'stuffed' behind "\0" (NUL) space
> separated list of capabilities our server supports.  Those
> capabilities would have to be described somewhere: as a sidebar,
> or in a separate subsection, or in an appendix.
>
> Below there is (for completeness) list of git-upload-pack
> capabilities, with short description of each:
>
>  * multi_ack (for historical reasons not multi-ack)
>
>   It allows the server to return "ACK $SHA1 continue" as soon as it
>   finds a commit that it can use as a common base, between the
>   client's wants and the client's have set.
>
>   By sending this early, the server can potentially head off the
>   client from walking any further down that particular branch of the
>   client's repository history.
>
>   See the thread for more details (posts by Shawn O. Pearce and by
>   Junio C Hamano).
>
>  * thin-pack
>
>   Server can send thin packs, i.e. packs which do not contain base
>   elements for some delta chains, if those base elements are
>   available on client side.  Client has thin-pack capability when it
>   understands how to "thicken" them adding required delta bases,
>   making those packfiles independent.
>
>   Of course it doesn't make sense for client to use (request) this
>   capability for git-clone... But if the client does request it (and
>   I think modern clients actually do request it, even on initial
>   clone case) the server won't produce a thin pack. Why?  There is no
>   common base, so there is no uninteresting set to omit from the
>   pack.  :-)
>
>  * side-band
>  * side-band-64k
>
>   This means that server can send, and client understand multiplexed
>   (muxed) progress reports and error info interleaved with the
>   packfile itself.
>
>   These two options are mutually exclusive. A client should ask for
>   only one of them, and a modern client always favors side-band-64k.
>   If client ask for both, server uses side-band-64k.
>
>   Older side-band allows only up to 1000 bytes per packet.
>
>  * ofs-delta
>
>   Server can send, and client understand PACKv2 with delta referring
>   to its base by position in pack rather than by SHA-1.  Both can
>   send/read OBJ_OFS_DELTA, aka type 6 in a pack file.
>
>  * shallow
>
>   Server can send shallow clone (git clone --depth ...).
>
>  * no-progress
>
>   Client should use it if it was started with "git clone -q" or
>   something, and doesn't want that side brand 2.  We still want
>   sideband 1 with actual data (packfile), and sideband 3 with error
>   messages.
>
>  * include-tag
>
>   If we pack an object to the client, and a tag points exactly at
>   that object, we pack the tag too.  In general this allows a client
>   to get all new tags when it fetches a branch, in a single network
>   connection, instead of two (separate connection for tags).
>
>   This capability is not to be used when client was called with
>   '--no-tags'.
>
>>
>> Each line starts with a four byte line length declaration in hex. The
>> section is terminated by a line length declaration of 0000.
>
> This repetition would not be necessary if pkt-line format had its own
> description somewhere before.  We would probably still want to remind
> the reader that "0000" line length declaration means 'flush'.
>
>>
>> This is sent back to the client verbatim.
>
> Hmmm... "sent back ... verbatim"? I wonder what did you want to say
> here...
>
>> The client responds with another request:
>>
>>       0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\\n
>>       0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\\n
>>       0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\\n
>>       0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\\n
>>       0032want 74730d410fcb6603ace96f1dc55ea6196122532d\\n
>>       0000
>>       0009done\\n
>
> Here again I added explicit LF terminator, and split off "0000" flush
> packet in separate line, to make this request (well, two requests)
> more clear.
>
> The first line of this request contains capabilities client wants to
> use.  It should be some subset of capabilities server supports.
>
>>
>> The is sent to the open git-upload-pack process which then streams out
>> the final response:
>
> "_The_ is send"?
>
> I would remove quotes around lines of server response below, but would
> leave explicit \n for LF, and \\001 and \\002 for bytes 0x01 and 0x02
> denoting channel.
>
>>
>>       "0008NAK\n"
>
> This NAK means that server did not find [closed] set of common
> ancestors. It is response to "0000" flush line ("had enough?" line)
> from client. As the example is about git-clone, and client doesn't
> _have_ any commits to show server as candidates for common ancestors
> (calculation), it replies with "done" to get pack.
>
>>       "0023\\002Counting objects: 2797, done.\n"
>
> This is a bit untypical example, as for larger repositories like Linux
> kernel or even git repository, usually you would have much more
> objects, and actually object enumeration would take more time.  You
> would see many
>
>        "0020\\002Counting objects: 10662   \r"
>        "0020\\002Counting objects: 22318   \r"
>        "0020\\002Counting objects: 29506   \r"
>
> packets before
>
>        "0023\\002Counting objects: 65058, done.\n"
>
>>       "002b\\002Compressing objects:   0% (1/1177)   \r"
>>       "002c\\002Compressing objects:   1% (12/1177)   \r"
>>       "002c\\002Compressing objects:   2% (24/1177)   \r"
>>       "002c\\002Compressing objects:   3% (36/1177)   \r"
>>       "002c\\002Compressing objects:   4% (48/1177)   \r"
>>       "002c\\002Compressing objects:   5% (59/1177)   \r"
>>       "002c\\002Compressing objects:   6% (71/1177)   \r"
>>       "0053\\002Compressing objects:   7% (83/1177)   \rCompressing objects:   8% (95/1177)   \r"
>>       ...
>>       "005b\\002Compressing objects: 100% (1177/1177)   \rCompressing objects: 100% (1177/1177), done.\n"
>
> Sidenote: the reason why there is sometimes more than one line sent in
> a single packet / single pkt-line is buffering between git-pack-objects
> which produces those messages to pipe, and git-upload-pack which reads
> them and sends them to client.  If pack-objects can write two messages
> into the pipe buffer before upload-pack is woken to read them out,
> upload-pack might find two (or more) messages ready to read without
> blocking.  These get bundled into a single packet, because, why not,
> it's easier to code it that way.
>
> Here or a little later we probably should explain (even though it is
> fairly obvious), that final response from server is (here) in pkt-line
> with sideband format, where first byte of data denotes channel
> (stream) number: 1 for data, 2 for progress info, 3 for fatal errors.
>
>>       "2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225\\017x\\234\\235\\216K\n\\302"...
>>       "2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...
>
> Here I think it would be enough to show only the fragment which is
> packfile signature...
>
>>       ...
>>       "0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"
>>       ...
>>       "<\\276\\255L\\273s\\005\\001w0006\\001[0000"
>
> This line is, I think, broken in the wrong place.  It is the tail
> end of some packet (each packet begins with 4 characters wide 0-padded
> length of chunk as hex number; "<\\276\\255L" does not match 4HEXDIG),
> followed by "0000" 'flush' packet (here it signals end of stream).
>
>>
>> See the Packfile chapter previously for the actual format of the
>> packfile data in the response.
>>
>>
> ....
> --
> Jakub Narebski
> Poland
>


* Re: Comments pack protocol description in "Git Community Book"  (second round)
  2009-06-06 21:58         ` Scott Chacon
@ 2009-06-07  8:21           ` Jakub Narebski
  2009-06-07 20:13             ` Shawn O. Pearce
  2009-06-07 20:43           ` Shawn O. Pearce
  2009-06-13  9:30           ` Comments pack protocol description in "RFC for the Git Packfile Protocol" (long) Jakub Narebski
  2 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-07  8:21 UTC (permalink / raw)
  To: Scott Chacon
  Cc: Shawn O. Pearce, git, Junio C Hamano, Andreas Ericsson,
	Tony Finch, Johannes Sixt, Linus Torvalds

On Sat, 6 June 2009, Scott Chacon wrote:

> In addition to that, I started taking a shot at putting together an
> RFC formatted documentation of this protocol as was requested.  I may
> have _way_ missed the mark on what you were looking for originally,
> it's hard to say, not having read a lot of RFC documents - I probably
> ended up writing in a more bookish format rather than a technical
> spec, but whatever - maybe you'll find it helpful or can fix it up to
> more what you were expecting.  I'm not done with it - some of it is
> still basically unformatted comments from this previous thread, but at
> least it's laid out roughly how I thought it might be useful and I
> have fleshed out a lot of it.  You can find the RFC text output
> document here:
> 
> http://git-scm.com/gitserver.txt

[...]
> Thoughts?

Those are only preliminary thoughts; more detailed analysis is to follow 
(hopefully).

Usually RFC documents refer to RFC 2119 (Key words for use in RFCs to 
Indicate Requirement Levels) for definitions of words such as MUST, 
SHOULD, MAY in the following way:

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   [RFC 2119][1].

 [1]: http://tools.ietf.org/html/rfc2119

Definitions are done using RFC 5234 (Augmented BNF for Syntax 
Specifications: ABNF), referring to it for example in the following 
way:

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF).  It is described in
   detail in [RFC 5234][2].

 [2]: http://tools.ietf.org/html/rfc5234

The description of pkt-line and pkt-line-sb formats is wrong: length
includes the header. It is IMHO more natural to define it from 
generality to detail, and not in reverse direction; instead of this:

   pkt-length = 4HEXDIGIT   ; length of pkt-payload
   pkt-line   = pkt-length pkt-payload [ LF / CR ]

for example like this:

   pkt-line   = pkt-length pkt-payload [ LF ]
   pkt-length = 4HEXDIGIT   ; length of pkt-line (including pkt-length)

By the way, we probably accept any terminator, but I'd rather standardize 
on LF ("\n").
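
A small Python sketch of a reader for this definition may help to show why it matters that pkt-length counts itself.  `read_pkt_line` is an illustrative name, and treating "0000" as flush follows the earlier discussion in this thread; everything else is plain file-object reading.

```python
import io

HEXDIGITS = b"0123456789abcdefABCDEF"


def read_pkt_line(stream):
    """Return the payload of the next pkt-line, or None for a flush ("0000")."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated pkt-line header")
    if any(byte not in HEXDIGITS for byte in header):
        raise ValueError("pkt-length is not 4HEXDIGIT: %r" % header)
    length = int(header, 16)
    if length == 0:
        return None  # flush packet carries no payload
    if length < 4:
        raise ValueError("pkt-length smaller than its own header: %r" % header)
    payload = stream.read(length - 4)  # the length includes the 4 header bytes
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload
```

For example, reading from `io.BytesIO(b"0009done\n0000")` returns `b"done\n"` on the first call and `None` (flush) on the second.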

In description of sideband:

>  When a sideband is used, 2 means "progress messages, most likely
>  suitable for stderr". 1 means "pack data". 3 means "fatal error
>  message, and we're dead now".  No other channels are used or valid.

it is true that no other channels are used, but it is not true that 
other channels are invalid. If they are not supported by client, they 
are simply dropped. This opens possibility of future extension. I guess 
that channel 0 is invalid, because it would be understood as _input_ 
channel (for sending data from client to server), though.

Please correct me if I am wrong here...

P.S. Could you please try to not quote large fragments of email which
you do not refer to in your reply, and which are not relevant to given 
post, like the long quoting at the end of your email without any word 
from you? Thanks in advance.
-- 
Jakub Narebski
Poland


* Re: Comments pack protocol description in "Git Community Book" (second round)
  2009-06-06 21:38       ` Comments pack protocol description in "Git Community Book" (second round) Jakub Narebski
  2009-06-06 21:58         ` Scott Chacon
@ 2009-06-07 20:06         ` Shawn O. Pearce
  2009-06-09  9:39           ` Jakub Narebski
  1 sibling, 1 reply; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-07 20:06 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Scott Chacon, git, Junio C Hamano, Andreas Ericsson, Tony Finch,
	Johannes Sixt, Linus Torvalds

Jakub Narebski <jnareb@gmail.com> wrote:
> There are beginnings of description of git pack protocol in section
> "Transfer Protocols"[1][2] of chapter "7. Internals and Plumbing"
> of "Git Community Book" (http://book.git-scm.com).

I'm going to try to clip irrelevant context... but I apologize if
I still quoted too much, there's a lot of text here.

> > ### Fetching Data with Upload Pack ###
...
> Although fetching via SSH protocol is, I guess, much more rare than
> fetching via anonymous unauthenticated git:// protocol,

Actually, fetching via SSH might be quite common, think about all of
those companies using Git internally... they are running something
like Gitosis or Gerrit Code Review, both of which support SSH only
access to the hosted repositories.

> it _might_ be
> good idea to tell there that fetching via SSH differs from above
> sequence that instead of opening TCP connection to port 9418 and
> sending above packet, and later reading from and writing to socket,
> "git clone ssh://myserver.com/srv/git/project.git" calls
> 
> 	ssh myserver.com git-upload-pack /srv/git/project.git
> 
> and later reads from standard output of the above command, and writes
> to standard input of above command.

Yes, this should be mentioned.  We actually should document in
the protocol specification how we fork SSH, and what the SSH server
should then be presenting as the command line.
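
As a starting point for such a description, here is a hedged Python sketch that turns an ssh:// URL into the argument vector shown in the quoted text.  `ssh_upload_pack_argv` is a made-up name, and the exact quoting and option handling real git uses is NOT taken from this thread; the sketch assumes the plain `ssh host git-upload-pack path` form, plus ssh's `-p` option for a port given in the URL.

```python
import shlex
from urllib.parse import urlsplit


def ssh_upload_pack_argv(url):
    """Build an argv for spawning ssh to run git-upload-pack (illustrative only)."""
    parts = urlsplit(url)
    if parts.scheme != "ssh":
        raise ValueError("not an ssh:// URL: %r" % url)
    argv = ["ssh"]
    if parts.port is not None:
        argv += ["-p", str(parts.port)]
    host = parts.hostname
    if parts.username:
        host = parts.username + "@" + host
    # The remote side receives one command string; quote the path for its shell.
    argv += [host, "git-upload-pack " + shlex.quote(parts.path)]
    return argv
```

The client would then read the remote command's standard output and write to its standard input, exactly as it would use the socket for git://.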

I've run into problems with hosting sites like GitHub and Gitorious
not correctly honoring some ssh invokes, because they use the forced
command execution model and were handling only one case that could
be presented to them.

> The rest of exchange is _identical_ for git:// and for ssh:// (and
> I guess also for file:// pseudoprotocol).

Yes, the file:// pseudoprotocol works by forking a child to run the
`git-upload-pack /srv/git/project.git` executable and runs a pair
of pipes between them, just like ssh:// does when it spawns off
the ssh client process.

> I think it would be better to describe packet (chunk) format, called
> pkt-line in git, separately from describing the contents of above
> packet; either first pkt-line then command, or first command then
> pkt-line.

pkt-line is a basic building block, describe it early, before we
describe anything else.

> Footnote: this format somewhat reminds / resembles 'chunked' transfer
> encoding used in HTTP[1], although there are differences.
>   http://en.wikipedia.org/wiki/Chunked_transfer_encoding

This is not worth mentioning.  pkt-line is different enough that
it may just confuse the reader.

> Below there is (for completeness) list of git-upload-pack
> capabilities, with short description of each:
> 
>  * multi_ack (for historical reasons not multi-ack)
...
>    See the thread for more details (posts by Shawn O. Pearce and by
>    Junio C Hamano).

This really needs a diagram example, like the one I drew, to
explain the concept.  It's really hard to grasp from just reading
that paragraph what that implies, especially if you are implementing
a client or a server.

>  * no-progress
> 
>    Client should use it if it was started with "git clone -q" or
>    something, and doesn't want that side brand 2.  We still want

typo, should be "... side band 2." :-)

>    sideband 1 with actual data (packfile), and sideband 3 with error
>    messages.

Also, this capability really only makes sense if side-band or
side-band-64k was requested.  IOW, a sane client wouldn't ask
for this if it doesn't support side-band.

-- 
Shawn.


* Re: Comments pack protocol description in "Git Community Book" (second round)
  2009-06-07  8:21           ` Jakub Narebski
@ 2009-06-07 20:13             ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-07 20:13 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Scott Chacon, git, Junio C Hamano, Andreas Ericsson, Tony Finch,
	Johannes Sixt, Linus Torvalds

Jakub Narebski <jnareb@gmail.com> wrote:
> In description of sideband:
> 
> >  When a sideband is used, 2 means "progress messages, most likely
> >  suitable for stderr". 1 means "pack data". 3 means "fatal error
> >  message, and we're dead now".  No other channels are used or valid.
> 
> it is true that no other channels are used, but it is not true that 
> other channels are invalid. If they are not supported by client, they 
> are simply dropped. This opens possibility of future extension. I guess 
> that channel 0 is invalid, because it would be understood as _input_ 
> channel (for sending data from client to server), though.
> 
> Please correct me if I am wrong here...

An implementation reading a muxed stream SHOULD fail fast if it
encounters a channel number it doesn't understand.

JGit already fails fast with an error if it gets anything not in 1-3.
C Git already fails fast with an error as well.

An implementation writing a muxed stream shouldn't produce a channel
number unless it knows the reader can support it.

To add a new channel number to the supported set, a new capability
should be introduced to the protocol, and enabled if both sides
have agreed to support it.

Currently, stream 0 and stream 4-255 are undefined.  That is,
any new capability could claim that stream and start to use it,
if it needed to.

I think the primary Git contributors would prefer to see new channels
in the 4-255 range, as then 0 can continue to stay invalid... aka
"not true" in C.  Like in the pack type codes, we might want to save
0 for the day when all 1-255 are filled and we need to expand the
channel number range into 2 bytes.  But even then, we could just
do a new side-band-64kv2 capability or something.  :-)
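
The fail-fast reading rule above could look roughly like this in Python.  This is only a sketch, not how JGit or C Git are written: `demux_sideband` is a hypothetical name, and it handles just the muxed part of the response (the NAK/ACK lines before it are not covered).

```python
import io


def demux_sideband(stream):
    """Read muxed pkt-lines until flush; return (pack_data, progress_text)."""
    pack, progress = bytearray(), bytearray()
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("truncated pkt-line header")
        length = int(header, 16)
        if length == 0:  # "0000" flush: end of stream
            return bytes(pack), bytes(progress)
        if length < 5:  # need at least the channel byte
            raise ValueError("sideband packet too short")
        body = stream.read(length - 4)
        if len(body) != length - 4:
            raise EOFError("truncated sideband packet")
        channel, data = body[0], body[1:]
        if channel == 1:
            pack += data
        elif channel == 2:
            progress += data
        elif channel == 3:
            raise RuntimeError("remote error: " + data.decode("utf-8", "replace"))
        else:
            # Channels 0 and 4-255 are undefined: fail fast rather than drop.
            raise ValueError("unsupported sideband channel %d" % channel)
```

A writer would be the mirror image: it never emits a channel outside 1-3 unless a capability both sides agreed on says otherwise.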

-- 
Shawn.


* Re: Comments pack protocol description in "Git Community Book" (second round)
  2009-06-06 21:58         ` Scott Chacon
  2009-06-07  8:21           ` Jakub Narebski
@ 2009-06-07 20:43           ` Shawn O. Pearce
  2009-06-13  9:30           ` Comments pack protocol description in "RFC for the Git Packfile Protocol" (long) Jakub Narebski
  2 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-07 20:43 UTC (permalink / raw)
  To: Scott Chacon
  Cc: Jakub Narebski, git, Junio C Hamano, Andreas Ericsson, Tony Finch,
	Johannes Sixt, Linus Torvalds

Scott Chacon <schacon@gmail.com> wrote:
> In addition to that, I started taking a shot at putting together an
> RFC formatted documentation of this protocol as was requested.
...
> http://git-scm.com/gitserver.txt

SSH is described by RFC 4251 and RFC 4254.  Reference it when you
mention it.

Section 2.2.3 Commit is missing spaces after the parent, author,
committer, encoding headers:

>  parent    = "parent" + sha + \n
>  userinfo  = NAME <EMAIL> TIME
>  author    = "author" + userinfo + \n
>  committer = "committer" + userinfo + \n
>  encoding  = "encoding" + encoding + \n

2.2.4. Tag, same problem.

>   At the end of the
> packfile is a 20-byte SHA1 sum of all the shas in that packfile.

No.  The SHA-1 checksum on the footer of the pack is over all of
the preceding bytes of the pack.
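
In code the check is short.  A minimal Python sketch, with `pack_trailer_ok` as an illustrative name and the whole pack assumed to be in memory (a real reader would hash while streaming):

```python
import hashlib


def pack_trailer_ok(pack: bytes) -> bool:
    """True if the last 20 bytes are the SHA-1 of every byte before them."""
    if len(pack) < 20:
        return False
    body, trailer = pack[:-20], pack[-20:]
    return hashlib.sha1(body).digest() == trailer
```

Note that nothing here looks at object names at all; the trailer is a plain checksum of the byte stream.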

> (B << 4) & A bytes when expanded

No.  (B << 4) | A bytes when expanded.

>  [1 byte]   | 1 | type (3) | size A (4)     |  |- object #3 header
>             +-------------------------------+  |
>  [1 byte]   | 0 | size data B (7)           |  |
>             +-------------------------------+  |
>  [1 byte]   | 0 | size data C (7)           |  |
>             +-------------------------------+ -+
>             | compressed object data        | (C << 11) & (B << 4) & A
>             |                               | bytes when expanded

The B byte has the high bit set (1).  And the length is
(C << 11) | (B << 4) | A.
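
The same rule as a short Python sketch, under the layout shown in the
quoted diagram: the first byte carries a continuation bit, a 3-bit
type and the low 4 size bits, and each further byte carries a
continuation bit and 7 more size bits.  The function name is made up
for illustration:

```python
def parse_object_header(buf: bytes, pos: int = 0):
    """Decode one packed-object header starting at buf[pos].

    Returns (type, expanded_size, position_after_header).
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3-bit object type
    size = byte & 0x0F              # size A: low 4 bits
    shift = 4
    while byte & 0x80:              # high bit set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift   # size B << 4, size C << 11, ...
        shift += 7
    return obj_type, size, pos
```

The size parts are combined with OR, not AND, and every byte except
the last one has its high bit set.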

Also, I found reading that difficult, and it doesn't mention the
OBJ_REF_DELTA or OBJ_OFS_DELTA cases.

You also need to note that the version number in the file header
is currently '2', as described by this RFC.

>    Finally, the trailer records 20-byte SHA1 checksum of the rest of the
>   file.

Like I said above, it's the preceding bytes of the pack.

Section 4.2 Git Protocol, explain the git:// URI first, and then
how a client splits that into the request, and then how it formats
the request.  Don't forget to include an example with a non-standard
port number.

Also document what the standard port number is.

Elsewhere in the document you say 'upload-pack' or 'receive-pack'.
I think you should be saying 'git-upload-pack' or 'git-receive-pack'
everywhere, as these are the formal names in the protocol.

Section 5.2, Capabilities:

>  Client sends space separated list of capabilities it wants.  It
>  SHOULD send a subset of server capabilities, i.e do not send
>  capabilities served does not advertise.  The client SHOULD NOT ask
>  for capabilities the server did not say it supports.

I thought we had said it was client MUST send a subset of server
capabilities; client MUST NOT ask for capabilities server did
not advertise support of.

>  Server MUST ignore capabilities it does not understand.  Server MUST
>  NOT ignore capabilities that client requested and server advertised.

I think that's just lazy coding on the server's part.  If the server
gets a capability request it can't honor, it MUST abort; otherwise it
might corrupt the stream to the client.
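
A small Python sketch of the stricter rules argued for here (client
MUST request a subset; server MUST abort on anything it did not
advertise).  The function names are hypothetical, and this shows the
position taken in this reply, not necessarily what any existing
implementation does:

```python
def choose_capabilities(advertised, wanted):
    """Client side: request only capabilities the server advertised."""
    offered = set(advertised)
    return [cap for cap in wanted if cap in offered]


def check_requested(advertised, requested):
    """Server side: abort if the client asks for an unadvertised capability."""
    offered = set(advertised)
    unknown = [cap for cap in requested if cap not in offered]
    if unknown:
        raise ValueError("capabilities not advertised: " + " ".join(unknown))
    return list(requested)
```

For example, a client that would like 'multi_ack' and 'side-band-64k'
but sees only 'multi_ack' and 'side-band' advertised ends up
requesting just 'multi_ack'.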

> 5.2.1.  multi-ack
>
>  The 'multi-ack' capability allows the server to return "ACK $SHA1

multi_ack

>  Without multi_ack, a client sends have lines in --date-order until
>  the server has found a common base.  That means the client will send

Explain --date-order, don't assume the reader knows it.

I'm giving up for now.  :-)

-- 
Shawn.


* Re: Comments pack protocol description in "Git Community Book" (second round)
  2009-06-07 20:06         ` Comments pack protocol description in "Git Community Book" (second round) Shawn O. Pearce
@ 2009-06-09  9:39           ` Jakub Narebski
  2009-06-09 14:28             ` Shawn O. Pearce
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Narebski @ 2009-06-09  9:39 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Scott Chacon, git, Junio C Hamano, Andreas Ericsson, Tony Finch,
	Johannes Sixt, Linus Torvalds

On Sun, 7 June 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

> > > ### Fetching Data with Upload Pack ###
> ...
> > Although fetching via SSH protocol is, I guess, much more rare than
> > fetching via anonymous unauthenticated git:// protocol,
> 
> Actually, fetching via SSH might be quite common, think about all of
> those companies using Git internally... they are running something
> like Gitosis or Gerrit Code Review, both of which support SSH only
> access to the hosted repositories.

I am blind... I forgot that Git is used not only for F/OSS software,
but also for developing proprietary code on company intranets; there
you have a limited number of people you want to have read access to
the repository.

> 
> > it _might_ be
> > good idea to tell there that fetching via SSH differs from above
> > sequence that instead of opening TCP connection to port 9418 and
> > sending above packet, and later reading from and writing to socket,
> > "git clone ssh://myserver.com/srv/git/project.git" calls
> > 
> > 	ssh myserver.com git-upload-pack /srv/git/project.git
> > 
> > and later reads from standard output of the above command, and writes
> > to standard input of above command.
> 
> Yes, this should be mentioned.  We actually should document in
> the protocol specification how we fork SSH, and what the SSH server
> should then be presenting as the command line.
> 
> I've run into problems with hosting sites like GitHub and Gitorious
> not correctly honoring some ssh invokes, because they use the forced
> command execution model and were handling only one case that could
> be presented to them.

Can you offer some details?  Or is it out of scope of git pack protocol
description, and more about correctly implementing SSH protocol and
remote command invocation in it?

> 
> > The rest of exchange is _identical_ for git:// and for ssh:// (and
> > I guess also for file:// pseudoprotocol).
> 
> Yes, the file:// pseudoprotocol works by forking a child to run the
> `git-upload-pack /srv/git/project.git` executable and runs a pair
> of pipes between them, just like ssh:// does when it spawns off
> the ssh client process.

That would be nice information to have for people (re)implementing Git,
I think.

Sidenote: it will be the same for the planned "smart" HTTP protocol,
except that HTTP is stateless, so some kind of state information
would additionally have to be passed.

> > Footnote: [pkt-line format] somewhat reminds / resembles 'chunked' transfer
> > encoding used in HTTP[1], although there are differences.
> >   http://en.wikipedia.org/wiki/Chunked_transfer_encoding
> 
> This is not worth mentioning.  pkt-line is different enough that
> it may just confuse the reader.

O.K. 

I mentioned it because it also uses hexadecimal for length.

>  
> > Below there is (for completeness) list of git-upload-pack
> > capabilities, with short description of each:
> > 
> >  * multi_ack (for historical reasons not multi-ack)
> ...
> >    See the thread for more details (posts by Shawn O. Pearce and by
> >    Junio C Hamano).
> 
> This really needs a diagram example, like the one I drew, to
> explain the concept.  Its really hard to grasp from just reading
> that paragraph what that implies, especially if you are implementing
> a client or a server.

I don't think one would have to describe the Git object model and the
Git repository storage model here (the storage model, i.e. the loose
object format, the packfile and packfile index formats, and everything
else in .git, should be described in a separate RFC-like document, in
my opinion).  It would be helpful, though, to describe the "history
DAG" model Git uses, and a bit about revision walking.  What use is a
description of the git pack protocol if the idea behind it, namely
coming up with the optimal packfile to send, isn't understood?

> >  * no-progress
> > 
> >    Client should use it if it was started with "git clone -q" or
> >    something, and doesn't want that side brand 2.  We still want
> 
> typo, should be "... side band 2." :-)
> 
> >    sideband 1 with actual data (packfile), and sideband 3 with error
> >    messages.
> 
> Also, this capability really only makes sense if side-band or
> side-band-64k was requested.  IOW, a sane client wouldn't ask
> for this if it doesn't support side-band.

Right. "no-progress" makes sense only in the context of sideband,
currently "side-band" and "side-band-64k".  For the server it means
that it MUST send (currently) only streams 1 (data) and 3 (fatal
error); conversely it MUST NOT send stream 2 (progress).

-- 
Jakub Narebski
Poland


* Re: Comments pack protocol description in "Git Community Book" (second round)
  2009-06-09  9:39           ` Jakub Narebski
@ 2009-06-09 14:28             ` Shawn O. Pearce
  0 siblings, 0 replies; 66+ messages in thread
From: Shawn O. Pearce @ 2009-06-09 14:28 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Scott Chacon, git, Junio C Hamano, Andreas Ericsson, Tony Finch,
	Johannes Sixt, Linus Torvalds

Jakub Narebski <jnareb@gmail.com> wrote:
> On Sun, 7 June 2009, Shawn O. Pearce wrote:
> > 
> > I've run into problems with hosting sites like GitHub and Gitorious
> > not correctly honoring some ssh invokes, because they use the forced
> > command execution model and were handling only one case that could
> > be presented to them.
> 
> Can you offer some details?  Or is it out of scope of git pack protocol
> description, and more about correctly implementing SSH protocol and
> remote command invocation in it?

For URI user@site:project.git the following should all succeed:

  1) ssh user@site "git-receive-pack project.git"
  2) ssh user@site "git receive-pack project.git"

  3) ssh user@site "git-receive-pack 'project.git'"
  4) ssh user@site "git receive-pack 'project.git'"

Note that the command name can be dash or dashless, and the project
name can be wrapped in single quotes, or not wrapped in single
quotes.  C Git tends to create form 3 by default.  JGit also tries
to use form 3 by default, but I've heard some reports from users
saying it produced one of the other forms.

For ssh://user@site/project.git the following should succeed:

  1) ssh user@site "git-receive-pack /project.git"
  2) ssh user@site "git receive-pack /project.git"

  3) ssh user@site "git-receive-pack '/project.git'"
  4) ssh user@site "git receive-pack '/project.git'"

If you are a site like GitHub, where '/' means nothing to you
because all repositories are scoped by user, the extra leading '/'
appears here simply because of the style of URI used, and the '/'
should be dropped before evaluating the path.
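
For a server using the forced command model, accepting all of these
forms could look roughly like the Python sketch below.  The function
name and the 'strip_leading_slash' switch are made up for
illustration, and using shlex to undo the single quotes is a
simplifying assumption about how the command string was quoted:

```python
import shlex

SERVICES = {"git-upload-pack", "git-receive-pack"}


def parse_ssh_command(command: str, strip_leading_slash: bool = True):
    """Split an SSH command string into (service, repository path)."""
    words = shlex.split(command)           # removes the optional single quotes
    if len(words) == 3 and words[0] == "git":
        words = ["git-" + words[1], words[2]]   # dashless -> dashed form
    if len(words) != 2 or words[0] not in SERVICES:
        raise ValueError("unsupported command: %r" % command)
    service, path = words
    if strip_leading_slash and path.startswith("/"):
        path = path[1:]                    # '/' only comes from the ssh:// URI style
    return service, path
```

A site where paths are real filesystem paths would pass
strip_leading_slash=False and keep '/project.git' as it is.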

GitHub had a problem with ssh:// URIs; it's fixed now.  I think
Gitorious had a problem with the command name, but that's fuzzy
in my memory.

-- 
Shawn.


* Re: Comments pack protocol description in "RFC for the Git Packfile Protocol"  (long)
  2009-06-06 21:58         ` Scott Chacon
  2009-06-07  8:21           ` Jakub Narebski
  2009-06-07 20:43           ` Shawn O. Pearce
@ 2009-06-13  9:30           ` Jakub Narebski
  2 siblings, 0 replies; 66+ messages in thread
From: Jakub Narebski @ 2009-06-13  9:30 UTC (permalink / raw)
  To: Scott Chacon
  Cc: Shawn O. Pearce, git, Junio C Hamano, Andreas Ericsson,
	Tony Finch, Johannes Sixt, Linus Torvalds

On Sat, 6 June 2009, Scott Chacon wrote:

> In addition to that, I started taking a shot at putting together an
> RFC formatted documentation of this protocol as was requested.  I may
> have _way_ missed the mark on what you were looking for originally,
> it's hard to say, not having read a lot of RFC documents - I probably
> ended up writing in a more bookish format rather than a technical
> spec, but whatever - maybe you'll find it helpful or can fix it up to
> more what you were expecting.  I'm not done with it - some of it is
> still basically unformatted comments from this previous thread, but at
> least it's laid out roughly how I thought it might be useful and I
> have fleshed out a lot of it.  You can find the RFC text output
> document here:
> 
> http://git-scm.com/gitserver.txt
> 
> And the xml doc I generated it from here:
> 
> http://github.com/schacon/gitserver-rfc

It would be nice to have the RFC text output alongside the XML source
in the gitserver-rfc repository, e.g. following the example of the
'man' and 'html' branches in the git.git repository; that means having
an unrelated 'txt' branch with the text version of the document for
review.  Not everybody has, or wants to install, the tools required to
turn XML into a pretty RFC-like text document... and XML is hard to
read.

I find the plain text output much easier to read, but as things stand
I cannot be sure that I am working on (here: commenting on) the most
recent version of this RFC draft.  See also the comment below about
embedding a version number in the plain text version.

> 
> Perhaps if we're going to spend time getting this all correct, we
> should get a standalone technical doc all agreed upon, then I can
> relatively easily extract what's needed into that chapter of the
> Community book.
> 
> Thoughts?
> 
> Scott

Here are my thoughts and comments about the RFC.  The quoted text
below is from gitserver.txt as of around 2009-06-10, not from Scott
Chacon's email.

Well, I think having _detailed_ technical documentation of the git pack
protocol would help implementers tremendously (whether they are writing
a reimplementation of git, or just a git-server equivalent).  The exact
format is not that important, I think...

------------------------------------------------------------------------
> Internet Engineering Task Force                                S. Chacon
> Internet-Draft                                                    GitHub
> Intended status: Informational                              June 6, 2009
> Expires: December 8, 2009

Why is an expiry date set?  Is it required?  Why this date, by the way?
Is it some mandated/consensus expiry period, or is it the date by which
you plan to revise this draft?

By the way, it would be nice if you tagged the source (e.g. v0.1), and
during conversion from XML to the RFC text format embedded the version
number (as GIT-VERSION-GEN does) somewhere... perhaps in the
pseudo-filename.

>
>                          Git Server Protocol
>                         git-server-protocol-01

About naming: RFC drafts found at IETF have draft-<person>- prefix in
a draft "filename", e.g. 
  draft-mirashi-url-irc-01.txt, 
  draft-ietf-atompub-format-11.txt, 
  draft-templin-autoconf-dhcp-38.txt,
  draft-rfc-editor-rfc2223bis-08.txt,
(I am not sure if the .txt extension should be stated or not).

For example, if the root commit of gitserver-rfc was tagged with v0.1,
then the version 2 commits later could have a pseudo-filename like
this: 'draft-schacon-git-server-protocol-01_2_ge036f1.txt', or
something like that (but the exact version might be put in some other
place, and the filename might then be 'draft-*-01.txt').

Shouldn't it be "Git Pack Protocol" (or "Git Pack Protocol Exchange",
or "Git Packfile Protocol" as you described gitserver-rfc repository)
rather than "Git Server Protocol": the protocol is the same for
file:// URLs (over pipe), ssh:// URLs (over SSH), and git:// (over
socket using git-daemon)?  Although I guess the naming (the name is
the hardest thing... ;-) of this RFC draft could be left for later...

> Status of this Memo
>
>   By submitting this Internet-Draft, each author represents that any
>   applicable patent or other IPR claims of which he or she is aware
>   have been or will be disclosed, and any of which he or she becomes
>   aware will be disclosed, in accordance with Section 6 of BCP 79.
[...]

This boilerplate is fairly short, and I guess it could remain; but
wouldn't it be a good idea (if it is possible) to postpone adding the
larger parts of the boilerplate text, which are required for an RFC
but are not necessary for a draft?

> Abstract
>
>   This documents the Git version control system packfile based server
>   protocol.  It describes expected behaviour of client and server and
>   best current practices to help avoid pitfalls when implementing Git
>   daemon or SSH based servers in other language implementations.  It
>   will describe the data structures underlying Git repositories, how
>   that data is compressed into a packfile and how the contents of that
>   packfile are negotiated and transferred.  This does not cover the
>   HTTP based Git server protocols.

Good abstract.  "Packfile based server protocol" is a good name.  
I think however that the goal of git server/pack protocol exchange
should be stated explicitly: it is coming up with optimal packfile
to send.

I am not sure whether the description of Git repository data structures
(and layout) wouldn't be better left for a separate RFC (or replaced by
a reference to existing documentation).  What we need for the git server
protocol (git pack protocol) is actually a subset of this information,
I think.

"This does not cover the HTTP based Git server protocols." is not
entirely correct.  What is not covered are the "commit walker" based
protocols, like for example HTTP, but also FTP; and other "dumb"
protocols like (deprecated) rsync.  Those "commit walker" and other
"dumb" protocols do not require a Git aware server (but they do require
extra helper info), so they are not strictly "Git _server_
protocols".  (That is splitting hairs a bit.)

[Table of contents]

> 1.  Introduction
>
>   The Git SCM is a snapshot based distributed version control system.
>   Each clone of each repository can synchronize with other nodes if
>   they have read or write access to them.  The two most common
>   protocols that these communications happen over are the custom 'Git'
>   protocol and over SSH.  In both of these cases, the communication
>   happens between the 'send-pack' process on the client side and
>   'receieve-pack' process on the server in the case of pushing changes
>   from the client to the server.  For fetching changes from the server
>   to the client, the 'fetch-pack' process on the client communicates
>   with an 'upload-pack' process on the server.  This document will
>   describe the ways in which these pairs of processes communicate.

What I find lacking is reference to RFC 2119 (which covers meaning of
MUST, etc.) and RFC 5234 (which covers ABNF).  It was present in XML
source, but was subsequently removed.

TAP (Test Anything Protocol) draft uses the following wording:
http://testanything.org/wiki/index.php/TAP_at_IETF:_Draft_Standard

  Conventions Used In This Document
  =================================

  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  document are to be interpreted as described in [RFC2119].

  The grammatical rules in this document are following ABNF and are to
  be interpreted as described in [STD68].

The wording for the RFC 2119 reference is given in that RFC.  For the
reference to the ABNF standard one has to come up with one's own
wording.  Another example would be:

  All the mechanisms specified in this document are described in both
  prose and an augmented Backus-Naur form (ABNF).  It is described in
  detail in [RFC5234].

RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1) has section titled
"1.2 Requirements" with reference to RFC 2119, and the following
paragraph:

  An implementation is not compliant if it fails to satisfy one or more
  of the MUST or REQUIRED level requirements for the protocols it
  implements. An implementation that satisfies all the MUST or REQUIRED
  level and all the SHOULD level requirements for its protocols is said
  to be "unconditionally compliant"; one that satisfies all the MUST
  level requirements but not all the SHOULD level requirements for its
  protocols is said to be "conditionally compliant."

TAP standard draft has there "Conventions Used In This Document"
subsection, with additional reference to ABNF, as stated above.  Other
RFCs (like RFC 3501, Internet Message Access Protocol - Version 4rev1) 
have it specified/referenced in the section where it is used for the
first time, e.g. (from RFC 3501):

  9. Formal Syntax
  ================

  The following syntax specification uses the Augmented Backus-Naur
  Form (ABNF) notation as specified in [ABNF].

What is also missing is section titled "Terminology" or "Definitions",
which can also be put in "Conventions Used in This Document" section,
where we would have to define _at least_ (I think) 'client', 'server',
'connection', 'request', probably 'channel' or 'sideband', 'capability'.

Also, if we are using "C: "/"S: " prefix convention, we should state
it explicitly, e.g. like it is written in RFC 3501 (IMAP):

  In examples, "C:" and "S:" indicate lines sent by the client and
  server respectively.

or like in draft RFC3920bis (XMPP core):

  In examples, lines have been wrapped for improved readability,
  "[...]" means elision, and the following prepended strings are used
  (these prepended strings are not to be sent over the wire):

  o  C: = a client
  o  S: = a server

After the list of conventions there should be an _overview_ of the
protocol, before the details; RFC 2616 (HTTP/1.1) has "Overall
Operation" there, RFC 3501 (IMAP v4) has "Protocol Overview", and the
XMPP draft has an "Overview" section inside the "Architecture" chapter.

> 2.  Git Data
>
>   Git has a relatively simple data format for storing it's objects.
>   There are four different types of objects that Git stores and these
>   make up nearly all the data that is transferred between a Git client
>   and server.  The four object types are the 'blob', the 'tree', the
>   'commit' and the 'tag'.  [...]

On the one hand, I think that definitions of the Git object model and
Git repository structure, file formats and layout should either be put
in a _separate_ RFC, or (for the time being at least) simply refer to
existing documentation (it is only the pack protocol which was lacking
a detailed description).

On the other hand, some of this information is actually required
to understand the git server/pack protocol.  One needs to know about
reachability (of commits) to understand the git pack protocol exchange,
and how to come up with a minimal set of commits; one also has
to know how to interpret the ASCII-art diagrams we use to visualize
history (commit parentage).  One needs to know about trees and blobs
(but not necessarily details about their exact format) and about Git's
assumptions to understand how one should arrive at a minimal set of
objects (in the ordinary case, and in the 'shallow' case).  One needs
to understand what tags are (but not necessarily details about their
format) to know e.g. what the "include-tag" capability is about.  One
needs to understand what a packfile is to know what a thin pack is,
and what the "ofs-delta" capability is about.

[...]

> 2.1.  The SHA-1 ID
>
>   The Git database operates as a key-value store, where each object
>   that is put into the database is given an ID and then can be
>   retrieved from the database by that ID.  The ID is calculated as the
>   SHA-1 checksum of the content being stored plus a small header
>   appended to it of the format: [...]

The fact that a Git repository is a content-addressed object database
using SHA-1 identifiers is fairly important; for the pack protocol
description, though, only the SHA-1 ids of commits and tags matter.

BTW, the header is prepended, not appended.
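
To make the ordering concrete, here is a short Python sketch.  The
header layout used here (object type, a space, the decimal content
length, then a NUL byte) is not spelled out in the excerpt quoted
above; it is filled in from what Git is generally documented to do
and should be read as an assumption, as should the function name:

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash the header followed by the content; the header comes first."""
    # Assumed header layout: "<type> <decimal length>\0"
    header = ("%s %d" % (obj_type, len(content))).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

Hashing content + header instead would give a different ID, so
implementations that get the order wrong cannot interoperate.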

[...]

> 2.2.5.  Git Object Model
>
>   The Git object model then are tags that point to commits, 

And that misunderstanding led to GitHub not supporting tags of blobs
(like e.g. the junio-gpg-pub tag in the git repository) :-).  A tag
(both a tag reference and a tag object) can point to _any_ kind of
object, be it a commit (most common), a tag (chain of trust, retagging
etc.), a blob (like e.g. public GPG keys) or a tree (the first version
in the Linux kernel repository didn't have a commit, so the tree was
tagged).

>   which point to zero or more commits and a single tree, which
>   points to one or more trees and/or blobs.

Sidenote: since the introduction of submodules, trees can point to
commits; what is important for pack protocol considerations is that
those commits do not enter into reachability considerations.

>
>                            +---+         +--+
>                            v   |         v  |
>             +-----+     +--------+     +--------+     +--------+
>             | Tag | --> | Commit | --> |  Tree  | --> |  Blob  |
>             +-----+     +--------+     +--------+     +--------+
>
>                           The Git Object Model
>
>   This creates a directed acyclic graph that can represent the project
>   state at any point.

Because tags can point at any kind of object (though this is rare),
the above diagram is not entirely true.  Also, the diagram seems (to
me at least) to imply that there are loops in the DAG of objects,
while loops are impossible (well, unless one screws up with grafts...)
in Git.

But it would be hard to come up with an ASCII-art diagram of the DAG of
objects in a Git repository; perhaps we can use ABNF to describe it (it
would be used to describe how objects refer to one another, not to
describe some syntax).

[...]

But I do not think that describing the details of the object formats
(loose, packed) is necessary here.  I'd rather leave it to existing
documentation (which describes the Git model, the loose format, and the
pack format with its indexes quite well), or to a separate RFC.

> 2.2.5.1.  The Commit Graph
>
>   Importantly for calculating data needs later on, the commit objects
>   by themselves are also a directed acyclic graph.  [...]

This is very important section, and I think necessary to understand
what git pack protocol is about.  We also introduce here ASCII-art
diagram convention for representing history in examples and
explanations. 

> 2.3.  Git References
>
>   The last major concept in the Git data structure is the reference.  A
>   reference is like a tag that moves.  When users work on a branch in
>   Git, the branch reference that is currently checked out is moved
>   forward to point to each new commit that is created.  So in Git, a
>   branch is really just a pointer to the latest commit on that branch -
>   the rest of the commits are obtained by walking the SHA-1 values one
>   commit at a time.
>
>                              +-- E  <= topic1
>                             /
>                        A -- B -- C -- G  <= master
>                          \
>                           +-- D -- F  <= topic2
>
>                       Commit Graph with References

I don't think that this is a good description.  What's more, such a
level of detail is not necessary; what we have to know is that
references are local symbolic names for objects, usually for commits
in the DAG of revisions.  Perhaps an introduction of HEAD, refs/heads/*
(which must point to commits) and refs/tags/* would also be needed.

[...]

> 3.  Git Packfile Format
>
>   Once the client and the server figure out what objects need to be
>   transferred from one system to another, it will put all of those
>   objects into a "packfile".  This packfile is then streamed from one
>   system to the other.
>
>   The packfile itself is a very simple format.  There is a header, a
>   series of packed objects (each with it's own header and body) and
>   then a checksum trailer.  The first four bytes is the string 'PACK',
>   [...]

I am not sure if a detailed description of the packfile format is really
necessary.  We can always refer to Documentation/technical/pack-format.txt
and Documentation/technical/pack-heuristics.txt

By the way, if we are to describe details of the packfile format,
perhaps we should use the notation described in section "5.2. Protocol
Data Definitions" of the "RFC Style Guide"[1], which is based on
RFC 791?

[1]: http://www.rfc-editor.org/rfc-style-guide/rfc-style-manual-08.txt

> 3.1.  Deltified Objects
>
>   There are two object types that are new here - the delta object
>   types.  These are object data that are deltas of existing objects,
>   saving space in the storage.  The instance that creates the packfile
>   determines which objects it wants to deltify, if any, in order to
>   save space.  It is possible to send packfiles with no delta objects
>   in it, though it often saves quite a bit of space.
>
>   For the two delta object representations, the data portion contains
>   something that identifies which base object this delta representation
>   depends on, and then the delta to apply on the base object to
>   resurrect this object.
>
>   REF_DELTA uses 20-byte hash of the base object at the beginning of
>   data, while OFS_DELTA stores an offset within the same packfile to
>   identify the base object.  In either case, two important constraints
>   a reimplementor must adhere to are:
>   [...]

REF_DELTA vs OFS_DELTA is required to understand "ofs-delta"
capability; delta object and delta base is required to understand thin
packs.  But do we need more?

.......................................................................
Most of my comments concern the following chapter.  I haven't
examined the previous chapters in as much detail.

> 4.  Protocols
>
>   There are two transports over which the packfile protocol is
>   initiated.  The Git protocol is a simple, unauthenticated server that
>   simply takes the command (almost always 'upload-pack', though Git
>   servers can be configured to be globally writable, in which 'receive-
>   pack' initiation is also allowed) with which the client wishes to
>   communicate and executes it and connects it to the requesting
>   process.  The other transport is the SSH protocol, in which the
>   client basically just runs the 'upload-pack' or 'receive-pack'
>   process over the SSH protocol.

Sidenote: Actually there are three transports over which the packfile
protocol is initiated.  These are:
1.) transport over a TCP socket, with the git-daemon server being a thin
    wrapper around a whitelist of allowed commands, which uses
    git://git.example.com/repo.git URLs,
2.) transport over SSH, where the client runs the 'upload-pack' or
    'receive-pack' process over the SSH protocol,
3.) transport over a pipe, where the client runs the 'upload-pack' or
    'receive-pack' process locally, file://git.example.com/repo.git
    (pseudo)protocol URL.
We do not want to close off the possibility of other transports, like
JGit's amazon-s3:// protocol.

Sidenote: The set of commands which can be run via the git protocol (via
TCP socket) from git-daemon is not limited to 'upload-pack';
'upload-archive' can be enabled as well.

> 4.1.  Packet Line Format
>
>   Some data transmission in Git is done in what is called 'packet-line'
>   format.  

Should we mention that it is called pkt-line or pkt_line in sources
and in other documentation?

>           This is where each line of data sent is prepended with the
>   four byte hex encoded length of the rest of the payload being sent.
                                        ^^^^^^^
                                              \--- this is wrong!

The four hexadecimal characters give the length of the whole packet
line, including the length prefix itself, and not only of the payload.

>   This way the side receiving data can read 4 bytes and then know how
>   much more data is coming in that request.

Therefore it tells the receiver "how much data", not "how much more
data": the 4 bytes already read are included in the count.

>
>   pkt-length = 4HEXDIGIT   ; length of pkt-payload
>   pkt-line   = pkt-length pkt-payload [ LF / CR ]

This is wrong.  I would use the following top-down definition:

    pkt-line   = pkt-length pkt-payload [ LF / CR / NUL ]
    pkt-length = 4HEXDIG     ; length of pkt-line

where

    NUL = %x00               ; \0

Perhaps even

    pkt-payload = *OCTET     ; data

OCTET, HEXDIG and LF are defined in "Core rules" (Appendix B) of the
ABNF standard.  I think that _any_ terminator is allowed, but we should
use only LF (CR and NUL are used in some special cases); I am not
convinced that we should state explicitly that there is, or can be, a
terminator.

This definition does not cover special case (which should be
described) of

    pkt-flush = "0000"
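
A rough Python sketch of pkt-line framing as defined above, where the
four hex digits count the whole line including themselves and "0000"
is the flush packet.  The function names are made up for
illustration, and yielding None for a flush is just one possible
convention:

```python
FLUSH_PKT = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload; the length covers the 4-byte prefix plus the payload."""
    length = len(payload) + 4
    if length > 0xFFFF:
        raise ValueError("payload too long for a four-digit hex length")
    return ("%04x" % length).encode("ascii") + payload


def read_pkt_lines(data: bytes):
    """Yield each payload in turn; a flush packet is yielded as None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4].decode("ascii"), 16)  # accepts mixed case
        if length == 0:
            yield None
            pos += 4
        elif length < 4:
            raise ValueError("invalid pkt-line length %d" % length)
        else:
            yield data[pos + 4:pos + length]
            pos += length
```

Any trailing LF is treated here as part of the payload, so the caller
decides whether to strip it.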

>
>   In some cases Git will use a sideband packet-line format, where each
>   line is transmitted with the hex length prepended, followed by the
>   sideband channel (one byte) that the data is meant for, followed by
>   the actual data.
>
>   pkt-length   = 4HEXDIGIT   ; length of pkt-sb-payload
>   sideband-ch  = %d01-%d03
>   pkt-line-sb  = pkt-length sideband-ch pkt-payload [LF/CR]

This also contains some errors.

    pkt-line-sb  = pkt-length sideband-ch pkt-payload [LF/CR]
    pkt-length   = 4HEXDIG     ; length of pkt-line-sb
    sideband-ch  = %d01-%d03

I don't think that there is any situation where we use NUL as the
terminator in the sideband packet-line format.  Also, the CR terminator
appears only in the sideband packet-line format on channel 2,
"progress messages", doesn't it?
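
As an illustration, a receiver could demultiplex sideband payloads
along these lines in Python.  The channel meanings (1 = pack data,
2 = progress, 3 = fatal error) are the ones given in the draft; the
names and the choice to raise an exception on channel 3 are
assumptions made for this sketch.  Each input item is the pkt-line
content after the 4-byte length has been removed:

```python
PACK_DATA, PROGRESS, FATAL = 1, 2, 3


class RemoteError(Exception):
    """Raised when the server reports a fatal error on channel 3."""


def demux_sideband(payloads):
    """Split sideband payloads into (pack bytes, list of progress messages)."""
    pack = bytearray()
    progress = []
    for payload in payloads:
        if not payload:
            raise ValueError("sideband packet without a channel byte")
        channel, data = payload[0], payload[1:]
        if channel == PACK_DATA:
            pack += data
        elif channel == PROGRESS:
            progress.append(data.decode("utf-8", "replace"))  # may end in CR
        elif channel == FATAL:
            raise RemoteError(data.decode("utf-8", "replace"))
        else:
            raise ValueError("invalid sideband channel %d" % channel)
    return bytes(pack), progress
```

With "no-progress" requested, a conforming server would simply never
produce channel 2 packets, and the progress list stays empty.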

>
>   When a sideband is used, 2 means "progress messages, most likely
>   suitable for stderr". 1 means "pack data". 3 means "fatal error
>   message, and we're dead now".  No other channels are used or valid.
>
>   For the hex encoding, client and server SHOULD use lowercase, but
>   MUST accept mixed case (do case insensitive parsing of hex4).

Here, as you can see, we make use of RFC 2119, so we should reference
it somewhere at the beginning of RFC, as I have stated way above.

>
> 4.2.  Git Protocol
>
>   The Git protocol starts off by sending "git-receive-pack 'repo.git'"
>   on the wire using the pkt-line format, followed by a null byte and a
>   hostname parameter, terminated by a null byte.
>
>           0032git-upload-pack /project.git\0host=myserver.com\0

You didn't mention that the above is only an example.  You didn't provide
an [example] invocation that results in such a request.  Also it
would be better to use 'example.com' or 'git.example.com' as hostname,
see RFC 2606 ("Reserved Top-Level DNS Names" [TLD99]), to avoid
accidental conflicts.

We can use ABNF to encode it

    git-proto-request = request-command SP pathname NUL [ host-parameter NUL ]
    request-command   = 'git-upload-pack' / 'git-receive-pack' /
                        'git-upload-archive'   ; case sensitive
    pathname          = *( %x01-ff ) ; exclude NUL
    host-parameter    = 'host' "=" hostname [ ":" port ]    

One should probably take a look at RFC for URLs/URIs for definitions
pertaining to host, pathname, URL, absolute and relative, etc.

Here, for simplicity, I used an extension to ABNF: all characters
inside single quotes (for example 'host' above) are case sensitive,
as opposed to ABNF strings like "A" or "cmd", which are case
insensitive (in US-ASCII).

Otherwise one would have to specify 'host' as e.g. %x68.6f.73.74
or %d104.111.115.116, and similar for other cases.
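
A minimal Python sketch of the git-proto-request rule above, wrapped
in a pkt-line.  The function name git_proto_request is made up for
illustration, and the sketch assumes that pkt-length counts its own
four bytes and that hostname (with optional port) is passed in as one
ready-made string.

```python
from typing import Optional

COMMANDS = ("git-upload-pack", "git-receive-pack", "git-upload-archive")


def git_proto_request(command: str, pathname: str,
                      host: Optional[str] = None) -> bytes:
    """Build request-command SP pathname NUL [ host-parameter NUL ]."""
    if command not in COMMANDS:  # the comparison is case sensitive
        raise ValueError("unknown request-command: %r" % command)
    if "\0" in pathname:
        raise ValueError("pathname must not contain NUL")
    body = command.encode("ascii") + b" " + pathname.encode("utf-8") + b"\0"
    if host is not None:
        body += b"host=" + host.encode("ascii") + b"\0"
    return b"%04x" % (len(body) + 4) + body


# The numeric spelling of 'host' mentioned above is the same four bytes.
HOST_LITERAL = bytes([0x68, 0x6F, 0x73, 0x74])
```

For example, git_proto_request("git-upload-pack", "/project.git",
"git.example.com") returns
b"0036git-upload-pack /project.git\0host=git.example.com\0".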

>
>   Currently only 'host' is supported in the extra information.  It's
>   for the git-daemon name based virtual hosting.  See --interpolated-
>   path option to git daemon, with the %H/%CH format characters.

Actually, from the discussion on the git mailing list (which took place
after you wrote the above), 'host' is the only extra information that is
ALLOWED, not merely the only one supported (hence my definition).

>
>   Basically what the Git client is doing to connect to an 'upload-pack'
>   process on the server side over the Git protocol is this:
>
>     $ echo -e -n \
>       "0039git-upload-pack /schacon/gitbook.git\0host=github.com\0" |
>       nc -v github.com 9418

It is nice to have an example here, but in an RFC you should not, I think,
use real-life URLs, as per "Instructions to RFC Authors" (The use of URLs
in RFCs is discouraged, because many URLs are not stable references.)

>
> 4.3.  SSH Protocol
>
>   Initiating the upload-pack or receive-pack processes over SSH is
>   simply executing the binary on the server via SSH remote execution.
>   It is basically equivalent to running this:
>
>            $ ssh git.example.com 'git-upload-pack /project.git'
>

This of course depends on the command used, whether it uses URL-like
request, or scp-like / ssh-like request (possibly with relative path).
But you write about this below.
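
The difference between the two address forms can be sketched in
Python as follows.  This is only an illustration of the mapping shown
in the draft's examples; the function name ssh_invocation is made up,
a port number in the URL is not handled, and no shell quoting of the
path is attempted.

```python
from urllib.parse import urlsplit


def ssh_invocation(address: str, command: str = "git-upload-pack"):
    """Return (ssh destination, remote command) for the two address forms."""
    if address.startswith("ssh://"):
        # URL-like form: the path keeps its leading '/', so it is absolute.
        parts = urlsplit(address)
        return parts.netloc, "%s %s" % (command, parts.path)
    # scp-like form "user@host:path": the path is passed as is, so it is
    # relative to the user's home directory unless it starts with '/'.
    destination, sep, path = address.partition(":")
    if not sep:
        raise ValueError("not an ssh:// URL or user@host:path address")
    return destination, "%s %s" % (command, path)
```

With this, "ssh://user@example.com/project.git" maps to the command
"git-upload-pack /project.git", while "user@example.com:project.git"
maps to "git-upload-pack project.git".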

>   For a server to support Git pushing and pulling for a given user over
>   SSH, that user needs to be able to execute one or both of those
>   commands via the SSH shell that they are provided on login.  On some
>   systems, that shell access is limited to only being able to run those
>   two commands, or even just one of them.
>
>   In an ssh:// format URI, it's absolute in the URI, so the '/' after
>   the host name (or port number) is sent as an argument, which is then
>   read by the remote git-upload-pack exactly as is, so it's effectively
>   an absolute path in the remote filesystem.
>
>                git clone ssh://user@example.com/project.git
>                                  |
>                                  v
>             ssh user@example.com 'git-upload-pack /project.git'
>
>   In a "user@host:path" format URI, its relative to the user's home
>   directory, because the Git client will run:
>
>                   git clone user@example.com:project.git
>                                    |
>                                    v
>             ssh user@example.com 'git-upload-pack project.git'
>
>
> 5.  Fetching Data From a Server
>
>   When one Git repository wants to get all the data that a second
>   repository has, the first can 'fetch' from the second.  This
>   operation determines what data the server has that the client does
>   not then streams that data down to the client in packfile format.

I know this is only a draft, but this paragraph could have been
written better...

>
>   The server side binary needs to be executable as 'git-upload-pack'
>   for fetching over SSH, since the Git clients will connect to the
>   server and attempt to run that.

Why do you single out SSH protocol here?

(Yes, I know it is a draft...)

>
>   The basic communication structure looks like this:
>
>    # Tell the client current branch heads and the last commit on each
>    S: SHA1 refname
>    S: ...
>    S: SHA1 refname
>    S: # flush -- it's your turn
>    # Tell the server what commits we want, and what we have
>    C: want name
>    C: ..
>    C: want name
>    C: have SHA1
>    C: have SHA1
>    C: ...
>    C: # flush -- occasionally ask "had enough?"
>    S: NAK          # nope, keep sending 'have's
>    C: have SHA1
>    C: ...
>    C: have SHA1
>    S: ACK
>    C: done
>    S: XXXXXXX -- packfile contents.

This is perhaps too detailed for the "Overview" section (to be
introduced)... or perhaps it is not.

Sidenote: we should standardize on SHA1 or SHA-1 throughout the document.

>
> 5.1.  Initial Server Response
>
>   When the client initially connects, whether over the SSH or Git
>   transports, the server will immediately respond with a listing of
>   each reference it has (all branches and tags) along with the commit
>   SHA that each reference currently points to.
>
>   $ echo -e -n \
>     "0039git-upload-pack /schacon/gitbook.git\0host=github.com\0" |
>      nc -v github.com 9418

See the comment about using URLs in RFC.

>   Connection to github.com 9418 port [tcp/*] succeeded!

This is a message from 'nc' (to be more exact, from 'nc -v'), not from
the git server.

>   00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack \
>     thin-pack side-band side-band-64k ofs-delta shallow no-progress \
>     include-tag

Here we should point out that we line-wrap lines for convenience, and that
we mark such wrapping with "\" as the last character (line
continuation)... which would not be necessary if we used the "C: "/"S: "
prefix convention and the convention about line-wrapping here (see the
note about the "Conventions" chapter way above).

Also I think it would be better to specify the LF terminator explicitly,
e.g. as "\n", as it is not present for the "0000" pkt-flush line.

>   00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration
>   003f7217a7c7e582c46cec22a130adf4b9d7d950fba0 refs/heads/master
>   003cb88d2441cac0977faf98efc80305012112238d9d refs/tags/v0.9
>   003c525128480b96c89e6418b1e40909bf6c5b2d580f refs/tags/v1.0
>   003fe92df48743b7bc7d26bcaabfddde0a1e20cae47c refs/tags/v1.0^{}
>   0000

Should we explain what "refs/tags/v1.0^{}" means, or should this be
left for later (or earlier, perhaps in separate subsection)?

>
>   Each line is terminated by a "\n" by convention only, which is
>   included in the 4 byte length declaration.  If a newline does not
>   terminate the line, the client should not complain.

I think it should read: server SHOULD terminate each non-flush line
with an LF ("\n") terminator; client MUST NOT complain if there is no
terminator (is that right?).

>
>   The exception is the flush line.  A length of "0000" means its a
>   flush packet, which has no data payload.  An "\n" after the "0000"
>   would break the protocol as the server would read that "\n" in a
>   context where it is expecting another pkt-line length declaration.
>   "\n" is not a hex digit, so "0000\n" is horribly horribly broken.

I know it is copy'n'paste from an existing discussion (an existing post),
but wording like the above has no place in an RFC.

>
>   HEAD is not included if its detached - that is, if HEAD is not a
>   symbolic reference, a pointer to another branch, it is not included
>   in the initial server response.

Here we mix levels of abstraction a bit: a protocol, and a set (well,
a sequence) of references (reference info) returned.

By the way, why is a detached HEAD not present?  The server does not know
anything about what refspec the client uses.  (That is not a comment to
Scott, but to the mailing list in general).

>                                    The client pattern matches the
>   advertisements against the fetch refspec, which is "refs/heads/
>   *:refs/remotes/origin/*" by default.  HEAD doesn't match the LHS, so
>   it doesn't get wanted by the client.

In this section we should describe both format of response, and also
(perhaps below "Capabilities" section) ordering and format of
reference info returned in server response.

Format of response could be presented in the following way:

    refs-response = refs-line [ NUL capabilities ] [ LF ]
                  *(refs-line [ LF ] )
                    pkt-flush
    refs-line     = pkt-length sha1-str SP refname
    sha1-str      = 40HEXDIG
    capabilities  = capability *(SP capability)

Server and client SHOULD use lowercase for SHA1, both MUST treat SHA1
as case-insensitive.  (But I think you are writing about this later.)
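
A minimal Python sketch of a reader for the refs-response format
above.  The function name parse_refs_response is made up for
illustration; the sketch assumes the whole response up to and
including the pkt-flush is already in memory, and that pkt-length
counts its own four bytes plus any LF terminator.

```python
def parse_refs_response(data: bytes):
    """Return (refs, capabilities) from a refs-response ending in pkt-flush."""
    refs = []
    capabilities = []
    pos = 0
    while True:
        header = data[pos:pos + 4]
        if len(header) < 4:
            raise ValueError("response ended without a pkt-flush")
        length = int(header, 16)
        if length == 0:  # pkt-flush ends the advertisement
            break
        line = data[pos + 4:pos + length]
        pos += length
        if line.endswith(b"\n"):  # the LF terminator is optional
            line = line[:-1]
        if not refs and b"\0" in line:  # capabilities ride on the first line
            line, caps = line.split(b"\0", 1)
            capabilities = caps.decode("ascii").split(" ")
        sha1, refname = line.split(b" ", 1)
        if len(sha1) != 40:
            raise ValueError("expected a full 40-digit SHA-1")
        # Accept mixed case, normalize to lowercase.
        refs.append((sha1.decode("ascii").lower(), refname.decode("utf-8")))
    return refs, capabilities
```

The result is a list of (sha1, refname) pairs in the order the server
sent them, plus the list of advertised capabilities.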

>
> 5.2.  Capabilities

Sidenote: IMAP has a list of capabilities (usually defined in separate
"upgrade" RFCs) maintained by IANA[3].  Perhaps Git should also have
such a list of capabilities, in case some new capabilities get
invented (e.g. symbolic references transfer, or multipack transport)?

[3]: http://www.iana.org/assignments/imap4-capabilities

>
>   On the very first line of the initial server response, the first
>   reference is followed by a null byte and then a list of space
>   delimited server capabilities.  These allow the server to declare
>   what it can and cannot do to the client.

"These allow the server to declare to the client what it can and
cannot do" perhaps?

>
>   Client sends space separated list of capabilities it wants.  It
>   SHOULD send a subset of server capabilities, i.e do not send
>   capabilities served does not advertise.  The client SHOULD NOT ask
>   for capabilities the server did not say it supports.
>
>   Server MUST ignore capabilities it does not understand.  Server MUST
>   NOT ignore capabilities that client requested and server advertised.

There was discussion about exact meaning of capabilities, and silently
ignoring unknown capabilities... but if I remember correctly you are
right here.

>
> 5.2.1.  multi-ack

multi_ack (historical reason).

>
>   The 'multi-ack' capability allows the server to return "ACK $SHA1
>   continue" as soon as it finds a commit that it can use as a common
>   base, between the client's wants and the client's have set.
[...]

Good explanation of this capability will require, I think, much
explanation and many ASCII-art diagrams.  Perhaps description of
capabilities should be moved to appendix?

> 5.2.2.  thin-pack
>
>   Server can send thin packs, i.e. packs which do not contain base
>   elements, if those base elements are available on clients side.
>   Client has thin-pack capability when it understand how to "thicken"
>   them adding required delta bases making them independent.

A client doesn't "have" a capability; a client "requests" a capability.
But a client MUST NOT (I think) request the 'thin-pack' capability if it
cannot turn thin packs into proper independent packs.

>
>   Of course it doesn't make sense for client to use (request) this
>   capability for git-clone.

In further discussion this turned out not to be true.  Cloning with the
'--reference' option can make use of thin packs.  Besides, a client that
understands 'thin-pack' can request it even for an initial git-clone, as
it is simply ignored when there are no common commits (no bases and base
objects to exclude from the pack).

>
> 5.2.3.  side-band, side-band-64k
>
>   This means that server can send, and client understand multiplexed
>   (muxed) progress reports and error info interleaved with the packfile
>   itself.

[...]
>   The client MUST send only maximum of one of "side-band" and "side-
>   band-64k".  Server MUST favor side-band-64k if client requests both.

One should check further discussion about this...

[cut fragment for which I have no comments currently]

The question is: should we list all capabilities with detailed
description as usual in reference documentation like RFC (draft), or
should we move detailed description of current set of allowed
capabilities to the appendix?  (Note also that set of possible
capabilities is different for different commands; it is different for
git-upload-pack and for git-receive-pack.)

>
> 5.3.  Client Response
>
>   Once the client has the initial list of references that the server
>   has, as well as the list of capabilities, it will begin telling the
>   server what objects it wants and what objects it has, so the server
>   can make a packfile that only has the objects that the client needs.
>   The client will also send a list of the capabilities it supports out
>   of what the server said it could do.
>

In the example below you use the "C: " prefix convention, and explicitly
mark the end of line character ("\n" = LF) terminating 'packets'.  In the
final version we should, I think, standardize on one convention.

>   C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d\0multi_ack \
>     side-band-64k ofs-delta\n

Actually, the client here simply uses SP (' ') to separate the 'want' line
data from the list of requested capabilities, not NUL ("\0").
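
A minimal Python sketch of how such 'want' lines could be produced,
with the capabilities appended to the first line after a single SP.
The function name want_lines is made up for illustration.

```python
def want_lines(sha1s, capabilities=()):
    """Encode 'want' pkt-lines; capabilities go on the first line after SP."""
    out = []
    for index, sha1 in enumerate(sha1s):
        payload = "want " + sha1
        if index == 0 and capabilities:
            payload += " " + " ".join(capabilities)  # SP, not NUL
        data = payload.encode("ascii") + b"\n"
        out.append(b"%04x" % (len(data) + 4) + data)
    return out
```

For the first line of the example above this yields
b"0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack
side-band-64k ofs-delta\n" (one line on the wire); the length prefix
is unchanged because SP and NUL are both a single byte.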

>   C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
>   C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
>   C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
>   C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n
>   C: 0000
>   C: 0009done\n
>
>   S: 0008NAK\n
>   S: 0023\002Counting objects: 2797, done.\n
>   [...]
>   S: 2004\001PACK\000\000\000\002 [...]
>
>   It means the server is answering a prior flush from the client, and
>   is saying "I still can't serve you, keep tell me more have".

It is not clear that "it" here refers to the NAK response from the server.

Here we have the following:

    client-cmd-set = want-cmd SP sha1-str [ SP capabilities ] [ LF ]
                   *(want-cmd SP sha1-str [ LF ] / pkt-flush )
                   *(have-cmd SP sha1-str [ LF ] / pkt-flush )
                     pkt-flush 
                     done-cmd
    want-cmd = 'want'
    have-cmd = 'have'
    done-cmd = 'done'

But it is a bit hard to express the exchange using only the ABNF
formalism.  You would have to understand and describe when a flush
packet can be sent, when it should be sent, and how to write to
and read from the socket so as not to deadlock (block).
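
To show only the ordering that client-cmd-set allows, here is a
minimal Python sketch that builds the client's packets.  The names
pkt, FLUSH and client_commands are made up, and the batch size of 32
'have' lines per flush is an assumption for illustration only.  The
sketch deliberately says nothing about when the client should read
from the socket, which is exactly the part ABNF cannot express.

```python
FLUSH = b"0000"


def pkt(text: str) -> bytes:
    """Encode one LF-terminated pkt-line."""
    data = text.encode("ascii") + b"\n"
    return b"%04x" % (len(data) + 4) + data


def client_commands(wants, haves, capabilities=(), batch=32):
    """Return the packets of a client-cmd-set, in sending order."""
    packets = []
    for index, sha1 in enumerate(wants):
        caps = ""
        if index == 0 and capabilities:
            caps = " " + " ".join(capabilities)
        packets.append(pkt("want " + sha1 + caps))
    packets.append(FLUSH)  # end of the want list
    for start in range(0, len(haves), batch):
        for sha1 in haves[start:start + batch]:
            packets.append(pkt("have " + sha1))
        packets.append(FLUSH)  # one flush after each batch of haves
    packets.append(pkt("done"))
    return packets
```

With no 'have' lines at all (an initial clone) the result is the want
lines, "0000" and "0009done\n", as in the example above.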

Also this is not the _complete_ list of commands.  See upload-pack.c and
builtin-fetch-pack.c, where the "shallow", "unshallow" and
"deepen" commands are also listed (from shallow clone / fetch).  But perhaps
we should describe extra commands with the description of capabilities;
alternatively we can have here a subsection for each capability that
allows new client commands ("shallow", "unshallow" and "deepen" for the
"shallow" capability) and new server responses ("ACK %s continue" for the
"multi_ack" capability).

In the full description you would have to describe not only "NAK"
server response, but also "ACK %s" (and "ACK %s continue" for
"multi_ack" capability).

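A minimal Python sketch of how a client might classify these three
server responses.  The function name parse_negotiation_reply and the
shape of the returned tuple are made up for illustration; the input is
assumed to be the payload of one pkt-line, with or without its LF.

```python
def parse_negotiation_reply(payload: bytes):
    """Return ("NAK", None, False) or ("ACK", sha1, is_continue)."""
    text = payload.decode("ascii").rstrip("\n")
    if text == "NAK":
        return ("NAK", None, False)
    fields = text.split(" ")
    if fields[0] == "ACK" and len(fields) >= 2 and len(fields[1]) == 40:
        if len(fields) == 2:
            return ("ACK", fields[1], False)
        if len(fields) == 3 and fields[2] == "continue":  # multi_ack only
            return ("ACK", fields[1], True)
    raise ValueError("unexpected server reply: %r" % text)
```

The comparison is case sensitive on purpose: "ACK" and "NAK" are
matched in upper case and "continue" in lower case.
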
>
>   I have thought that after sending "0000" flush line client can wait
>   for NAK or ACK server response... but it is not the case.  When I
>   tried to read from server after "0000" flush and before "0009done\n",
>   my client (or netcat instance) deadlocked (hung) waiting for server
>   response.  I either did a mistake in my fake client, or I don't
>   understand git pack protocol correctly.  Should client wait for NAK
>   or ACK from server _only_ after sending maximum number of want/have
>   lines (256 if I remember correctly?)?  Yes. It means the client will
>   not issue any more "have" lines, as it has nothing further in its
>   history, so the server just has to give up and start generating a
>   pack based on what it knows.  After the client receives a "ACK" or
>   "NAK" for the number of outstanding flushes it still has, *after* it
>   has sent "done".  This also varies based on whether or not multi_ack
>   was enabled.  Its ugly.  But basically you keep a running counter of
>   each "flush" sent, and then you send a "done" out, and then you wait
>   until you have the right number of ACK/NAK answers back, and then the
>   stream changes format.

This certainly requires cleanup, as it is a simple dump of fragments of
a conversation on the git mailing list.

>
>   > Should commands such as "have", "want", "done" use lower case or >
>   be case insensitive?  These MUST be lowercase. > Should status
>   indicators "ACK" and "NAK" be upper case, These MUST be uppercase.
>   Though "ACK %s continue" MUST be mixed case, as I just wrote it. >
>   Should capabilities be case sensitive, and should they be > compared
>   case sensitive or not?  No, they are case sensitive.

This also requires cleanup, even more so.

>
>   One thing that I did not see mentioned in this thread is that the
>   implementation is allowed to buffer non-flush packets and send
>   multiple of them out with a single write(2).  In other words,
>   packet_write() could buffer instead of directly calling safe_write(),
>   while packet_flush() must do safe_write() and make sure it drains. -
>   junio That's one reason why in JGit I call the flush packet of "0000"
>   end(), and flush() triggers the drain.  JGit buffers everything its
>   writing, but only by one standard "have" window IIRC.  JGit server
>   code triggers a flush() after side-band channel 2 packet ends, but
>   not an end(), because we only want to drain to the network, not
>   inject a bad "0000" packet in the stream.

These implementation details should probably go later, if they are to
be in this RFC at all...

>
>   0023\\002Counting objects: 2797, done.\n
>   002b\\002Compressing objects:   0% (1/1177)   \r
>   002c\\002Compressing objects:   1% (12/1177)   \r
>   002c\\002Compressing objects:   2% (24/1177)   \r
>   0053\\002Compressing objects:   7% (83/1177)   \r \
>           Compressing objects:   8% (95/1177)   \r
>   2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225
>       \\017x\\234\\235\\216K\n\\302"...
>   2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...
>   ...
>   0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"

The server response should also be described (as far as possible) using
ABNF for syntax.  The server responds with NAK or ACK if it has enough
information to generate the full set of common commits, and responds with
the packfile and sideband data (progress info and fatal errors) to the
'done' command from the client.  There should be, I think, a list of
possible error conditions, and also more details about coming up with
the set of common objects, and the stop condition(s).

Also one has to note that sideband data is sent only if the "side-band" or
"side-band-64k" capability is requested by the client (and the server
supports it), and that the maximum length of a pkt-line for the older
"side-band" is 1000 bytes (max length is "03e8").  And that when the client
requests "no-progress" no sideband channel 2 info is sent (but the git
server still uses channel 1 to send the packfile).
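
A minimal Python sketch of the size limit just mentioned: splitting
data for one channel into sideband pkt-lines that never exceed a given
total length.  The function name sideband_packets is made up; the
default of 1000 is the "side-band" limit stated above, and a larger
limit would be passed in for "side-band-64k".

```python
def sideband_packets(channel: int, data: bytes, max_pkt_len: int = 1000):
    """Split data into sideband pkt-lines no longer than max_pkt_len bytes."""
    if channel not in (1, 2, 3):
        raise ValueError("invalid sideband channel %d" % channel)
    chunk = max_pkt_len - 5  # room left after 4 length digits + channel byte
    packets = []
    for start in range(0, len(data), chunk):
        piece = data[start:start + chunk]
        packets.append(b"%04x" % (len(piece) + 5) + bytes([channel]) + piece)
    return packets
```

For 2000 bytes of pack data on channel 1 this gives three packets, the
first two starting with "03e8" followed by the channel byte.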

>
>   Buffering.  There are two processes running on the server side, git-
>   pack-objects is producing these messages on its stderr, and the pack
>   data on stdout.  Both are actually a pipe read by git-upload-pack in
>   a select loop.  If pack-objects can write two messages into the pipe
>   buffer before upload-pack is woken to read them out, upload-pack
>   might find two (or more) messages ready to read without blocking.
>   These get bundled into a single packet, because, why not, its easier
>   to code it that way.  Its most common on the end like that, where we
>   dump 100%, and then immediately add the ", done" and start a new
>   progress meter.  Its less likely in the middle, where we try to space
>   out the progress updates to around 1 per second, or 1 per percentage
>   push - determines objects in DAG(C) not in DAG(S) and transfers them
>   via packfile

Those are implementation details; some of this info should be in the
RFC as precautions (against deadlocking).  Such info would be
interesting, as e.g. JGit missed one of the assumptions hidden in C git
and could deadlock due to a smaller buffer size (or something) in Java...

[removed part about pushing]

Note that for pushing 1.) the set of possible server capabilities is
different, 2.) capabilities are presented in another way (with SP and
not NUL as separator).

-- 
Jakub Narebski
Poland

Thread overview: 66+ messages
2009-05-12 21:29 Request for detailed documentation of git pack protocol Jakub Narebski
2009-05-12 23:34 ` Shawn O. Pearce
2009-05-14  8:24   ` Jakub Narebski
2009-05-14 14:57     ` Shawn O. Pearce
2009-05-14 15:02       ` Andreas Ericsson
2009-05-15 20:29         ` Linus Torvalds
2009-05-15 16:51       ` Clemens Buchacher
2009-05-14 18:13     ` Nicolas Pitre
2009-05-14 20:27       ` Jakub Narebski
2009-05-14 13:55   ` Scott Chacon
2009-05-14 14:44     ` Shawn O. Pearce
2009-05-14 15:01     ` Jakub Narebski
2009-05-15  0:58       ` A Large Angry SCM
2009-05-15 19:05         ` Ealdwulf Wuffinga
2009-06-02 21:39     ` Jakub Narebski
2009-06-02 23:27       ` Shawn O. Pearce
2009-06-03  0:50         ` Jakub Narebski
2009-06-03  1:29           ` Shawn O. Pearce
2009-06-03  2:11             ` Junio C Hamano
2009-06-03  2:15               ` Shawn O. Pearce
2009-06-03  9:21             ` Jakub Narebski
2009-06-03 14:48               ` Shawn O. Pearce
2009-06-03 15:07                 ` Shawn O. Pearce
2009-06-03 15:39                   ` Jakub Narebski
2009-06-03 15:50                     ` Shawn O. Pearce
2009-06-03 16:51                 ` Jakub Narebski
2009-06-03 16:56                   ` Shawn O. Pearce
2009-06-03 20:19                     ` Jakub Narebski
2009-06-03 20:24                       ` Shawn O. Pearce
2009-06-03 22:04                         ` Jakub Narebski
2009-06-03 22:04                           ` Shawn O. Pearce
2009-06-03 22:16                           ` Junio C Hamano
2009-06-03 22:46                             ` Jakub Narebski
2009-06-04  7:17                         ` Andreas Ericsson
2009-06-04  7:26                           ` Junio C Hamano
2009-06-06 16:33                     ` Scott Chacon
2009-06-06 17:24                       ` Junio C Hamano
2009-06-06 17:41                       ` Jakub Narebski
2009-06-03 21:38                   ` Tony Finch
2009-06-03 17:11                 ` Junio C Hamano
2009-06-03 19:05                 ` Johannes Sixt
2009-06-03  2:18           ` Robin H. Johnson
2009-06-03 10:47             ` Jakub Narebski
2009-06-03 14:17               ` Shawn O. Pearce
2009-06-03 20:56           ` Tony Finch
2009-06-03 21:20             ` Jakub Narebski
2009-06-03 21:53               ` Tony Finch
2009-06-04  8:45                 ` Jakub Narebski
2009-06-04 11:41                   ` Tony Finch
2009-06-04 18:41                   ` Shawn O. Pearce
2009-06-03 12:29       ` Jakub Narebski
2009-06-03 14:19         ` Shawn O. Pearce
2009-06-04 20:55       ` Jakub Narebski
2009-06-04 21:57         ` Shawn O. Pearce
2009-06-05  0:45         ` Shawn O. Pearce
2009-06-05  7:24           ` Jakub Narebski
2009-06-05  8:45             ` Jakub Narebski
2009-06-06 21:38       ` Comments pack protocol description in "Git Community Book" (second round) Jakub Narebski
2009-06-06 21:58         ` Scott Chacon
2009-06-07  8:21           ` Jakub Narebski
2009-06-07 20:13             ` Shawn O. Pearce
2009-06-07 20:43           ` Shawn O. Pearce
2009-06-13  9:30           ` Comments pack protocol description in "RFC for the Git Packfile Protocol" (long) Jakub Narebski
2009-06-07 20:06         ` Comments pack protocol description in "Git Community Book" (second round) Shawn O. Pearce
2009-06-09  9:39           ` Jakub Narebski
2009-06-09 14:28             ` Shawn O. Pearce
