git@vger.kernel.org mailing list mirror (one of many)
* Exec upload-pack on remote with what parameters to get direntries.
@ 2021-08-28 12:56 Stef Bon
  2021-08-30 19:10 ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Stef Bon @ 2021-08-28 12:56 UTC (permalink / raw)
  To: Git Users

Hi,

I've got a custom ssh library which I use to make a connection to a
git server like www.github.com, user stefbon.

Now I want to get the direntries of a remote repo, and I know I have
to use upload-pack for that, but with what parameters?

I want to use the outcome to make a fuse fs, user can browse the
files. Possibly the user can also view the contents.

Stef


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-28 12:56 Exec upload-pack on remote with what parameters to get direntries Stef Bon
@ 2021-08-30 19:10 ` Jeff King
  2021-08-30 19:43   ` Junio C Hamano
  2021-08-31  6:38   ` Stef Bon
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2021-08-30 19:10 UTC (permalink / raw)
  To: Stef Bon; +Cc: Git Users

On Sat, Aug 28, 2021 at 02:56:17PM +0200, Stef Bon wrote:

> I've got a custom ssh library which I use to make a connection to a
> git server like www.github.com, user stefbon.
> 
> Now I want to get the direntries of a remote repo, and I know I have
> to use upload-pack for that, but with what parameters?
> 
> I want to use the outcome to make a fuse fs, user can browse the
> files. Possibly the user can also view the contents.

The protocol used by upload-pack is described in
Documentation/technical/pack-protocol.txt, but in short: I don't think
it will do what you want.

There is no operation to list the tree contents, for example, nor really
even a good way to fetch a single object. The protocol is geared around
efficiently transferring slices of history, so it is looking at sets of
reachable objects (what the client is asking for, and what it claims to
have).

You might be able to cobble something together with shallow and partial
fetches. E.g., something like:

  git clone --depth 1 --filter=blob:none --single-branch -b $branch

is basically asking to send only a single commit, plus all of its trees,
but no blobs. From there you could parse the tree objects to assemble a
directory listing. Possibly with a tree:depth filter you could even do
it iteratively.
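
As a rough illustration of the "parse the tree objects" step, here is a
small Python sketch (not part of Git). It decodes the body of one tree
object after it has been extracted from the packfile and decompressed.
It assumes SHA-1 object IDs (20 raw bytes per entry) and a body without
the "tree <size>" header in front; the function name and the "kind"
labels are made up for this example.

```python
def parse_tree(body, oid_len=20):
    """Split a raw tree object body into (mode, kind, name, oid) tuples."""
    # Modes that are not regular files; anything else is treated as a blob.
    kinds = {"40000": "tree", "160000": "commit", "120000": "symlink"}
    entries = []
    pos = 0
    while pos < len(body):
        # Each entry is: <octal mode> SP <name> NUL <raw object id>
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", "replace")
        oid = body[nul + 1:nul + 1 + oid_len].hex()
        entries.append((mode, kinds.get(mode, "blob"), name, oid))
        pos = nul + 1 + oid_len
    return entries
```

Entries whose kind is "tree" are subdirectories; a directory listing for
a deeper path would repeat the same parse on that tree's object.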

Some hosts offer a separate API that would give you a much nicer
interface. E.g., GitHub has:

  https://docs.github.com/en/rest/reference/git#trees

But of course that won't work with GitLab, etc, and you'd have to
implement against the API for each hosting provider.

-Peff


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-30 19:10 ` Jeff King
@ 2021-08-30 19:43   ` Junio C Hamano
  2021-08-30 20:46     ` Jeff King
  2021-08-31  6:38   ` Stef Bon
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2021-08-30 19:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Stef Bon, Git Users

Jeff King <peff@peff.net> writes:

> There is no operation to list the tree contents, for example, nor really
> even a good way to fetch a single object. The protocol is geared around
> efficiently transferring slices of history, so it is looking at sets of
> reachable objects (what the client is asking for, and what it claims to
> have).
>
> You might be able to cobble something together with shallow and partial
> fetches. E.g., something like:
>
>   git clone --depth 1 --filter=blob:none --single-branch -b $branch

I was hoping that our support for fetching a single object (not
necessarily a commit) at the protocol level was good enough, so that
Stef's fuse/nfs daemon can fetch the tree object it is interested
in.

There also is an effort, slowly moving to add verbs like object-info
to the protocol to help the vfs usecase, but primitives at too low a
level would be killed by latency, so it is somewhat unknown how
effective it would be.




* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-30 19:43   ` Junio C Hamano
@ 2021-08-30 20:46     ` Jeff King
  2021-08-30 21:21       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2021-08-30 20:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Stef Bon, Git Users

On Mon, Aug 30, 2021 at 12:43:38PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > There is no operation to list the tree contents, for example, nor really
> > even a good way to fetch a single object. The protocol is geared around
> > efficiently transferring slices of history, so it is looking at sets of
> > reachable objects (what the client is asking for, and what it claims to
> > have).
> >
> > You might be able to cobble something together with shallow and partial
> > fetches. E.g., something like:
> >
> >   git clone --depth 1 --filter=blob:none --single-branch -b $branch
> 
> I was hoping that our support for fetching a single object (not
> necessarily a commit) at the protocol level was good enough, so that
> Stef's fuse/nfs daemon can fetch the tree object it is interested
> in.

I don't think there's a clean way to ask for a single object. But
thinking on it more, I suspect you could do something _really_ hacky
using the new object-type filters:

  git fetch --filter=object:type=commit --filter=object:type=blob

Because we AND the filters together, no object can satisfy both. But
because we also send any objects which were _explicitly_ requested by
the client, you can now fetch whatever single objects you want.

And as long as you tell the other side you don't have any objects, it
won't send any deltas.
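
On the wire, that trick might look roughly like the upload-request
built below (original v0 framing, as in pack-protocol.txt). This is
only a Python sketch under several assumptions: the server advertised
the "filter" capability, it understands object:type filters, and it
accepts a "want" for an object that is not a ref tip (which depends on
server configuration). The "combine:" spelling of the two filters is
also an assumption about how multiple --filter options end up in a
single filter-spec; GIT_TRACE_PACKET output from a real client is the
thing to compare against. The function names are invented.

```python
def pkt_line(payload):
    """Frame one payload: 4 hex digits of total length, then the data."""
    return b"%04x" % (len(payload) + 4) + payload


FLUSH = b"0000"  # flush-pkt, ends a section of the request


def single_object_request(oid):
    """Build a request for one object, excluding everything else by filter."""
    # Assumed wire form of:
    #   --filter=object:type=commit --filter=object:type=blob
    spec = b"combine:object:type=commit+object:type=blob"
    return b"".join([
        # The first want line also carries the client's capability list.
        pkt_line(b"want " + oid.encode("ascii") + b" filter\n"),
        pkt_line(b"filter " + spec + b"\n"),
        FLUSH,
        # No "have" lines: the server has nothing of ours to delta against.
        pkt_line(b"done\n"),
    ])
```

The server's reply would then be a short negotiation answer followed by
a packfile that should contain just the requested object.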

> There also is an effort, slowly moving to add verbs like object-info
> to the protocol to help the vfs usecase, but primitives at too low a
> level would be killed by latency, so it is somewhat unknown how
> effective it would be.

Yes. At GitHub we actually have a custom endpoint which hooks up
"cat-file --batch" with a format of the client's choosing. That's what
(indirectly) feeds things like raw.github.com.

I've been tempted to send it upstream, but it's pretty ugly, and does
give the client a lot of power (for now, the placeholders you can use
with cat-file are not that powerful, but if we start to unify with
ref-filter, etc, then we run into situations like we had with
%(describe) recently). Likewise, the v2 object-info endpoint _could_
accept arbitrary format strings (it's the same idea, just with
--batch-check instead of --batch).

-Peff


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-30 20:46     ` Jeff King
@ 2021-08-30 21:21       ` Junio C Hamano
  2021-08-31 14:23         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2021-08-30 21:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Stef Bon, Git Users

Jeff King <peff@peff.net> writes:

> Yes. At GitHub we actually have a custom endpoint which hooks up
> "cat-file --batch" with a format of the client's choosing. That's what
> (indirectly) feeds things like raw.github.com.
>
> I've been tempted to send it upstream, but it's pretty ugly, and does
> give the client a lot of power (for now, the placeholders you can use
> with cat-file are not that powerful, but if we start to unify with
> ref-filter, etc, then we run into situations like we had with
> %(describe) recently). Likewise, the v2 object-info endpoint _could_
> accept arbitrary format strings (it's the same idea, just with
> --batch-check instead of --batch).

Yeah, the object-info actually was from folks who are interested in
doing something similar, and it would be nice if we can share the
protocol endpoint that is more suitable for interactive tree and
history traversal to help those who want to do virtual filesystem.

Thanks.



* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-30 19:10 ` Jeff King
  2021-08-30 19:43   ` Junio C Hamano
@ 2021-08-31  6:38   ` Stef Bon
  2021-08-31  7:07     ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Stef Bon @ 2021-08-31  6:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Users

Hi,

thank you for the answer.

I understand that the core of git is to make people work together when
writing code.
Getting a tree of the source files is not directly part of that, but
purely informational. That is also the intent of my fuse fs: to provide
the user with information about the source files.

Now I have a working ssh connection to the server, and open a channel
for running the upload-pack on the server using the exec channel
request:

https://datatracker.ietf.org/doc/html/rfc4254#section-6.5

So in my program I do not have to do something like:

ssh -x git@server "git-upload-pack 'simplegit-progit.git'"

It is only the sending of an exec message with the right command.
Via the SSH_MSG_CHANNEL_DATA message the server will return the
output. In my program I have to write a parser to get the
tree/direntries.

Now you suggest the git clone --depth 1 --filter=blob:none
--single-branch -b $branch
command. How does that look when writing it in lowlevel git messages
as described in

https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols

?
I'm programming at this low level, so I have to write the messages to
send to the server myself.

And you mention the api github has for a git tree object. But git2 has
already the git_tree object?

Stef

My project, by the way, is: https://github.com/stefbon/OSNS


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-31  6:38   ` Stef Bon
@ 2021-08-31  7:07     ` Jeff King
  2021-08-31  9:44       ` Stef Bon
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2021-08-31  7:07 UTC (permalink / raw)
  To: Stef Bon; +Cc: Git Users

On Tue, Aug 31, 2021 at 08:38:39AM +0200, Stef Bon wrote:

> So in my program I do not have to do something like:
> 
> ssh -x git@server "git-upload-pack 'simplegit-progit.git'"
> 
> It is only the sending of an exec message with the right command.
> Via the SSH_MSG_CHANNEL_DATA message the server will return the
> output. In my program I have to write a parser to get the
> tree/direntries.
> 
> Now you suggest the git clone --depth 1 --filter=blob:none
> --single-branch -b $branch
> command. How does that look when writing it in lowlevel git messages
> as described in
> 
> https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols
> 
> ?
> I'm programming at this low level, so I have to write the messages to
> send to the server myself.

You'll have to read the documentation I pointed to earlier:

  https://github.com/git/git/blob/master/Documentation/technical/pack-protocol.txt

In short: the server tells you which refs it has and what they point to,
then the client says which objects it wants and which objects it has,
and then the server sends a packfile. The flow of the protocol and the
format of the messages is laid out there.

You might also set GIT_TRACE_PACKET=1 in your environment and try
running some Git commands. They will show you what's being said on the
wire, up until the packfile is sent (decoding the packfile itself is a
whole other story).
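
The first step of that exchange, reading the server's ref
advertisement, can be sketched in Python as below. Every message is a
pkt-line: four hex digits giving the total length (including those four
digits), followed by the payload; 0000 is a flush-pkt. The sketch
assumes the original (v0) advertisement format, where the first ref
line carries the capability list after a NUL byte, and it reads from an
in-memory buffer rather than a real ssh channel. The function names are
illustrative only.

```python
import io


def read_pkt_lines(stream):
    """Yield pkt-line payloads; None stands for a flush or marker packet."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of input
        length = int(header, 16)
        if length < 4:
            # 0000 is a flush-pkt; lengths below 4 carry no payload.
            yield None
            continue
        yield stream.read(length - 4)


def parse_ref_advertisement(data):
    """Return ({refname: oid}, [capabilities]) from a v0 advertisement."""
    refs, caps = {}, []
    for payload in read_pkt_lines(io.BytesIO(data)):
        if payload is None:
            break  # the flush-pkt ends the advertisement
        line, _, cap_text = payload.rstrip(b"\n").partition(b"\0")
        if cap_text:
            caps = cap_text.decode("ascii").split()
        oid, _, name = line.partition(b" ")
        refs[name.decode("utf-8")] = oid.decode("ascii")
    return refs, caps
```

The object ID found for the wanted branch is what goes into the "want"
line of the request that follows.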

> And you mention the api github has for a git tree object. But git2 has
> already the git_tree object?

If you mean libgit2, then yes, it has a git_tree struct. Just like we
have internally within regular Git. But those are for accessing _local_
objects, that have already been fetched.

You could build a fuse filesystem around a local Git repository pretty
easily, either by using libgit2 or around tools like "git ls-tree" and
"git cat-file". But if your purpose is to access a remote one without
downloading all of the objects first, then no, Git does not expose any
of the endpoints you'd need remotely (but provider-specific APIs like
GitHub's do).

-Peff


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-31  7:07     ` Jeff King
@ 2021-08-31  9:44       ` Stef Bon
  2021-08-31 14:01         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Stef Bon @ 2021-08-31  9:44 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Users

On Tue, Aug 31, 2021 at 09:07, Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 08:38:39AM +0200, Stef Bon wrote:
>

> You might also set GIT_TRACE_PACKET=1 in your environment and try
> running some Git commands. They will show you what's being said on the
> wire, up until the packfile is sent (decoding the packfile itself is a
> whole other story).
>

Yes that will give me the insight I need.
I will come back when it comes to decoding the packfile.

Thanks,
Stef


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-31  9:44       ` Stef Bon
@ 2021-08-31 14:01         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-08-31 14:01 UTC (permalink / raw)
  To: Stef Bon; +Cc: Jeff King, Git Users


On Tue, Aug 31 2021, Stef Bon wrote:

> On Tue, Aug 31, 2021 at 09:07, Jeff King <peff@peff.net> wrote:
>>
>> On Tue, Aug 31, 2021 at 08:38:39AM +0200, Stef Bon wrote:
>>
>
>> You might also set GIT_TRACE_PACKET=1 in your environment and try
>> running some Git commands. They will show you what's being said on the
>> wire, up until the packfile is sent (decoding the packfile itself is a
>> whole other story).
>>
>
> Yes that will give me the insight I need.
> I will come back when it comes to decoding the packfile.

Aside from the "here's how you can do it", you haven't said why you'd
like to do such "online" browsing of the repository.

I'd think that even for something that e.g. implements a file browser
with magic git-remote support (think GNOME VFS-like), what you'd want to
do in the background would be to do a "clone", although a clone with
some combination of --single-branch, --no-tags, and perhaps --depth and
the filters discussed upthread.

It will take the same time to get the pack, but once you do you can use
libgit2, git's plumbing etc. to do really fast browsing/wildcarding
etc. of the entries locally.

So is there a real performance or other use-case for wanting to do this,
or does it just come down to the lack of a nice "one-shot" API for "list
remote files"?

In any case, on the topic of clever things you can (ab)use to do this,
some remotes support running "git archive" for you. Notably GitHub
doesn't, but GitLab does. Please don't take this as an endorsement to
run this command "in production":

    $ time (git archive --format=tar  --remote=git@gitlab.com:git-vcs/git.git --prefix=t/t4018/ HEAD:t/t4018 | tar -tf- | head -n 3)
    t/t4018/
    t/t4018/README
    t/t4018/bash-arithmetic-function

    real    0m1.545s

I idly wonder, if there's a want/need for a file listing API, whether
doing so via the tar/zip format wouldn't be a more viable & widely
supported thing than expecting everyone to come up with their own git
packfile decoders. I.e. if we just supported some option to create
all-empty dummy files via "git archive", this could be even better as a
dummy file listing API. Right now this (ab)use of it requires
e.g. sending ~10MB of t/'s content just to list everything in the t/
directory.
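
For what it's worth, turning such a tar stream into a listing needs no
Git-specific parsing at all. A minimal Python sketch using only the
standard tarfile module follows; the function name is made up, and
wiring it to the output of a "git archive --remote=..." process is left
out.

```python
import io
import tarfile


def list_tar_entries(stream):
    """Return the member names of a tar stream, directories with a '/'."""
    names = []
    # "r|" reads the archive strictly sequentially, so a pipe works too.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            # tarfile drops the trailing slash of directories; put it back.
            names.append(member.name + "/" if member.isdir() else member.name)
    return names
```

Note that the file contents still travel over the wire and are merely
skipped while reading, which is the cost described above.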



* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-30 21:21       ` Junio C Hamano
@ 2021-08-31 14:23         ` Ævar Arnfjörð Bjarmason
  2021-08-31 15:35           ` Bruno Albuquerque
  0 siblings, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-08-31 14:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Stef Bon, Git Users, Bruno Albuquerque


On Mon, Aug 30 2021, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
>> Yes. At GitHub we actually have a custom endpoint which hooks up
>> "cat-file --batch" with a format of the client's choosing. That's what
>> (indirectly) feeds things like raw.github.com.
>>
>> I've been tempted to send it upstream, but it's pretty ugly, and does
>> give the client a lot of power (for now, the placeholders you can use
>> with cat-file are not that powerful, but if we start to unify with
>> ref-filter, etc, then we run into situations like we had with
>> %(describe) recently). Likewise, the v2 object-info endpoint _could_
>> accept arbitrary format strings (it's the same idea, just with
>> --batch-check instead of --batch).
>
> Yeah, the object-info actually was from folks who are interested in
> doing something similar, and it would be nice if we can share the
> protocol endpoint that is more suitable for interactive tree and
> history traversal to help those who want to do virtual filesystem.

While this is all clever, I think this discussion really suggests that
the first thing we should do is make the relatively recent "object-info"
protocol verb not a default part of the supported v2 protocol we ship in
git.git.

I.e. someone setting up a git server probably isn't going to suspect
that one day their server load is going to go up by some big % because
some developer somewhere is using a local IDE whose every file click on
a directory is a new remote server request (i.e. the case where
"object-info"'s functionality is expanded like this).

I found myself wondering this when reading serve.c the other day,
i.e. why we have "always_advertise" for object-info, but it seemed
innocuous enough given how it's described in a2ba162cda2 (object-info:
support for retrieving object info, 2021-04-20).

But just as a general thing, while I'm very much in favor of git growing
*optional* support for more server<->client cooperation and CPU
offloading, even things like "git grep" or "git log" optimistically
running server-side, I think those sorts of features should definitely
be off by default for the reasons noted above.


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-31 14:23         ` Ævar Arnfjörð Bjarmason
@ 2021-08-31 15:35           ` Bruno Albuquerque
  2021-08-31 16:23             ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Bruno Albuquerque @ 2021-08-31 15:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Jeff King, Stef Bon, Git Users

On Tue, Aug 31, 2021 at 7:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:

[Replying again as I used HTML mail by mistake. Sorry.]

> I.e. someone setting up a git server probably isn't going to suspect
> that one day their server load is going to go up by some big % because
> some developer somewhere is using a local IDE whose every file click on
> a directory is a new remote server request (i.e. the case where
> "object-info"'s functionality is expanded like this).

Do you mean by someone directly sending object-info requests? I am
working on wiring object-info to some of the existing tools
(cat-file/ls-tree) so this is the general idea about how I see this
being used:

- object-info would be used when it made sense but only if the actual
object being queried is not already fetched locally. If you think of a
virtual filesystem that is backed by, say, partial clones, this mostly
means retrieving metadata information to be displayed to the user.
- Still in the context of a virtual filesystem, metadata is usually
cached locally independently of Git itself, further reducing the need
to call object-info (but, of course, this is a brittle assumption as
it is not controlled by Git).
- git cat-file, for example, would be changed to support real batching
and then send a single request instead of the multiple requests it
does currently.

My point is that I understand where your worry is coming from. As
long as someone can send arbitrary requests, your scenario of a heavier
server load can happen. But with the expected canonical usage I do not
think this would be a problem and, in fact, under some usage patterns
it might make things better (mostly due to batching support in
object-info).

With all that being said, I don't think making it optional would be
an issue so I have no strong feelings about this. I am fine with
whatever is agreed to be the best approach.

> I found myself wondering this when reading serve.c the other day,
> i.e. why we have "always_advertise" for object-info, but it seemed
> innocuous enough given how it's described in a2ba162cda2 (object-info:
> support for retrieving object info, 2021-04-20).

For what it is worth, the same change is now being reviewed in JGit
and there the feature is conditionally enabled. But that was a
side-effect of needing to deploy it to multiple servers before making
the feature available to clients.

--

Bruno Albuquerque | Software Engineer | bga@google.com | +1 650-395-8242


* Re: Exec upload-pack on remote with what parameters to get direntries.
  2021-08-31 15:35           ` Bruno Albuquerque
@ 2021-08-31 16:23             ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2021-08-31 16:23 UTC (permalink / raw)
  To: Bruno Albuquerque
  Cc: Ævar Arnfjörð Bjarmason, Jeff King, Stef Bon,
	Git Users

Bruno Albuquerque <bga@google.com> writes:

> With all that being said, I don't think making it optional would be
> an issue so I have no strong feelings about this. I am fine with
> whatever is agreed to be the best approach.
>
>> I found myself wondering this when reading serve.c the other day,
>> i.e. why we have "always_advertise" for object-info, but it seemed
>> innocuous enough given how it's described in a2ba162cda2 (object-info:
>> support for retrieving object info, 2021-04-20).
> For what it is worth, the same change is now being reviewed in JGit
> and there the feature is conditionally enabled. But that was a
> side-effect of needing to deploy it to multiple servers before making
> the feature available to clients.

FWIW, I do not mind, and probably prefer if I think about it a bit
longer, to make it an opt-in feature, like all other capabilities
defined in the serve.c file.

Thanks.

