git@vger.kernel.org mailing list mirror (one of many)
* Proposed approaches to supporting HTTP remotes in "git archive"
@ 2018-07-27 21:47 Josh Steadmon
  2018-07-27 21:56 ` Jonathan Nieder
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Josh Steadmon @ 2018-07-27 21:47 UTC (permalink / raw)
  To: git

# Supporting HTTP remotes in "git archive"

We would like to allow remote archiving from HTTP servers. There are a
few possible implementations to be discussed:

## Shallow clone to temporary repo

This approach builds on existing endpoints. Clients will connect to the
remote server's git-upload-pack service and fetch a shallow clone of the
requested commit into a temporary local repo. The write_archive()
function is then called on the local clone to write out the requested
archive.

### Benefits

* This can be implemented entirely in builtin/archive.c. No new service
  endpoints or server code are required.

* The archive is generated and compressed on the client side. This
  reduces CPU load on the server (for compressed archives) which would
  otherwise be a potential DoS vector.

* This provides a git-native way to archive from any HTTP server that
  supports the git-upload-pack service; some providers (including GitHub)
  do not currently allow the git-upload-archive service.

### Drawbacks

* Archives generated remotely may not be bit-for-bit identical compared
  to those generated locally, if the versions of git used on the client
  and on the server differ.

* This requires higher bandwidth compared to transferring a compressed
  archive generated on the server.


## Use git-upload-archive

This approach requires adding support for the git-upload-archive
endpoint to the HTTP backend. Clients will connect to the remote
server's git-upload-archive service and the server will generate the
archive which is then delivered to the client.
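
For comparison, transports that already reach git-upload-archive (local
paths, SSH, git://) can be exercised today roughly as in the sketch
below. The function name `list_remote_tree` and its two arguments are
placeholders for illustration, not part of git, and the remote is
assumed to permit the upload-archive service.

```shell
# Sketch only: list the files of a remote tree without cloning it.
# $1 = remote repository (path or URL), $2 = branch or tag name.
list_remote_tree() {
	# "git archive --remote" asks the remote's git-upload-archive to
	# generate the tar stream; tar -t only lists the member names.
	git archive --remote="$1" --format=tar "$2" | tar -tf -
}
```

The same invocation against an http:// or https:// URL is what this
approach would have to make work.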

### Benefits

* Matches existing "git archive" behavior for other remotes.

* Requires less bandwidth to send a compressed archive than a shallow
  clone.

* Resulting archive does not depend in any way on the client
  implementation.

### Drawbacks

* Implementation is more complicated; it will require changes to (at
  least) builtin/archive.c, http-backend.c, and
  builtin/upload-archive.c.

* Generates more CPU load on the server when compressing archives. This
  is potentially a DoS vector.

* Does not allow archiving from servers that don't support the
  git-upload-archive service.


## Add a new protocol v2 "archive" command

I am still a bit hazy on the exact details of this approach; please
forgive any inaccuracies (I'm a new contributor and haven't examined
custom v2 commands in much detail yet).

This approach builds off the existing v2 upload-pack endpoint. The
client will issue an archive command (with options to select particular
paths or a tree-ish). The server will generate the archive and deliver
it to the client.

### Benefits

* Requires less bandwidth to send a compressed archive than a shallow
  clone.

* Resulting archive does not depend in any way on the client
  implementation.

### Drawbacks

* Generates more CPU load on the server when compressing archives. This
  is potentially a DoS vector.

* Servers must support the v2 protocol (although the client could
  potentially fall back to some other supported remote archive
  functionality).

### Unknowns

* I am not clear on the relative complexity of this approach compared to
  the others, and would appreciate any guidance offered.


## Summary

Personally, I lean towards the first approach. It could give us an
opportunity to remove server-side complexity; there is no reason that
the shallow-clone approach must be restricted to the HTTP transport, and
we could re-implement other transports using this method.  Additionally,
it would allow clients to pull archives from remotes that would not
otherwise support it.

That said, I am happy to work on whichever approach the community deems
most worthwhile.


* Re: Proposed approaches to supporting HTTP remotes in "git archive"
  2018-07-27 21:47 Proposed approaches to supporting HTTP remotes in "git archive" Josh Steadmon
@ 2018-07-27 21:56 ` Jonathan Nieder
  2018-07-27 22:00 ` Jonathan Nieder
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Jonathan Nieder @ 2018-07-27 21:56 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, René Scharfe, Jeff King, Franck Bui-Huu,
	Johannes Schindelin

(just cc-ing René Scharfe, archive expert; Peff; Dscho; Franck Bui-Huu
to see how his creation is evolving)

Josh Steadmon wrote:

> # Supporting HTTP remotes in "git archive"
>
> We would like to allow remote archiving from HTTP servers. There are a
> few possible implementations to be discussed:
>
> ## Shallow clone to temporary repo
>
> This approach builds on existing endpoints. Clients will connect to the
> remote server's git-upload-pack service and fetch a shallow clone of the
> requested commit into a temporary local repo. The write_archive()
> function is then called on the local clone to write out the requested
> archive.
>
> ### Benefits
>
> * This can be implemented entirely in builtin/archive.c. No new service
>   endpoints or server code are required.
>
> * The archive is generated and compressed on the client side. This
>   reduces CPU load on the server (for compressed archives) which would
>   otherwise be a potential DoS vector.
>
> * This provides a git-native way to archive from any HTTP server that
>   supports the git-upload-pack service; some providers (including GitHub)
>   do not currently allow the git-upload-archive service.
>
> ### Drawbacks
>
> * Archives generated remotely may not be bit-for-bit identical compared
>   to those generated locally, if the versions of git used on the client
>   and on the server differ.
>
> * This requires higher bandwidth compared to transferring a compressed
>   archive generated on the server.
>
>
> ## Use git-upload-archive
>
> This approach requires adding support for the git-upload-archive
> endpoint to the HTTP backend. Clients will connect to the remote
> server's git-upload-archive service and the server will generate the
> archive which is then delivered to the client.
>
> ### Benefits
>
> * Matches existing "git archive" behavior for other remotes.
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Implementation is more complicated; it will require changes to (at
>   least) builtin/archive.c, http-backend.c, and
>   builtin/upload-archive.c.
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Does not allow archiving from servers that don't support the
>   git-upload-archive service.
>
>
> ## Add a new protocol v2 "archive" command
>
> I am still a bit hazy on the exact details of this approach; please
> forgive any inaccuracies (I'm a new contributor and haven't examined
> custom v2 commands in much detail yet).
>
> This approach builds off the existing v2 upload-pack endpoint. The
> client will issue an archive command (with options to select particular
> paths or a tree-ish). The server will generate the archive and deliver
> it to the client.
>
> ### Benefits
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Servers must support the v2 protocol (although the client could
>   potentially fall back to some other supported remote archive
>   functionality).
>
> ### Unknowns
>
> * I am not clear on the relative complexity of this approach compared to
>   the others, and would appreciate any guidance offered.
>
>
> ## Summary
>
> Personally, I lean towards the first approach. It could give us an
> opportunity to remove server-side complexity; there is no reason that
> the shallow-clone approach must be restricted to the HTTP transport, and
> we could re-implement other transports using this method.  Additionally,
> it would allow clients to pull archives from remotes that would not
> otherwise support it.
>
> That said, I am happy to work on whichever approach the community deems
> most worthwhile.

* Re: Proposed approaches to supporting HTTP remotes in "git archive"
  2018-07-27 21:47 Proposed approaches to supporting HTTP remotes in "git archive" Josh Steadmon
  2018-07-27 21:56 ` Jonathan Nieder
@ 2018-07-27 22:00 ` Jonathan Nieder
  2018-07-27 22:32 ` Junio C Hamano
  2018-07-28 18:52 ` brian m. carlson
  3 siblings, 0 replies; 6+ messages in thread
From: Jonathan Nieder @ 2018-07-27 22:00 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: git, René Scharfe, Jeff King, Franck Bui-Huu,
	Johannes Schindelin

(just cc-ing René Scharfe, archive expert; Peff; Dscho; Franck Bui-Huu
to see how his creation is evolving.

Using the correct address for René this time. Sorry for the noise.)

Josh Steadmon wrote:

> # Supporting HTTP remotes in "git archive"
>
> We would like to allow remote archiving from HTTP servers. There are a
> few possible implementations to be discussed:
>
> ## Shallow clone to temporary repo
>
> This approach builds on existing endpoints. Clients will connect to the
> remote server's git-upload-pack service and fetch a shallow clone of the
> requested commit into a temporary local repo. The write_archive()
> function is then called on the local clone to write out the requested
> archive.
>
> ### Benefits
>
> * This can be implemented entirely in builtin/archive.c. No new service
>   endpoints or server code are required.
>
> * The archive is generated and compressed on the client side. This
>   reduces CPU load on the server (for compressed archives) which would
>   otherwise be a potential DoS vector.
>
> * This provides a git-native way to archive from any HTTP server that
>   supports the git-upload-pack service; some providers (including GitHub)
>   do not currently allow the git-upload-archive service.
>
> ### Drawbacks
>
> * Archives generated remotely may not be bit-for-bit identical compared
>   to those generated locally, if the versions of git used on the client
>   and on the server differ.
>
> * This requires higher bandwidth compared to transferring a compressed
>   archive generated on the server.
>
>
> ## Use git-upload-archive
>
> This approach requires adding support for the git-upload-archive
> endpoint to the HTTP backend. Clients will connect to the remote
> server's git-upload-archive service and the server will generate the
> archive which is then delivered to the client.
>
> ### Benefits
>
> * Matches existing "git archive" behavior for other remotes.
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Implementation is more complicated; it will require changes to (at
>   least) builtin/archive.c, http-backend.c, and
>   builtin/upload-archive.c.
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Does not allow archiving from servers that don't support the
>   git-upload-archive service.
>
>
> ## Add a new protocol v2 "archive" command
>
> I am still a bit hazy on the exact details of this approach; please
> forgive any inaccuracies (I'm a new contributor and haven't examined
> custom v2 commands in much detail yet).
>
> This approach builds off the existing v2 upload-pack endpoint. The
> client will issue an archive command (with options to select particular
> paths or a tree-ish). The server will generate the archive and deliver
> it to the client.
>
> ### Benefits
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Servers must support the v2 protocol (although the client could
>   potentially fall back to some other supported remote archive
>   functionality).
>
> ### Unknowns
>
> * I am not clear on the relative complexity of this approach compared to
>   the others, and would appreciate any guidance offered.
>
>
> ## Summary
>
> Personally, I lean towards the first approach. It could give us an
> opportunity to remove server-side complexity; there is no reason that
> the shallow-clone approach must be restricted to the HTTP transport, and
> we could re-implement other transports using this method.  Additionally,
> it would allow clients to pull archives from remotes that would not
> otherwise support it.
>
> That said, I am happy to work on whichever approach the community deems
> most worthwhile.

* Re: Proposed approaches to supporting HTTP remotes in "git archive"
  2018-07-27 21:47 Proposed approaches to supporting HTTP remotes in "git archive" Josh Steadmon
  2018-07-27 21:56 ` Jonathan Nieder
  2018-07-27 22:00 ` Jonathan Nieder
@ 2018-07-27 22:32 ` Junio C Hamano
  2018-07-29 11:54   ` René Scharfe
  2018-07-28 18:52 ` brian m. carlson
  3 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2018-07-27 22:32 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git

Josh Steadmon <steadmon@google.com> writes:

> # Supporting HTTP remotes in "git archive"
>
> We would like to allow remote archiving from HTTP servers. There are a
> few possible implementations to be discussed:
>
> ## Shallow clone to temporary repo
>
> This approach builds on existing endpoints. Clients will connect to the
> remote server's git-upload-pack service and fetch a shallow clone of the
> requested commit into a temporary local repo. The write_archive()
> function is then called on the local clone to write out the requested
> archive.
>
> ...
>
> ## Summary
>
> Personally, I lean towards the first approach. It could give us an
> opportunity to remove server-side complexity; there is no reason that
> the shallow-clone approach must be restricted to the HTTP transport, and
> we could re-implement other transports using this method.  Additionally,
> it would allow clients to pull archives from remotes that would not
> otherwise support it.

I consider the first one (i.e. make a shallow clone and tar it up
locally) a hack that does *not* belong to "git archive --remote"
command, especially when it is only done to "http remotes".  The
only reason HTTP remotes are special is because there is no ready
"http-backend" equivalent that passes the "git archive" traffic over
smart-http transport, unlike the one that exists for "git
upload-pack".

It however still _is_ attractive to drive such a hack from "git
archive" at the UI level, as the end users do not care how ugly the
hack is ;-)  As you mentioned, the approach would work for any
transport that allows one-commit shallow clone, so it might become
more palatable if it is designed as a different _mode_ of operation
of "git archive" that is orthogonal to the underlying transport,
i.e.

  $ git archive --remote=<repo> --shallow-clone-then-local-archive-hack master

or

  $ git config archive.<repo>.useShallowCloneThenLocalArchiveHack true
  $ git archive --remote=<repo> master
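
A wrapper could read such a per-remote setting roughly as follows. This
is only a sketch: git itself does not recognize this configuration key,
and `use_clone_mode` is a hypothetical helper name.

```shell
# Sketch only: succeed when the (hypothetical) per-remote switch is true.
# $1 = the <repo> string exactly as it would be passed to --remote.
use_clone_mode() {
	# --bool normalizes yes/on/1 to "true"; a missing key prints nothing.
	value=$(git config --bool --get \
		"archive.$1.useShallowCloneThenLocalArchiveHack" 2>/dev/null)
	[ "$value" = true ]
}
```

A caller would then pick the clone-based path only when
`use_clone_mode "$repo"` succeeds, and use the native remote archive
request otherwise.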

It might turn out that it may work better than the native "git
archive" access against servers that offer both shallow clone
and native archive access.  I doubt a single-commit shallow clone
would benefit from reusing of premade deltas and compressed bases
streamed straight out of packfiles from the server side that much,
but you'd never know until you measure ;-)



* Re: Proposed approaches to supporting HTTP remotes in "git archive"
  2018-07-27 21:47 Proposed approaches to supporting HTTP remotes in "git archive" Josh Steadmon
                   ` (2 preceding siblings ...)
  2018-07-27 22:32 ` Junio C Hamano
@ 2018-07-28 18:52 ` brian m. carlson
  3 siblings, 0 replies; 6+ messages in thread
From: brian m. carlson @ 2018-07-28 18:52 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git

On Fri, Jul 27, 2018 at 02:47:00PM -0700, Josh Steadmon wrote:
> ## Use git-upload-archive
> 
> This approach requires adding support for the git-upload-archive
> endpoint to the HTTP backend. Clients will connect to the remote
> server's git-upload-archive service and the server will generate the
> archive which is then delivered to the client.
> 
> ### Benefits
> 
> * Matches existing "git archive" behavior for other remotes.
> 
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
> 
> * Resulting archive does not depend in any way on the client
>   implementation.
> 
> ### Drawbacks
> 
> * Implementation is more complicated; it will require changes to (at
>   least) builtin/archive.c, http-backend.c, and
>   builtin/upload-archive.c.
> 
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
> 
> * Does not allow archiving from servers that don't support the
>   git-upload-archive service.

I happen to like this option because it has the potential to be driven
by a non-git client (e.g. a curl invocation).  That would be enormously
valuable, especially in cases where authentication isn't desired or an
SSH key isn't a good form of authentication.

I'm not really worried about the DoS vector because an implementation is
almost certainly going to support both SSH and HTTPS or neither, and the
DoS potential is the same either way.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Proposed approaches to supporting HTTP remotes in "git archive"
  2018-07-27 22:32 ` Junio C Hamano
@ 2018-07-29 11:54   ` René Scharfe
  0 siblings, 0 replies; 6+ messages in thread
From: René Scharfe @ 2018-07-29 11:54 UTC (permalink / raw)
  To: Junio C Hamano, Josh Steadmon; +Cc: git

Am 28.07.2018 um 00:32 schrieb Junio C Hamano:
> Josh Steadmon <steadmon@google.com> writes:
> 
>> # Supporting HTTP remotes in "git archive"
>>
>> We would like to allow remote archiving from HTTP servers. There are a
>> few possible implementations to be discussed:
>>
>> ## Shallow clone to temporary repo
>>
>> This approach builds on existing endpoints. Clients will connect to the
>> remote server's git-upload-pack service and fetch a shallow clone of the
>> requested commit into a temporary local repo. The write_archive()
>> function is then called on the local clone to write out the requested
>> archive.

A prototype would require just a few lines of shell script, I guess.
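
Something along these lines, as an untested-in-anger sketch rather than
a proposed implementation. The function name and argument order are
made up for illustration, and it assumes the requested ref is a branch
or tag name (git clone --branch does not accept arbitrary tree-ishes):

```shell
# Sketch only: write a tar archive of <ref> from <url> to <out>
# by way of a temporary depth-1 bare clone.
archive_via_shallow_clone() {
	url=$1 ref=$2 out=$3
	tmp=$(mktemp -d) || return 1
	# --depth 1 asks upload-pack for just the tip commit of <ref>;
	# HEAD of the new clone then points at that commit.
	git clone --quiet --bare --depth 1 --branch "$ref" \
		"$url" "$tmp/repo.git" &&
		git -C "$tmp/repo.git" archive --format=tar HEAD >"$out"
	status=$?
	# The temporary clone is only needed while the archive is written.
	rm -rf "$tmp"
	return $status
}
```

It would be called as, e.g.,
`archive_via_shallow_clone <repo> master out.tar`.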

A downside that was only stated implicitly: This method needs temporary
disk space for the clone, while the existing archive modes only ever
write out the resulting file.  I guess the required space is of the same
order of magnitude as the compressed archive.  This shouldn't be a
problem if we assume the user would eventually want to extract its
contents, right?

>> ## Summary
>>
>> Personally, I lean towards the first approach. It could give us an
>> opportunity to remove server-side complexity; there is no reason that
>> the shallow-clone approach must be restricted to the HTTP transport, and
>> we could re-implement other transports using this method.  Additionally,
>> it would allow clients to pull archives from remotes that would not
>> otherwise support it.
> 
> I consider the first one (i.e. make a shallow clone and tar it up
> locally) a hack that does *not* belong to "git archive --remote"
> command, especially when it is only done to "http remotes".  The
> only reason HTTP remotes are special is because there is no ready
> "http-backend" equivalent that passes the "git archive" traffic over
> smart-http transport, unlike the one that exists for "git
> upload-pack".
> 
> It however still _is_ attractive to drive such a hack from "git
> archive" at the UI level, as the end users do not care how ugly the
> hack is ;-)  As you mentioned, the approach would work for any
> transport that allows one-commit shallow clone, so it might become
> more palatable if it is designed as a different _mode_ of operation
> of "git archive" that is orthogonal to the underlying transport,
> i.e.
> 
>    $ git archive --remote=<repo> --shallow-clone-then-local-archive-hack master
> 
> or
> 
>    $ git config archive.<repo>.useShallowCloneThenLocalArchiveHack true
>    $ git archive --remote=<repo> master

Archive-via-clone would also work with full clones (if shallow ones are
not available), but that would be wasteful and a bit cruel, of course.

Anyway, I think we should find a better (shorter) name for that option;
that could turn out to be the hardest part. :)

> It might turn out that it may work better than the native "git
> archive" access against servers that offer both shallow clone
> and native archive access.  I doubt a single-commit shallow clone
> would benefit from reusing of premade deltas and compressed bases
> streamed straight out of packfiles from the server side that much,
> but you'd never know until you measure ;-)

It could benefit from GIT_ALTERNATE_OBJECT_DIRECTORIES, but I guess
typical users of git archive --remote won't have any good ones lying
around.

René
