git@vger.kernel.org mailing list mirror (one of many)
From: Christian Couder <christian.couder@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Ben Peart <Ben.Peart@microsoft.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	Eric Wong <e@80x24.org>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: Re: [PATCH v6 09/40] Add initial external odb support
Date: Tue, 3 Oct 2017 11:45:02 +0200	[thread overview]
Message-ID: <CAP8UFD05QzCn53SjbbgaToVWyd=3N+HayDvR4OZNaP7YKbEBfg@mail.gmail.com> (raw)
In-Reply-To: <20170929133614.1c0cd68ad80139fafdf68b86@google.com>

On Fri, Sep 29, 2017 at 10:36 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Wed, 27 Sep 2017 18:46:30 +0200
> Christian Couder <christian.couder@gmail.com> wrote:

>> I don't think single-shot processes would be a huge burden, because
>> the code is simpler, and because for example for filters we already
>> have single shot and long-running processes and no one complains about
>> that. It's code that is useful as it makes it much easier for people
>> to do some things (see the clone bundle example).
>>
>> In fact in Git development we usually start by first implementing
>> simpler single-shot solutions, before thinking, when the need
>> arises, about making them faster. So perhaps an equally valid
>> opinion could be to first submit only the patches for the
>> single-shot protocol, and later submit the rest of the series once
>> we start getting feedback about how external odbs are used.
>
> My concern is that, as far as I understand about the Microsoft use case,
> we already know that we need the faster solution, so the need has
> already arisen.

Yeah, some people need the faster solution, but my opinion is that
many other people would prefer the single-shot protocol.
If all you want to do is a simple resumable clone using bundles, for
example, then the long-running process solution is very much overkill.

For example, with filters there are people using them to do keyword
expansion (maybe to emulate the way Subversion and CVS substitute
keywords like $Id$, $Author$ and so on). It would be really bad to
deprecate the single-shot filters and tell those people they now have
to use long-running processes because we don't want to maintain the
small amount of code that makes single-shot filters work.
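To make the filter comparison concrete, here is a sketch of such a
single-shot keyword-expansion filter, wired up with nothing but `git
config` and a `sed` one-liner. The `kwsub` driver name and the `@ID@`
token are made up for this example (a literal `$Id$` works the same way
but needs much more careful shell quoting); this uses the classic
clean/smudge interface, not anything from the patch series:

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Single-shot clean/smudge: git spawns one sed process per filtered file.
git config filter.kwsub.smudge 'sed "s/@ID@/@ID: demo@/"'
git config filter.kwsub.clean  'sed "s/@ID: [^@]*@/@ID@/"'
echo '*.txt filter=kwsub' > .gitattributes
printf '@ID@\n' > file.txt
git add file.txt          # clean runs here (a no-op on already-clean content)
rm file.txt
git checkout -- file.txt  # smudge runs here and expands the keyword
cat file.txt
```

Each checkout of a filtered path costs one short-lived `sed` process,
which is exactly the simplicity single-shot users care about.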

The Microsoft GVFS use case is just one use case that is very far from
what most people need, and my opinion is that many more people could
benefit from the single-shot protocol. For example, many people and
admins could benefit from resumable clones using bundles, and if I
remove the single-shot protocol, this use case will become
unnecessarily harder to implement, in the same way that keyword
expansion would become unnecessarily harder to implement if we removed
the single-shot filters.

See the first article in
https://git.github.io/rev_news/2016/03/16/edition-13/ about resumable
clone if you are not convinced that it is an old and important
problem.
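As a sketch of what such a helper would automate: the whole history is
packed into one bundle file, which can then be served statically and
fetched with any resumable downloader (`wget -c`, `curl -C -`), unlike
a smart-protocol clone. The repository layout below is made up for the
example:

```shell
set -e
work=$(mktemp -d)
cd "$work"
# A "server-side" repository with some history.
git init -q src
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'
# Pack the whole history into a single file: this file can be
# downloaded resumably, then cloned from with no network at all.
git -C src bundle create "$work/repo.bundle" HEAD --all
git clone -q "$work/repo.bundle" clone
git -C clone log --oneline
```

In a real deployment the client would then point `origin` back at the
live repository (`git remote set-url`) and run `git fetch` to catch up
past the bundle snapshot.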

>> And yeah I could change the order of the patch series to implement the
>> long-running processes first and the single-shot process last, so that
>> it could be possible to first get feedback about the long-running
>> processes before we decide whether or not to merge the single-shot
>> stuff, but I don't think it would be the most logical order.
>
> My thinking was that we would just implement the long-running process
> and not implement the single-shot process at all (besides maybe a script
> in contrib/). If we are going to do both anyway, I agree that we should
> do the single-shot process first.

Nice to hear that!

>> > And I think that my design can be extended to support a use case in
>> > which, for example, blobs corresponding to a certain type of filename
>> > (defined by a glob like in gitattributes) can be excluded during
>> > fetch/clone, much like --blob-max-bytes, and they can be fetched either
>> > through the built-in mechanism or through a custom hook.
>>
>> Sure, we could probably rebuild something equivalent to what I did on
>> top of your design.
>> My opinion though is that if we want to eventually get to the same
>> goal, it is better to first merge something that gets us very close
>> to the end goal and then add some improvements on top of it.
>
> I agree

So are you ok to rebase your patch series on top of my patch series?

My opinion is that my patch series is trying to get to the end goal,
and succeeding to a very large extent, with the smallest possible
amount of deep technical changes, and that it is the right way to
approach this problem for the following reasons:

1) The root problem is that the current object stores (packfiles and
loose object files) are not good ways to store some objects,
especially some blobs.

2) This root problem cannot be dealt with by Git itself without any
help from external programs, because Git cannot realistically
implement many different object stores (like HTTP servers, artifact
stores, etc.). So Git must be improved so that it becomes capable of
communicating with external object stores.

3) As the Git protocol uses packfiles to send objects and is not very
flexible, it might be better if external stores can also be used to
transfer objects that are no longer stored in the current object
stores. (As packfiles are not good for storing some objects, they are
probably also not a good format for sending them. Also, as the Git
protocol is not resumable, we might easily be able to implement
resumable clones if we let external stores handle some of the
transfer.)

4) Making it easy and flexible to exchange objects (and maybe meta
information) with the external stores is very important.

5) Protocol changes are more difficult than many other code changes,
so we should care a lot about the protocol between Git and external
stores.
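To make points 4) and 5) concrete, here is what a trivial single-shot
helper could look like. The command names (`get`, `put`, `have`) and
the `ODB_STORE` variable are made up for this sketch and are not the
instructions defined by the patch series; the point is only that a
complete working helper fits in a dozen lines of shell when each
request is a separate process:

```shell
# Write the illustrative helper to a temporary file.
helper=$(mktemp)
cat > "$helper" <<'EOF'
#!/bin/sh
# One process per request: read the command from argv, do the work, exit.
STORE=${ODB_STORE:-/tmp/odb-store}
mkdir -p "$STORE"
case "$1" in
    put)  cat > "$STORE/$2" ;;          # raw object bytes on stdin
    get)  cat "$STORE/$2" ;;            # raw object bytes on stdout
    have) test -f "$STORE/$2" ;;        # exit status reports presence
    *)    echo "unknown command: $1" >&2; exit 1 ;;
esac
EOF
chmod +x "$helper"

# Exercise it the way git would: one invocation per operation.
export ODB_STORE=$(mktemp -d)
printf 'blob payload' | "$helper" put 1234abcd
"$helper" have 1234abcd && "$helper" get 1234abcd
```

A long-running variant of the same store would need a framed
request/response protocol, capability negotiation, and process
lifetime management on top of this; that cost is what the single-shot
mode avoids for simple backends.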

> - I mentioned that because I personally prefer to review smaller
> patch sets at a time,

I am ok with sending small patch sets, and I will send smaller patch
sets about this from now on.

> and my patch set already includes a lot of the
> same infrastructure needed by yours - for example, the places in the
> code to dynamically fetch objects, exclusion of objects when fetching or
> cloning, configuring the cloned repo when cloning, fsck, and gc.

I agree that your patch set already includes some infrastructure that
could be used by my work, and your patch sets perhaps implement some
of this infrastructure better than my work does (I haven't taken a
deep look). But I really think that the right approach is to focus
first on designing a flexible protocol between Git and external
stores. The infrastructure work should then be about improving or
enabling that flexible protocol and the communication between Git and
external stores.

Doing the infrastructure work first, and improving things on top of
this new infrastructure without first agreeing on a design for the
protocol between Git and external stores, is not the best approach: I
think we might over-engineer some of the infrastructure, or base some
user interfaces on the infrastructure work rather than on the end
goal.

For example, if we improve the current protocol, which is not
necessarily a bad thing in itself, we might forget that for resumable
clone it is much better to just let external stores and helpers handle
the transfer.

I am not saying that doing infrastructure work is bad or will prevent
us from reaching our goals in the end, but I see it as something that
could distract or mislead us from focusing first on the protocol
between Git and external stores.

>> >  - I get compile errors when I "git am" these onto master. I think
>> >    '#include "config.h"' is needed in some places.
>>
>> It's strange because I get no compile errors even after a "make clean"
>> from my branch.
>> Could you show the actual errors?
>
> I don't have the error messages with me now, but it was something about
> a function being implicitly declared. You will probably get these errors
> if you sync past commit e67a57f ("config: create config.h", 2017-06-15).

I am past this commit and I get no errors.
I rebased on top of ea220ee40c ("The eleventh batch for 2.15").

>> > Any reason why you prefer to update the loose object functions than to
>> > update the generic one (sha1_object_info_extended)? My concern with just
>> > updating the loose object functions was that a caller might have
>> > obtained the path by iterating through the loose object dirs, and in
>> > that case we shouldn't query the external ODB for anything.
>>
>> You are thinking about fsck or gc?
>> Otherwise I don't think it would be clean to iterate through loose object dirs.
>
> Yes, fsck and gc (well, prune, I think) do that. I agree that Git
> typically doesn't do that (except for exceptional cases like fsck and
> gc), but I was thinking about supporting existing code that does that
> iteration, not introducing new code that does that.

I haven't taken a look at how fsck and prune work, and this is still
code that Peff wrote (though a long time ago), so I tend to trust it.
But I will take a look, and if it is indeed better for them, I am OK
with updating sha1_object_info_extended() instead of the loose object
functions.

Thanks.


Thread overview: 49+ messages
2017-09-16  8:06 [PATCH v6 00/40] Add initial experimental external ODB support Christian Couder
2017-09-16  8:06 ` [PATCH v6 01/40] builtin/clone: get rid of 'value' strbuf Christian Couder
2017-09-16  8:06 ` [PATCH v6 02/40] t0021/rot13-filter: refactor packet reading functions Christian Couder
2017-09-16  8:06 ` [PATCH v6 03/40] t0021/rot13-filter: improve 'if .. elsif .. else' style Christian Couder
2017-09-16  8:06 ` [PATCH v6 04/40] t0021/rot13-filter: improve error message Christian Couder
2017-09-16  8:06 ` [PATCH v6 05/40] t0021/rot13-filter: add packet_initialize() Christian Couder
2017-09-16  8:06 ` [PATCH v6 06/40] t0021/rot13-filter: add capability functions Christian Couder
2017-09-16  8:06 ` [PATCH v6 07/40] Add Git/Packet.pm from parts of t0021/rot13-filter.pl Christian Couder
2017-09-16  8:06 ` [PATCH v6 08/40] sha1_file: prepare for external odbs Christian Couder
2017-09-16  8:07 ` [PATCH v6 09/40] Add initial external odb support Christian Couder
2017-09-19 17:45   ` Jonathan Tan
2017-09-27 16:46     ` Christian Couder
2017-09-29 20:36       ` Jonathan Tan
2017-10-02 14:34         ` Ben Peart
2017-10-03  9:45         ` Christian Couder [this message]
2017-10-04  0:15           ` Jonathan Tan
2017-09-16  8:07 ` [PATCH v6 10/40] odb-helper: add odb_helper_init() to send 'init' instruction Christian Couder
2017-09-16  8:07 ` [PATCH v6 11/40] t0400: add 'put_raw_obj' instruction to odb-helper script Christian Couder
2017-09-16  8:07 ` [PATCH v6 12/40] external odb: add 'put_raw_obj' support Christian Couder
2017-09-16  8:07 ` [PATCH v6 13/40] external-odb: accept only blobs for now Christian Couder
2017-09-16  8:07 ` [PATCH v6 14/40] t0400: add test for external odb write support Christian Couder
2017-09-16  8:07 ` [PATCH v6 15/40] Add GIT_NO_EXTERNAL_ODB env variable Christian Couder
2017-09-16  8:07 ` [PATCH v6 16/40] Add t0410 to test external ODB transfer Christian Couder
2017-09-16  8:07 ` [PATCH v6 17/40] lib-httpd: pass config file to start_httpd() Christian Couder
2017-09-16  8:07 ` [PATCH v6 18/40] lib-httpd: add upload.sh Christian Couder
2017-09-16  8:07 ` [PATCH v6 19/40] lib-httpd: add list.sh Christian Couder
2017-09-16  8:07 ` [PATCH v6 20/40] lib-httpd: add apache-e-odb.conf Christian Couder
2017-09-16  8:07 ` [PATCH v6 21/40] odb-helper: add odb_helper_get_raw_object() Christian Couder
2017-09-16  8:07 ` [PATCH v6 22/40] pack-objects: don't pack objects in external odbs Christian Couder
2017-09-16  8:07 ` [PATCH v6 23/40] Add t0420 to test transfer to HTTP external odb Christian Couder
2017-09-16  8:07 ` [PATCH v6 24/40] external-odb: add 'get_direct' support Christian Couder
2017-09-16  8:07 ` [PATCH v6 25/40] odb-helper: add 'script_mode' to 'struct odb_helper' Christian Couder
2017-09-16  8:07 ` [PATCH v6 26/40] odb-helper: add init_object_process() Christian Couder
2017-09-16  8:07 ` [PATCH v6 27/40] Add t0450 to test 'get_direct' mechanism Christian Couder
2017-09-16  8:07 ` [PATCH v6 28/40] Add t0460 to test passing git objects Christian Couder
2017-09-16  8:07 ` [PATCH v6 29/40] odb-helper: add put_object_process() Christian Couder
2017-09-16  8:07 ` [PATCH v6 30/40] Add t0470 to test passing raw objects Christian Couder
2017-09-16  8:07 ` [PATCH v6 31/40] odb-helper: add have_object_process() Christian Couder
2017-09-16  8:07 ` [PATCH v6 32/40] Add t0480 to test "have" capability and raw objects Christian Couder
2017-09-16  8:07 ` [PATCH v6 33/40] external-odb: use 'odb=magic' attribute to mark odb blobs Christian Couder
2017-09-16  8:07 ` [PATCH v6 34/40] Add Documentation/technical/external-odb.txt Christian Couder
2017-09-16  8:07 ` [PATCH v6 35/40] clone: add 'initial' param to write_remote_refs() Christian Couder
2017-09-16  8:07 ` [PATCH v6 36/40] clone: add --initial-refspec option Christian Couder
2017-09-16  8:07 ` [PATCH v6 37/40] clone: disable external odb before initial clone Christian Couder
2017-09-16  8:07 ` [PATCH v6 38/40] Add tests for 'clone --initial-refspec' Christian Couder
2017-09-16  8:07 ` [PATCH v6 39/40] Add t0430 to test cloning using bundles Christian Couder
2017-09-16  8:07 ` [PATCH v6 40/40] Doc/external-odb: explain transfering objects and metadata Christian Couder
2017-10-02 14:18 ` [PATCH v6 00/40] Add initial experimental external ODB support Ben Peart
2017-10-03  6:32   ` Christian Couder
