Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Christian Couder <christian.couder@gmail.com>
To: Ben Peart <peartben@gmail.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Ben Peart <Ben.Peart@microsoft.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	Eric Wong <e@80x24.org>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: Re: [RFC/PATCH v4 00/49] Add initial experimental external ODB support
Date: Fri, 15 Sep 2017 14:56:57 +0200	[thread overview]
Message-ID: <CAP8UFD2ORL=5_wB7iQdLy9F47yvusieiRPbzCt1DwnhKtyT4PA@mail.gmail.com> (raw)
In-Reply-To: <b218004c-200c-b45a-0d0a-e273707d5aa0@gmail.com>

(It looks like I did not reply to this email yet, sorry about this late reply.)

On Thu, Jul 6, 2017 at 7:36 PM, Ben Peart <peartben@gmail.com> wrote:
>
> On 7/1/2017 3:41 PM, Christian Couder wrote:
>>
>> On Fri, Jun 23, 2017 at 8:24 PM, Ben Peart <peartben@gmail.com> wrote:
>>>
>>> Great to see this making progress!
>>>
>>> My thoughts and questions are mostly about the overall design tradeoffs.
>>>
>>> Is your intention to enable the ODB to completely replace the regular
>>> object
>>> store or just to supplement it?
>>
>> It is to supplement it, as I think the regular object store works very
>> well most of the time.
>
> I certainly understand the desire to restrict the scope of the patch series.
> I know full replacement is a much larger problem as it would touch much more
> of the codebase.
>
> I'd still like to see an object store that was thread safe, more robust (ie
> transactional) and hopefully faster so I am hoping we can design the ODB
> interface to eventually enable that.

I doubt that the way Git and the external odb helpers communicate in
process mode is good enough for multi-threading, so I think this would
require another communication mechanism altogether.

> For example: it seems the ODB helpers need to be able to be called before
> the regular object store in the "put" case (so they can intercept large
> objects for example) and after in in the "get" case to enable "fault-in."
> Something like this:
>
> have/get
> ========
> git object store
> large object ODB helper
>
> put
> ===
> large object ODB helper
> git object store
>
> It would be nice if that order wasn't hard coded but that the order or level
> of the "git object store" could be specified using the same mechanism as
> used for the ODB helpers so that some day you could do something like this:
>
> have/get
> ========
> "LMDB" ODB helper
> git object store
>
> put
> ===
> "LMDB" ODB helper
> git object store
>
> (and even further out, drop the current git object store completely :)).

Yeah, I understand that it could help.

>>> I think it would be good to ensure the
>>> interface is robust and performant enough to actually replace the current
>>> object store interface (even if we don't actually do that just yet).
>>
>>
>> I agree that it should be robust and performant, but I don't think it
>> needs to be as performant in all cases as the current object store
>> right now.
>>
>>> Another way of asking this is: do the 3 verbs (have, get, put) and the 3
>>> types of "get" enable you to wrap the current loose object and pack file
>>> code as ODBs and run completely via the external ODB interface?  If not,
>>> what is missing and can it be added?
>
> One example of what I think is missing is a way to stream objects (ie
> get_stream, put_stream).  This isn't used often in git but it did exist last
> I checked.  I'm not saying this needs to be supported in the first version -
> more if we want to support total replacement.

I agree and it seems to me that others have already pointed that the
streaming API could be used.

> I also wonder if we'd need an "optimize" verb (for "git gc") or a "validate"
> verb (for "git fsck").  Again, only if/when we are looking at total
> replacement.

Yeah, I agree that something might be useful for these commands.

>> Right now the "put" verb only send plain blobs, so the most logical
>> way to run completely via the external ODB interface would be to use
>> it to send and receive plain blobs. There are tests scripts (t0420,
>> t0470 and t0480) that use an http server as the external ODB and all
>> the blobs are stored in it.
>>
>> And yeah for now it works only for blobs. There is a temporary patch
>> in the series that limits it to blobs. For the non RFC patch series, I
>> think it should either use the attribute system to tell which objects
>> should be run via the external ODB interface, or perhaps there should
>> be a way to ask each external ODB helper which kind of objects and
>> blobs it can handle. I should add that in the future work part.
>
> Sounds good.  For GVFS we handle all object types (including commits and
> trees) so would need this to be enabled so that we can switch to using it.

Ok.

>>> _Eventually_ it would be great to see the current object store(s) moved
>>> behind the new ODB interface.
>>
>> This is not one of my goals and I think it could be a problem if we
>> want to keep the "fault in" mode.
>
>> In this mode the helper writes or reads directly to or from the
>> current object store, so it needs the current object store to be
>> available.
>
> I think implementing "fault in" should be an option that the ODB handler can
> implement but should not be required by the design/interface.  As you state
> above, this could be as simple as having the ODB handler write the object to
> the git object store on "get."

This is 'get_direct' since v5 and yeah it is optional.

>> Also I think compatibility with other git implementations is important
>> and it is a good thing that they can all work on a common repository
>> format.
>
> I agree this should be an option but I don't want to say we'll _never_ move
> to a better object store.

I agree but that looks like a different problem with a lot of additional issues.

>>> When there are multiple ODB providers, what is the order they are called?
>>
>> The external_odb_config() function creates the helpers for the
>> external ODBs in the order they are found in the config file, and then
>> these helpers are called in turn in the same order.
>>
>>> If one fails a request (get, have, put) are the others called to see if
>>> they
>>> can fulfill the request?
>>
>> Yes, but there are no tests to check that it works well. I will need
>> to add some.
>>
>>> Can the order they are called for various verb be configured explicitly?
>>
>> Right now, you can configure the order by changing the config file,
>> but the order will be the same for all the verbs.
>
> Do you mean it will work like this?
>
> have/get
> ========
> git object store
> first ODB helper
> second ODB helper
> third ODB helper
>
> put
> ===
> first ODB helper
> second ODB helper
> third ODB helper
> git object store

Yes.

> If so, I'd prefer having more flexibility and be able to specify where the
> "git object store" fits in the stack along with the ODB helpers so that you
> could eventually support the "LMDB" ODB helper example above.

Maybe this could be supported later with odb.<odbname>.priority or
odb.<odbname>.<instruction>_priority config option that could be
possibly negative integer.
A negative integer would means before the git object store and a
positive one after the object store.

>>> For
>>> example, it would be nice to have a "large object ODB handler" configured
>>> to
>>> get first try at all "put" verbs.  Then if it meets it's size
>>> requirements,
>>> it will handle the verb, otherwise it fail and git will try the other
>>> ODBs.
>>
>> This can work if the "large object ODB handler" is configured first.
>>
>> Also this is linked with how you define which objects are handled by
>> which helper. For example if the attribute system is used to describe
>> which external ODB is used for which files, there could be a way to
>> tell for example that blobs larger than 1MB are handled by the "large
>> object ODB handler" while those that are smaller are handled by
>> another helper.
>
> I can see that using at attribute system could get complex to implement
> generically in git.  It seems defining the set of attributes and the rules
> for matching against them could get complex.
>
> I wonder if it is sufficient to just have a hierarchy and then let each ODB
> handler determine which objects it wants to handle based on whatever
> criteria (size, object type, etc) it wants.

Yeah, I think that could be interesting to have later.

> The downside of doing it this way is that calling the various handlers needs
> to be fast as there may be a lot of calls passed through to the next
> handler.  I think this can be accomplished by using persistent background
> processes instead of spawning a new one on each call.

Yeah, this kind of mechanism would not be so simple to design if we
want things to be fast. That's why I prefer to leave it for later.

>> Yeah, capabilities are the same, but I think the script mode has
>> value, because helper are simpler to  write and debug.
>> If for example one wants to use the external ODB system to implement
>> clone bundles, like what is done in t0430 at the end of the patch
>> series, then it's much simpler if the helper uses the script mode, and
>> there is no performance downside as the helper will be called once.
>>
>> I think people might want to implement these kinds of helpers to just
>> setup a repo quickly and properly for their specific needs.
>> For example for companies using big monorepos, there could be
>> different helpers for different kind of people working on different
>> parts of the code. For example one for front end developers and one
>> for back end developers.
>
> I think the ease of writing a script mode helper vs sub-process can be
> alleviated with a library of helper routines to help with the sub-process
> interface.  You have a start on that already and given it is also already
> used by the filter interface there should be a good body of code to
> copy/paste for future implementations.  I've written them in perl and in
> C/C++ and once you have some helper functions, it's really quite simple.

It's quite simple once you know how to use the libraries, and then the
debugging is also more complex.
In the following email I gave an example of how simple some things can
be with the script mode:

https://public-inbox.org/git/CAP8UFD3ZV4Ezucn+Tv-roY6vzDyk2j4ypRsNR1YbOqoQK_qr8A@mail.gmail.com/

>> Also my goal is to share as much of the implementation as possible
>> between the script and the sub process mode. I think this can help us
>> be confident that the overall design is good.
>
> Ahh, but a single implementation is easier to code, test and maintain than
> two implementations - even with shared code. :)

Sure.

>>>> There are 3 different kinds of "get" instructions depending on how the
>>>> helper passes objects to Git:
>>>>
>>>>     - "fault_in": the helper will write the requested objects directly
>>>>       into the regular Git object database, and then Git will retry
>>>>       reading it from there.
>>>>
>>>
>>> I think the "fault_in" behavior can be implemented efficiently without
>>> the
>>> overhead of a 3rd special "get" instruction if we enable some of the
>>> other
>>> capabilities discussed.
>>
>> The "fault_in" behavior will be just a special kind of "get"
>> capability. Git needs to know that the helper has this capability, but
>> there should be no overhead. So I am not sure if I understand you
>> properly.
>
> Hopefully my sample diagrams above better illustrated what I was thinking
> and why I don't think git needs to be aware of "fault_in" behaviors.

I am not sure I understand as if a helper has some 'get_*' capability
Git needs to know anyway about that.

>>> I don't believe there is.  Instead, I think we should allow the various
>>> "partial clone" patch series already in progress solve the problem of how
>>> you do a partial clone of a repo.
>>
>> I think this is not really related to partial clone, but maybe I
>> should take a closer look at these patch series.
>
> My thought was that since a standard clone copies all objects into the local
> object store, there is no need for a way to retrieve "missing" objects as,
> by definition, none are missing.
>
> The various partial clone patch series are discussing how to clone a repo
> and _not_ copy down all objects which creates the need for a way to retrieve
> "missing" objects.  They are also dealing with how to enable git to know the
> object is intentionally missing (ie "have" can succeed but "get" will have
> to go retrieve the missing object before it can return it).
>
> This enables having missing objects without some local placeholder (ref) so
> that repos with large numbers of objects can still do a fast clone as they
> don't have to download millions of refs for the "missing" objects.

There could be just one ref pointing to a blob that contains a custom
service URL and the helper can just use the custom service URL to get
all the info it wants.
The one ref per missing object is obviously when you have a very small
number of missing objects.

next prev parent reply	other threads:[~2017-09-15 12:57 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-20  7:54 [RFC/PATCH v4 00/49] Add initial experimental external ODB support Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 01/49] builtin/clone: get rid of 'value' strbuf Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 02/49] t0021/rot13-filter: refactor packet reading functions Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 03/49] t0021/rot13-filter: improve 'if .. elsif .. else' style Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 04/49] Add Git/Packet.pm from parts of t0021/rot13-filter.pl Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 05/49] t0021/rot13-filter: use Git/Packet.pm Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 06/49] Git/Packet.pm: improve error message Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 07/49] Git/Packet.pm: add packet_initialize() Christian Couder
2017-06-23 18:55   ` Ben Peart
2017-06-20  7:54 ` [RFC/PATCH v4 08/49] Git/Packet: add capability functions Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 09/49] Add initial external odb support Christian Couder
2017-06-23 19:49   ` Ben Peart
2017-06-20  7:54 ` [RFC/PATCH v4 10/49] external odb foreach Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 11/49] t0400: add 'put' command to odb-helper script Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 12/49] external odb: add write support Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 13/49] external-odb: accept only blobs for now Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 14/49] t0400: add test for external odb write support Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 15/49] Add GIT_NO_EXTERNAL_ODB env variable Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 16/49] Add t0410 to test external ODB transfer Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 17/49] lib-httpd: pass config file to start_httpd() Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 18/49] lib-httpd: add upload.sh Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 19/49] lib-httpd: add list.sh Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 20/49] lib-httpd: add apache-e-odb.conf Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 21/49] odb-helper: add 'store_plain_objects' to 'struct odb_helper' Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 22/49] pack-objects: don't pack objects in external odbs Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 23/49] t0420: add test with HTTP external odb Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 24/49] odb-helper: start fault in implementation Christian Couder
2017-06-20  7:54 ` [RFC/PATCH v4 25/49] external-odb: add external_odb_fault_in_object() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 26/49] odb-helper: add script_mode Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 27/49] Documentation: add read-object-protocol.txt Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 28/49] contrib: add long-running-read-object/example.pl Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 29/49] Add t0410 to test read object mechanism Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 30/49] odb-helper: add read_object_process() Christian Couder
2017-07-10 15:57   ` Ben Peart
2017-06-20  7:55 ` [RFC/PATCH v4 31/49] external-odb: add external_odb_get_capabilities() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 32/49] t04*: add 'get_cap' support to helpers Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 33/49] odb-helper: call odb_helper_lookup() with 'have' capability Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 34/49] odb-helper: fix odb_helper_fetch_object() for read_object Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 35/49] Add t0460 to test passing git objects Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 36/49] odb-helper: add read_packetized_git_object_to_fd() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 37/49] odb-helper: add read_packetized_plain_object_to_fd() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 38/49] Add t0470 to test passing plain objects Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 39/49] odb-helper: add write_object_process() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 40/49] Add t0480 to test "have" capability and plain objects Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 41/49] external-odb: add external_odb_do_fetch_object() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 42/49] odb-helper: advertise 'have' capability Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 43/49] odb-helper: advertise 'put' capability Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 44/49] odb-helper: add have_object_process() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 45/49] clone: add initial param to write_remote_refs() Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 46/49] clone: add --initial-refspec option Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 47/49] clone: disable external odb before initial clone Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 48/49] Add test for 'clone --initial-refspec' Christian Couder
2017-06-20  7:55 ` [RFC/PATCH v4 49/49] t: add t0430 to test cloning using bundles Christian Couder
2017-06-20 13:48 ` [RFC/PATCH v4 00/49] Add initial experimental external ODB support Christian Couder
2017-06-23 18:24 ` Ben Peart
2017-07-01 19:41   ` Christian Couder
2017-07-01 20:12     ` Christian Couder
2017-07-01 20:33     ` Junio C Hamano
2017-07-02  4:25       ` Christian Couder
2017-07-03 16:56         ` Junio C Hamano
2017-07-06 17:36     ` Ben Peart
2017-09-15 12:56       ` Christian Couder [this message]
2017-07-12 19:06 ` Jonathan Tan
2017-09-15 13:16   ` Christian Couder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAP8UFD2ORL=5_wB7iQdLy9F47yvusieiRPbzCt1DwnhKtyT4PA@mail.gmail.com' \
    --to=christian.couder@gmail.com \
    --cc=Ben.Peart@microsoft.com \
    --cc=chriscool@tuxfamily.org \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    --cc=mh@glandium.org \
    --cc=pclouds@gmail.com \
    --cc=peartben@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).