From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Christian Couder <christian.couder@gmail.com>,
	Duy Nguyen <pclouds@gmail.com>, git <git@vger.kernel.org>
Subject: Re: [PATCH v2 4/4] bundle v3: the beginning
Date: Tue, 7 Jun 2016 16:23:51 -0400
Message-ID: <20160607202351.GA5726@sigill.intra.peff.net>
In-Reply-To: <xmqqinxkpzur.fsf@gitster.mtv.corp.google.com>

On Tue, Jun 07, 2016 at 12:23:40PM -0700, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
> 
> > Git can store its objects only in the form of loose objects in
> > separate files or packed objects in a pack file.
> > To be able to better handle some kind of objects, for example big
> > blobs, it would be nice if Git could store its objects in other object
> > databases (ODB).
> >
> > To do that, this patch series makes it possible to register commands,
> > using "odb.<odbname>.command" config variables, to access external
> > ODBs. Each specified command will then be called the following ways:
> 
> Hopefully it is done via a cheap RPC instead of forking/execing the
> command for each and every object lookup.

This interface comes from my earlier patches, so I'll try to shed a
little light on the decisions I made there.

Because this "external odb" essentially acts as a git alternate, we
would hit it only when we couldn't find an object through regular means.
Git would then make the object available in the usual on-disk format
(probably as a loose object).

So in most processes, we would not need to consult the odb command at
all. And when we do, the first thing would be to get its "have" list,
which would at most run once per process.

So the per-object cost is really calling "get", and my assumption there
was that the cost of actually retrieving the object over the network
would dwarf the fork/exec cost.
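
To make the fault-in flow above concrete, here is a minimal sketch in Python. It is not code from the patches: the `ExternalOdb` class, the `read_object` function and the dict standing in for the local object store are hypothetical names used only for illustration. The sketch assumes the helper prints one "<sha1> <size> <type>" line per object for "have" and writes raw object content to stdout for "get".

```python
import subprocess


class ExternalOdb:
    """Hypothetical wrapper around one "odb.<odbname>.command" helper."""

    def __init__(self, command):
        self.command = list(command)  # argv prefix of the helper
        self._have = None             # manifest, loaded lazily
        self.have_calls = 0           # how many times "have" was run

    def have(self):
        # Run "<command> have" at most once per process and remember it.
        if self._have is None:
            self.have_calls += 1
            out = subprocess.run(
                self.command + ["have"],
                check=True, capture_output=True, text=True,
            ).stdout
            self._have = {}
            for line in out.splitlines():
                if not line.strip():
                    continue
                sha1, size, otype = line.split()
                self._have[sha1] = (int(size), otype)
        return self._have

    def get(self, sha1):
        # One fork/exec per faulted-in object.
        return subprocess.run(
            self.command + ["get", sha1],
            check=True, capture_output=True,
        ).stdout


def read_object(sha1, local_objects, odb):
    """Look locally first; consult the external odb only on a miss."""
    if sha1 in local_objects:
        return local_objects[sha1]
    if sha1 not in odb.have():
        raise KeyError(sha1)
    data = odb.get(sha1)
    local_objects[sha1] = data  # stand-in for writing a loose object
    return data
```

A second lookup of the same object is served from the local store, so neither "have" nor "get" is run again for it.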

I also waffled on having git cache the output of "<command> have" in
some fast-lookup format to save even the single fork/exec. But I figured
that was something that could be added later if needed.

You'll note that this is sort of a "fault-in" model. Another model would
be to treat external odb updates similar to fetches. I.e., we touch the
network only during a special update operation, and then try to work
locally with whatever the external odb has. IMHO this policy could
actually be up to the external odb itself (i.e., its "have" command
could serve from a local cache if it likes).

> >   - "<command> have": the command should output the sha1, size and
> > type of all the objects the external ODB contains, one object per
> > line.
> 
> Why size and type at this point is needed by the clients?  That is
> more expensive to compute than just a bare list of object names.

Yes, but it lets git avoid doing a lot of "get" operations. For example,
in a regular diff without binary-diffs enabled, we can automatically
determine that a diff will be considered binary based purely on the size
of the objects (related to core.bigfilethreshold). So if we know the
sizes, we can run "git log -p" without faulting-in each of the objects
just to say "woah, that looks binary".

One can accomplish this with .gitattributes, too, of course, but the
size thing just works out of the box.

There are other places where it will come in handy, too. E.g., fscking a
tree object you have, you want to make sure that the object referred to
with mode 100644 is actually a blob.
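
As an illustration of both uses, here is a rough Python sketch of answering these questions from the "have" manifest alone, without any "get". The function names are hypothetical, and the manifest format ("<sha1> <size> <type>" per line) is assumed from the description above. The 512 MiB figure is the documented default of core.bigFileThreshold; the exact comparison Git uses is simplified here.

```python
# Default of core.bigFileThreshold (512 MiB); used here as an assumption.
BIG_FILE_THRESHOLD = 512 * 1024 * 1024

# Object type expected for the tree entry modes handled in this sketch.
EXPECTED_TYPE = {
    "100644": "blob",  # regular file
    "100755": "blob",  # executable file
    "120000": "blob",  # symlink
    "40000": "tree",   # subdirectory
}


def parse_have(text):
    """Parse "<sha1> <size> <type>" lines into {sha1: (size, type)}."""
    manifest = {}
    for line in text.splitlines():
        if line.strip():
            sha1, size, otype = line.split()
            manifest[sha1] = (int(size), otype)
    return manifest


def binary_by_size(manifest, sha1, threshold=BIG_FILE_THRESHOLD):
    """Decide "treat as binary" from the size alone, with no "get"."""
    size, _ = manifest[sha1]
    return size > threshold


def tree_entry_type_ok(manifest, mode, sha1):
    """Check that a tree entry's mode agrees with the object's type."""
    _, otype = manifest[sha1]
    return EXPECTED_TYPE.get(mode) == otype
```

With this, something like "git log -p" could skip the big objects entirely, and an fsck of a local tree could validate entry types for objects that only exist in the external odb.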

I also don't think the cost to compute size and type on the server is
all that important. Yes, if you're backing your external odb with a git
repository that runs "git cat-file" on the fly, it is more expensive.
But in practice, I'd expect the server side to create a static manifest
and serve it over HTTP (this also gives the benefit of things like
ETags).

> >   - "<command> get <sha1>": the command should then read from the
> > external ODB the content of the object corresponding to <sha1> and
> > output it on stdout.
> 
> The type and size should be given at this point.

I don't think there's a reason not to; I didn't do so here because it
would be redundant with what Git already knows from the "have" manifest
it receives above.

> >   - "<command> put <sha1> <size> <type>": the command should then read
> > from stdin an object and store it in the external ODB.
> 
> Is ODB required to sanity check that <sha1> matches what the data
> hashes down to?

I think that would be up to the ODB, but it does seem like a good idea.

Likewise, I'm not sure if "get" should be allowed to return contents
that don't match the sha1. That would be fine for things like "diff",
but would probably make "fsck" unhappy.
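
For reference, the sanity check in question is cheap to express. The sketch below (Python; the function names are hypothetical) relies on the fact that a Git object name is the SHA-1 of a "<type> <size>\0" header followed by the content. Whether "put" or "get" should enforce it is the open policy question above.

```python
import hashlib


def git_object_sha1(otype, content):
    """Hash content the way Git names objects: "<type> <size>\\0" + data."""
    header = f"{otype} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def matches(sha1, size, otype, content):
    """True if content has the claimed size and hashes to the claimed sha1."""
    return len(content) == size and git_object_sha1(otype, content) == sha1
```

For example, `git_object_sha1("blob", b"")` gives e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known name of the empty blob.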

> If this thing is primarily to offload large blobs, you might also
> want not "get" but "checkout <sha1> <path>" to bypass Git entirely,
> but I haven't thought it through.

My mental model is that the external odb gets the object into the local
odb, and then you can use the regular streaming-checkout code path. And
the local odb serves as your cache.

That does mean you might have two copies of each object (one in the odb,
and one in the working tree), as opposed to a true cacheless system,
which can get away with one.

I think you could do that cacheless thing with the interface here,
though; the "get" operation can stream, and you can stream directly to
the working tree.
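
A minimal sketch of that cacheless variant, in Python, assuming only that "<command> get <sha1>" writes the object content to stdout. The function name is hypothetical. A real checkout would also have to deal with things this ignores, such as file modes, content filters and verifying what was received.

```python
import os
import subprocess


def stream_get_to_path(command, sha1, path):
    """Run "<command> get <sha1>" with stdout pointed at the target file.

    The helper writes straight into the working-tree file, so the
    content is never buffered in this process or stored in a local odb.
    """
    try:
        with open(path, "wb") as dst:
            subprocess.run(list(command) + ["get", sha1],
                           stdout=dst, check=True)
    except subprocess.CalledProcessError:
        # Do not leave a partial file behind if the helper failed.
        os.unlink(path)
        raise
```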

-Peff
