git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Lars Schneider <larsxschneider@gmail.com>
Cc: Christian Couder <christian.couder@gmail.com>,
	git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>, Eric Wong <e@80x24.org>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: Re: [RFC/PATCH v3 00/16] Add initial experimental external ODB support
Date: Mon, 5 Dec 2016 08:23:34 -0500
Message-ID: <20161205132334.vlojtzecfhvhedew@sigill.intra.peff.net>
In-Reply-To: <A5ABBF3E-BED9-4FF3-9DE5-B529DEF0B8E8@gmail.com>

On Sat, Dec 03, 2016 at 07:47:51PM +0100, Lars Schneider wrote:

> >  - "<command> have": the command should output the sha1, size and
> > type of all the objects the external ODB contains, one object per
> > line.
> 
> This looks impractical. If a repo has 10k external files with
> 100 versions each then you need to read/transfer 1m hashes (this is
> not made up - I am working with Git repos that contain >>10k files
> in GitLFS).

Are you worried about the client-to-server communication, or the pipe
between git and the helper? I had imagined that the client-to-server
communication would happen infrequently and be cached.

But 1m hashes is 20MB, which is still a lot to dump over the pipe.
Another option is that Git defines a simple on-disk data structure
(e.g., a flat file of sorted 20-byte binary sha1s), and occasionally
asks the filter "please update your on-disk index".
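To make the flat-file idea concrete, here is a minimal sketch of a
sorted index of 20-byte binary sha1s with a binary-search lookup. The
function names and the file layout (no header, no fan-out table) are
assumptions for illustration only; nothing here is a format Git
defines.

```python
import os

HASH_LEN = 20  # bytes in a binary SHA-1


def write_index(path, hashes):
    """Write binary hashes to `path`, sorted and de-duplicated."""
    with open(path, "wb") as f:
        for h in sorted(set(hashes)):
            if len(h) != HASH_LEN:
                raise ValueError("expected a 20-byte binary hash")
            f.write(h)


def index_contains(path, wanted):
    """Binary-search the flat file for `wanted` without loading it all."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // HASH_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * HASH_LEN)
            entry = f.read(HASH_LEN)
            if entry == wanted:
                return True
            if entry < wanted:
                lo = mid + 1
            else:
                hi = mid
    return False
```

With 1m entries the file is 20MB, and a lookup touches about 20
entries, so a negative answer costs no pipe traffic and no network
request.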

That still leaves open the question of how the external odb script
efficiently gets updates from the server. It can use an ETag or similar
to avoid downloading an identical copy, but if one hash is added, we'd
want to know that efficiently. That is technically outside the scope of
the git<->external-odb interface, but obviously it's related. The design
of the on-disk format might make that problem easier or harder on the
external-odb script.
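For reference, the ETag approach on the helper side could look roughly
like the sketch below. The URL and the function names are hypothetical,
and this is only the standard HTTP conditional-GET pattern, not part of
any proposed interface. It shows the limitation mentioned above: the
answer is either "unchanged" or "here is the whole thing again".

```python
import urllib.error
import urllib.request


def build_request(url, etag=None):
    """Build a GET request that is conditional when an ETag is known."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req


def fetch_index(url, etag=None):
    """Return (body, etag); body is None if the server answers 304."""
    try:
        with urllib.request.urlopen(build_request(url, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached copy
            return None, etag
        raise
```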

> Wouldn't it be better if Git collects all hashes that it currently 
> needs and then asks the external ODBs if they have them?

I think you're going to run into latency problems when git wants to ask
"do we have this object" and expects the answer to be no. You wouldn't
want a network request for each.

And I think it would be quite complex to teach all operations to work on
a promise-like system where the answer to "do we have it" might be
"maybe; check back after you've figured out the whole batch of hashes
you're interested in".

> >  - "<command> get <sha1>": the command should then read from the
> > external ODB the content of the object corresponding to <sha1> and
> > output it on stdout.
> > 
> >  - "<command> put <sha1> <size> <type>": the command should then read
> > from stdin an object and store it in the external ODB.
> 
> Based on my experience with Git clean/smudge filters I think this kind 
> of single shot protocol will be a performance bottleneck as soon as 
> people store more than 1000 files in the external ODB.
> Maybe you can reuse my "filter process protocol" (edcc858) here?

Yeah. This interface comes from my original proposal, which used the
rationale "well, the files are big, so process startup shouldn't be a
big deal". And I don't think I wrote it down, but an implicit rationale
was "it seems to work for LFS, so it should work here too". But of
course LFS hit scaling problems, and so would this. It was one of the
reasons I was interested in making sure your filter protocol could be
used as a generic template, and I think we would want to do something
similar here.
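As a rough illustration of what "something similar" could mean, the
sketch below frames requests in pkt-line format (a 4-hex-digit length
that includes itself, followed by the payload, with "0000" as a flush
packet), which is the framing the filter process protocol uses. The
"command=get" and "sha1=..." keys are made up for this example and are
not a proposed wire format.

```python
import io

FLUSH = b"0000"
MAX_PAYLOAD = 65516  # pkt-line data limit (65520 minus the 4-byte header)


def pkt_line(payload):
    """Frame `payload` as one pkt-line."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one pkt-line")
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt(stream):
    """Read one packet; return its payload, or None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None
    if length < 4:
        raise ValueError("invalid pkt-line length")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload


def read_request(stream):
    """Collect payloads up to (not including) the next flush packet."""
    lines = []
    while True:
        payload = read_pkt(stream)
        if payload is None:
            return lines
        lines.append(payload)


def encode_get(sha1_hex):
    """Encode a hypothetical 'get' request for one object."""
    return (pkt_line(b"command=get\n")
            + pkt_line(b"sha1=" + sha1_hex.encode() + b"\n")
            + FLUSH)
```

Because each request ends in a flush packet, one long-running helper
can serve any number of requests over the same pair of pipes, which
avoids paying process startup per object.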

> > * Transfer
> > 
> > To transfer information about the blobs stored in external ODB, some
> > special refs, called "odb refs", similar to replace refs, are used.
> > 
> > For now there should be one odb ref per blob. Each ref name should be
> > refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
> > in the external odb named <odbname>.
> > 
> > These odb refs should all point to a blob that should be stored in the
> > Git repository and contain information about the blob stored in the
> > external odb. This information can be specific to the external odb.
> > The repos can then share this information using commands like:
> > 
> > `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`

I'd worry about scaling this part. Traditionally our refs storage does
not scale very well.
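A back-of-the-envelope sketch of the size involved, assuming every odb
ref ends up as one "<sha1> <refname>" line in packed-refs and using a
hypothetical odb name of "lfs" (the header line and any loose refs are
ignored):

```python
def packed_ref_line(odbname, blob_hex, info_hex):
    """One packed-refs entry for a ref named after the external blob."""
    return "{} refs/odbs/{}/{}\n".format(info_hex, odbname, blob_hex)


def packed_refs_bytes(odbname, n_refs):
    """Estimate the packed-refs size for `n_refs` odb refs."""
    sample = packed_ref_line(odbname, "0" * 40, "0" * 40)
    return len(sample) * n_refs
```

For the 1m objects from the example earlier in the thread,
packed_refs_bytes("lfs", 1000000) comes to 96,000,000 bytes, so the ref
listing alone is several times larger than the 20MB of raw hashes.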

-Peff
