git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Ævar Arnfjörð Bjarmason <avarab@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH 3/4] peel_ref: check object type before loading
Date: Thu, 4 Oct 2012 17:59:33 -0400
Message-ID: <20121004215933.GA17358@sigill.intra.peff.net> (raw)
In-Reply-To: <7vd30yx94r.fsf@alter.siamese.dyndns.org>

On Thu, Oct 04, 2012 at 01:41:40PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > [1] One thing I've been toying is with "external alternates"; dumping
> >     your large objects in some realtively slow data store (e.g., a
> >     RESTful HTTP service). You could cache and cheaply query a list of
> >     "sha1 / size / type" for each object from the store, but getting the
> >     actual objects would be much more expensive. But again, it would
> >     depend on whether you would actually have such a store directly
> >     accessible by a ref.
> 
> Yeah, that actually has been another thing we were discussing
> locally, without coming to something concrete enough to present to
> the list.
> 
> The basic idea is to mark such paths with attributes, and use a
> variant of smudge/clean filter that is _not_ a filter (as we do not
> want to have the interface to this external helper to be "we feed
> the whole big blob to you").  Instead, these smudgex/cleanx things
> work on a pathname.

There are a few ways that smudge/clean filters are not up to the task
currently. One is definitely the efficiency of the interface.

Another is that this distinction is not "canonical in-repo versus
working tree representation".  Yes, part of it is about what you store
in the repo versus what you checkout. But you may also want to diff or
merge these large items. Almost certainly not with the internal text
tools, but you might want to feed the _real_ blobs (not the fake
surrogates) to external tools, and smudge and clean are not a natural
fit for that. If you teach the external tools to dereference the
surrogates, it would work. I think you could also get around it by
giving these smudgex/cleanx filters slightly different semantics.

But the thing I most dislike about a solution like this is that the
representation pollutes git's data model. That is, git stores a blob
representing the surrogate, and that is the official sha1 that goes into
the tree. That means your surrogate decisions are locked into history
forever, which has a few implications:

  1. If I store a surrogate and you store the real blob, we do not get
     the same tree (and in fact get conflicts).

  2. Once I store a blob, I can never revise my decision not to store
     that blob in every subsequent clone without rewriting history. I
     can convert it into a surrogate, but old versions of that blob will
     always be required for object connectivity.

  3. If your surrogate contains more than just the sha1 (e.g., if it
     points to http://cdn.example.com), your history is forever tied to
     that (and you are stuck with configuring run-time redirects if your
     URL changes).

My thinking was to make it all happen below the git object level, at the
same level as packs and loose objects. This is more invasive to git, but
much more flexible.

The basic implementation is pretty straightforward. You configure one or
more helpers which have two operations: provide a list of sha1s, and
fetch a single sha1. Whenever an object lookup fails in the packs and
loose objects, we check the helper lists.  We can cache the lists in
some mmap'able format similar to a pack index, so we can do quick checks
for object existence, type, and size. And if the lookup actually needs
the object, we fetch and cache (where the caching policy would all be
determined by the external script).

The real challenges are:

  1. It is expensive to meet normal reachability guarantees. So you
     would not want a remote to delta against a blob that you _could_
     get, but do not have.

  2. You need to tell remotes that you can access some objects in a
     different way, and not to send them as part of an object transfer.

  3. Many commands need to be taught not to load objects carelessly. For
     the most part, we do well with this because it's already expensive
     to load large objects from disk. I think we've got most of them on
     the diff code path, but I wouldn't be surprised if one or two more
     crop up. Fsck would need to learn to handle these objects
     differently.

Item (3) is really just about trying it and seeing where the problems
come up. For items (1) and (2), we'd need a protocol extension.

Having the receiver send something like "I have these blobs from my external
source, don't send them" is a nice idea, but it doesn't scale well (you
have to advertise the whole list for each transfer, because the receiver
doesn't know which ones are actually referenced).

Having the receiver say "I have access to external database $FOO, go check
the list of objects there and don't send anything it has" unnecessarily
ties the sender to the external database (i.e., they have to implement
$FOO for it to work).

Something simple like "do not send blobs larger than N bytes, nor make
deltas against such blobs" would work. It does mean that your value of N
really needs to match up with what goes into your external database, but
I don't think that will be a huge problem in practice (after the
transfer you can do a consistency check that between the fetched objects
and the external db, you have everything).

> [details on smudgex/cleanx]

All of what you wrote seems very sane; I think the real question is
whether this should be "above" or "below" git's user-visible data model
layer.

-Peff

  reply index

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-10-03 12:36 upload-pack is slow with lots of refs Ævar Arnfjörð Bjarmason
2012-10-03 13:06 ` Nguyen Thai Ngoc Duy
2012-10-03 18:03 ` Jeff King
2012-10-03 18:53   ` Junio C Hamano
2012-10-03 18:55     ` Jeff King
2012-10-03 19:41       ` Shawn Pearce
2012-10-03 20:13         ` Jeff King
2012-10-04 21:52           ` Sascha Cunz
2012-10-05  0:20             ` Jeff King
2012-10-05  6:24         ` Johannes Sixt
2012-10-05 16:57           ` Shawn Pearce
2012-10-08 15:05             ` Johannes Sixt
2012-10-09  6:46               ` Shawn Pearce
2012-10-09 20:30                 ` Johannes Sixt
2012-10-09 20:46                   ` Johannes Sixt
2012-10-03 20:16   ` Ævar Arnfjörð Bjarmason
2012-10-03 21:20     ` Jeff King
2012-10-03 22:15       ` Ævar Arnfjörð Bjarmason
2012-10-03 23:15         ` Jeff King
2012-10-03 23:54           ` Ævar Arnfjörð Bjarmason
2012-10-04  7:56             ` [PATCH 0/4] optimizing upload-pack ref peeling Jeff King
2012-10-04  7:58               ` [PATCH 1/4] peel_ref: use faster deref_tag_noverify Jeff King
2012-10-04 18:24                 ` Junio C Hamano
2012-10-04  8:00               ` [PATCH 2/4] peel_ref: do not return a null sha1 Jeff King
2012-10-04 18:32                 ` Junio C Hamano
2012-10-04  8:02               ` [PATCH 3/4] peel_ref: check object type before loading Jeff King
2012-10-04 19:06                 ` Junio C Hamano
2012-10-04 19:41                   ` Jeff King
2012-10-04 20:41                     ` Junio C Hamano
2012-10-04 21:59                       ` Jeff King [this message]
2012-10-04  8:03               ` [PATCH 4/4] upload-pack: use peel_ref for ref advertisements Jeff King
2012-10-04  8:04               ` [PATCH 0/4] optimizing upload-pack ref peeling Jeff King
2012-10-04  9:01                 ` Ævar Arnfjörð Bjarmason
2012-10-04 12:14                   ` Nazri Ramliy
2012-10-03 22:32   ` upload-pack is slow with lots of refs Ævar Arnfjörð Bjarmason
2012-10-03 23:21     ` Jeff King
2012-10-03 23:47       ` Ævar Arnfjörð Bjarmason
2012-10-03 19:13 ` Junio C Hamano

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121004215933.GA17358@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git