From: Christian Couder <christian.couder@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: [RFC/PATCH 0/8] Add initial experimental external ODB support
Date: Mon, 13 Jun 2016 10:55:38 +0200
Message-ID: <20160613085546.11784-1-chriscool@tuxfamily.org>

Goal
~~~~

Git can store its objects only in the form of loose objects in
separate files or packed objects in a pack file.
To be able to better handle some kinds of objects, for example big
blobs, it would be nice if Git could store its objects in other object
databases (ODB).
To do that, this patch series makes it possible to register commands,
using "odb.<odbname>.command" config variables, to access external
ODBs where objects can be stored and retrieved.

Design
~~~~~~

Each registered command manages access to one external ODB and will be
called in the following ways:

  - "<command> have": the command should output the sha1, size and
type of all the objects the external ODB contains, one object per
line.

  - "<command> get <sha1>": the command should then read from the
external ODB the content of the object corresponding to <sha1> and
output it on stdout.

  - "<command> put <sha1> <size> <type>": the command should then read
from stdin an object and store it in the external ODB.
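
To make the three calls above concrete, here is a minimal sketch of a
toy helper, written in Python for illustration only. It is not the
odb-helper script from t0400. The DirectoryOdb class, the
"<sha1>.<type>" file layout, the space-separated "have" line format
and the choice to emit raw object content on "get" are all assumptions
of this sketch; the series itself may expect a different format.

```python
import hashlib
import os
import tempfile


def git_object_id(obj_type, data):
    """Return the SHA-1 Git assigns to an object with this type and content."""
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class DirectoryOdb:
    """Toy external ODB: one file per object, named '<sha1>.<type>'.

    The layout is hypothetical and only serves this sketch.
    """

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def have(self):
        """Yield one 'sha1 size type' line per stored object."""
        for name in sorted(os.listdir(self.root)):
            sha1, obj_type = name.split(".", 1)
            size = os.path.getsize(os.path.join(self.root, name))
            yield f"{sha1} {size} {obj_type}"

    def get(self, sha1):
        """Return the content of the object named by sha1."""
        for name in os.listdir(self.root):
            if name.startswith(sha1 + "."):
                with open(os.path.join(self.root, name), "rb") as f:
                    return f.read()
        raise KeyError(sha1)

    def put(self, sha1, size, obj_type, data):
        """Store data after checking it against the announced sha1 and size."""
        if len(data) != size or git_object_id(obj_type, data) != sha1:
            raise ValueError("object does not match the announced sha1/size/type")
        with open(os.path.join(self.root, f"{sha1}.{obj_type}"), "wb") as f:
            f.write(data)


def dispatch(odb, args, stdin_data=b""):
    """Map 'have', 'get <sha1>' or 'put <sha1> <size> <type>' to stdout bytes."""
    if args[0] == "have":
        return "".join(line + "\n" for line in odb.have()).encode()
    if args[0] == "get":
        return odb.get(args[1])
    if args[0] == "put":
        odb.put(args[1], int(args[2]), args[3], stdin_data)
        return b""
    raise ValueError(f"unknown command: {args[0]}")


if __name__ == "__main__":
    # Small demonstration: store one blob, then list what the ODB has.
    with tempfile.TemporaryDirectory() as tmp:
        odb = DirectoryOdb(tmp)
        blob = b"hello\n"
        sha1 = git_object_id("blob", blob)
        dispatch(odb, ["put", sha1, str(len(blob)), "blob"], blob)
        print(dispatch(odb, ["have"]).decode(), end="")
```

A real helper would read "put" data from stdin and write "get" data to
stdout; dispatch() takes and returns bytes here only so that the sketch
can be exercised without spawning a process.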

This RFC patch series for now does not address the following important
parts of a complete solution:

  - There is no way to transfer external ODB content using Git.

  - No real external ODB has been interfaced with Git. The tests use
another git repo in a separate directory for this purpose which is
probably useless in the real world.

Design discussion about performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yeah, it is not efficient to fork/exec a command just to read one
object from, or write one object to, the external ODB. Batch calls,
a daemon and/or RPC should be used instead to be able to store regular
objects in an external ODB. But for now the external ODB would be all
about really big files, where the cost of a fork+exec should not
matter much. If we later want to extend usage of external ODBs, yeah
we will probably need to design other mechanisms.

Here are some related explanations from Peff:

{{{
Because this "external odb" essentially acts as a git alternate, we
would hit it only when we couldn't find an object through regular means.
Git would then make the object available in the usual on-disk format
(probably as a loose object).

So in most processes, we would not need to consult the odb command at
all. And when we do, the first thing would be to get its "have" list,
which would at most run once per process.

So the per-object cost is really calling "get", and my assumption there
was that the cost of actually retrieving the object over the network
would dwarf the fork/exec cost.

I also waffled on having git cache the output of "<command> have" in
some fast-lookup format to save even the single fork/exec. But I figured
that was something that could be added later if needed.

You'll note that this is sort of a "fault-in" model. Another model would
be to treat external odb updates similar to fetches. I.e., we touch the
network only during a special update operation, and then try to work
locally with whatever the external odb has. IMHO this policy could
actually be up to the external odb itself (i.e., its "have" command
could serve from a local cache if it likes).
}}}
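
The "fault-in" model Peff describes can be sketched as follows. This
is an illustrative model only, not the code in external-odb.c:
CountingHelper stands in for a registered odb command, FaultInReader
for Git's object lookup, and both names are made up for this sketch.
It shows that "have" runs at most once per process and that "get" is
only called for objects missing locally.

```python
class CountingHelper:
    """Stand-in for a registered odb command; counts simulated fork/exec calls."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.calls = {"have": 0, "get": 0}

    def have(self):
        self.calls["have"] += 1
        return set(self._objects)

    def get(self, sha1):
        self.calls["get"] += 1
        return self._objects[sha1]


class FaultInReader:
    """Look objects up locally first and fall back to the helper on a miss."""

    def __init__(self, local_objects, helper):
        self.local = dict(local_objects)
        self.helper = helper
        self._have = None  # filled lazily, at most once per process

    def read(self, sha1):
        if sha1 in self.local:
            return self.local[sha1]  # regular means: the helper is not consulted
        if self._have is None:
            self._have = self.helper.have()
        if sha1 not in self._have:
            raise KeyError(sha1)
        data = self.helper.get(sha1)
        self.local[sha1] = data  # now available locally, like a loose object
        return data


if __name__ == "__main__":
    helper = CountingHelper({"b": b"big blob", "c": b"other blob"})
    reader = FaultInReader({"a": b"small"}, helper)
    for name in ["a", "b", "c", "b"]:
        reader.read(name)
    print(helper.calls)
```

The short keys "a", "b" and "c" replace real sha1s to keep the example
readable.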

Implementation
~~~~~~~~~~~~~~

This series adds a set of functions in external-odb.{c,h} that are
called by the rest of Git to manage all the external ODBs.

These functions use 'struct odb_helper' and its associated functions
defined in odb-helper.{c,h} to talk to the different external ODBs by
launching the configured "odb.<odbname>.command" commands and writing
to or reading from them.

The tests in this series create an odb-helper script that is
registered using the "odb.magic.command" config variable, and then
called to read from and write to the external ODB.
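
For readers less used to dotted config names, "odb.magic.command"
corresponds to a config file entry like the one below. The helper path
is a placeholder, not the path used by the tests.

```
[odb "magic"]
	command = /path/to/odb-helper
```

The same entry can be created with
"git config odb.magic.command /path/to/odb-helper".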

High-level view of the patches in the series
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    - Patches 01/08 and 02/08 are Peff's initial work fixed a little
      bit so that it compiles and passes tests.

    - Patch 03/08 is an optimization in the odb-helper script that
      is used for testing. I will probably squash it into 01/08.

    - Patches 04/08 and 05/08 are adding "put" support in the
      odb-helper script and testing that.

    - Patches 06/08 and 08/08 are enhancing external-odb.{c,h} and
      odb-helper.{c,h}, so that Git can write into an external ODB.

    - Patch 07/08 limits write support to "blobs" for now to
      simplify things.

Future work
~~~~~~~~~~~

From the discussions it appears that using the bundle v3 mechanism to
transfer external ODB data could work, but only if the server has access
to its own external ODB.

Another possible mechanism to transfer external ODB data would be some
kind of replace refs. This would be slower but the mechanism for the
transfer already fully exists.

So I think I am going to experiment with some kind of replace refs.

One interesting thing also would be to use the streaming API when
reading from or writing to the external ODB.

Previous work and discussions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Peff started to work on this and discussed it some years ago:

http://thread.gmane.org/gmane.comp.version-control.git/206886/focus=207040
http://thread.gmane.org/gmane.comp.version-control.git/247171
http://thread.gmane.org/gmane.comp.version-control.git/202902/focus=203020

His work, which is not compile-tested any more, is still there:

https://github.com/peff/git/commits/jk/external-odb-wip

Initial discussions about this new series are there:

http://thread.gmane.org/gmane.comp.version-control.git/288151/focus=295160

Links
~~~~~

This patch series is available here:

https://github.com/chriscool/git/commits/external-odb


Christian Couder (6):
  t0400: use --batch-all-objects to get all objects
  t0400: add 'put' command to odb-helper script
  t0400: add test for 'put' command
  external odb: add write support
  external-odb: accept only blobs for now
  t0400: add test for external odb write support

Jeff King (2):
  Add initial external odb support
  external odb foreach

 Makefile                |   2 +
 cache.h                 |   9 ++
 external-odb.c          | 148 +++++++++++++++++++++++++
 external-odb.h          |  16 +++
 odb-helper.c            | 287 ++++++++++++++++++++++++++++++++++++++++++++++++
 odb-helper.h            |  32 ++++++
 sha1_file.c             |  66 ++++++++---
 t/t0400-external-odb.sh |  77 +++++++++++++
 8 files changed, 622 insertions(+), 15 deletions(-)
 create mode 100644 external-odb.c
 create mode 100644 external-odb.h
 create mode 100644 odb-helper.c
 create mode 100644 odb-helper.h
 create mode 100755 t/t0400-external-odb.sh

-- 
2.9.0.rc2.362.g3cd93d0
