git@vger.kernel.org mailing list mirror (one of many)
From: Christian Couder <christian.couder@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>,
	Ben Peart <Ben.Peart@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	Eric Wong <e@80x24.org>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: [PATCH v6 34/40] Add Documentation/technical/external-odb.txt
Date: Sat, 16 Sep 2017 10:07:25 +0200
Message-ID: <20170916080731.13925-35-chriscool@tuxfamily.org>
In-Reply-To: <20170916080731.13925-1-chriscool@tuxfamily.org>

This describes the external odb mechanism's purpose and
how it works.

Helped-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/technical/external-odb.txt | 342 +++++++++++++++++++++++++++++++
 1 file changed, 342 insertions(+)
 create mode 100644 Documentation/technical/external-odb.txt

diff --git a/Documentation/technical/external-odb.txt b/Documentation/technical/external-odb.txt
new file mode 100644
index 0000000000..58ec8a8145
--- /dev/null
+++ b/Documentation/technical/external-odb.txt
@@ -0,0 +1,342 @@
+External ODBs
+^^^^^^^^^^^^^
+
+The External ODB mechanism makes it possible to store Git objects
+(only blobs for now) in an "external object database" (External
+ODB).
+
+An External ODB can be any object store as long as there is a helper
+program called an "odb helper" that can communicate with Git to
+transfer objects to/from the external odb and to retrieve information
+about available objects in the external odb.
+
+Purpose
+=======
+
+The purpose of this mechanism is to make it possible to handle Git
+objects, especially blobs, in much more flexible ways.
+
+Currently Git can store its objects only in the form of loose objects
+in separate files or packed objects in a pack file. These existing
+object stores cannot be easily optimized for many different kinds of
+content.
+
+So the current stores are not flexible enough for some important use
+cases like handling really big binary files or handling a really big
+number of files that are fetched only as needed. And it is not
+realistic to expect that Git could fully natively handle many such
+use cases. Git would need to natively implement different internal
+stores, which would be a huge burden and could lead to
+re-implementing things like HTTP servers, Docker registries or
+artifact stores that already exist outside Git.
+
+Furthermore, many improvements that depend on specific setups could
+be made to the way Git objects are managed if it were possible to
+customize how Git objects are handled. For example, a restartable
+clone using the bundle mechanism has often been requested, but
+implementing that would go against the strict rules under which Git
+objects are currently handled.
+
+What Git needs is a mechanism that makes it possible to customize, in
+many different ways, how Git objects are handled. This mechanism
+should, however, avoid as much as possible interfering with the usual
+way in which Git handles its objects.
+
+Helpers
+=======
+
+ODB helpers are commands that have to be registered using either the
+"odb.<odbname>.subprocessCommand" or the "odb.<odbname>.scriptCommand"
+config variables.
+
+Registering such a command tells Git that an external odb called
+<odbname> exists and that the registered command should be used to
+communicate with it.
+
+The communication happens through instructions that are sent by Git
+and that the commands should answer. If it makes sense, Git can send
+the same instruction to many commands in the order in which they are
+configured.
+
+There are 2 kinds of commands. Commands registered using the
+"odb.<odbname>.subprocessCommand" config variable are called "process
+commands" and the associated mode is called "process mode". Commands
+registered using the "odb.<odbname>.scriptCommand" config variable
+are called "script commands" and the associated mode is called "script
+mode".
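+
+As a rough sketch of such a registration: the odb names "magic" and
+"otherodb" and the helper paths below are hypothetical, and the
+"odb.*" variables are only understood by a Git built with this patch
+series. The commands are meant to be run inside a repository.

```shell
# Register a hypothetical script-mode helper for an external odb
# called "magic" (path and name are assumptions for illustration).
git config odb.magic.scriptCommand "/usr/local/bin/magic-odb-script"

# Register a hypothetical process-mode helper for another external
# odb called "otherodb".
git config odb.otherodb.subprocessCommand "/usr/local/bin/other-odb-process"
```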
+
+Early on, Git commands send an 'init' instruction to the registered
+commands. A capability negotiation takes place during this
+request/response exchange, which lets Git and the helpers know how
+they can further collaborate. The attribute system can also be used to
+tell Git which objects should be handled by which helper.
+
+Process Mode
+============
+
+In process mode the command is started as a single process invocation
+that should last for the entire life of the single Git command that
+started it.
+
+A packet format (pkt-line, see technical/protocol-common.txt) based
+protocol over standard input and standard output is used for
+communication between Git and the helper command.
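+
+A minimal sketch of the pkt-line framing as a shell helper might
+write it, assuming text packets end with a newline that is counted in
+the length prefix (as technical/protocol-common.txt describes). The
+function names 'pkt_line' and 'pkt_flush' are made up for this
+illustration:

```shell
# Write one text packet: 4 hex digits giving the total length
# (4 length bytes + payload + trailing LF), then the payload and LF.
# ${#1} counts characters, so this sketch assumes ASCII payloads.
pkt_line () {
	printf '%04x%s\n' "$(( ${#1} + 5 ))" "$1"
}

# Write a flush packet, which terminates a list of packets.
pkt_flush () {
	printf '0000'
}
```
+
+For example, 'pkt_line "version=1"' writes "000eversion=1" followed
+by a newline.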
+
+After the process command is started, Git sends a welcome message
+("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version
+number from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities
+that it supports and a flush packet. Git expects to read a list of
+desired capabilities, which must be a subset of the supported
+capabilities list, and a flush packet as response:
+
+------------------------
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get_raw_obj
+packet: git> capability=have
+packet: git> capability=put_raw_obj
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get_raw_obj
+packet: git< 0000
+------------------------
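+
+The capability selection in the exchange above could be sketched as
+follows on the helper side. This sketch works on already decoded
+packet payloads (one per line) rather than on raw pkt-lines, and the
+'select_caps' function and the 'supported' list are hypothetical:

```shell
# Capabilities this hypothetical helper implements, space separated.
supported="get_raw_obj have"

# Read the "capability=<name>" lines offered by Git on stdin and echo
# back only those the helper also supports, so that the reply is a
# subset of what Git advertised.
select_caps () {
	while IFS= read -r line
	do
		case "$line" in
		capability=*) cap=${line#capability=} ;;
		*) continue ;;
		esac
		case " $supported " in
		*" $cap "*) printf 'capability=%s\n' "$cap" ;;
		esac
	done
}
```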
+
+Afterwards Git sends a list of "key=value" pairs terminated with a
+flush packet. The list will contain at least the instruction (based on
+the supported capabilities) and the arguments for the
+instruction. Please note that the process must not send any response
+before it has received the final flush packet.
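+
+Reading such a "key=value" list could be sketched like this in a
+shell helper. The 'read_pkt' and 'read_instruction' names are
+hypothetical, only text packets are handled, and malformed length
+prefixes are not checked:

```shell
# Read one pkt-line from stdin into $pkt.
# Returns 0 for a data packet, 2 for a flush packet, 1 on EOF.
read_pkt () {
	pkt=
	len_hex=$(dd bs=1 count=4 2>/dev/null)
	[ "${#len_hex}" -eq 4 ] || return 1
	len=$((0x$len_hex))
	[ "$len" -ne 0 ] || return 2
	# Command substitution strips the trailing newline of a text packet.
	pkt=$(dd bs=1 count=$((len - 4)) 2>/dev/null)
}

# Print every "key=value" packet of one instruction, one per line,
# stopping at the flush packet that ends the list.
read_instruction () {
	while read_pkt
	do
		printf '%s\n' "$pkt"
	done
}
```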
+
+In general any response from the helper should end with a status
+packet. See the documentation of the 'get_*' instructions below for
+examples of status packets.
+
+After the helper has processed an instruction, it is expected to wait
+for the next "key=value" list containing another instruction.
+
+On exit Git will close the pipe to the helper. The helper is then
+expected to detect EOF and exit gracefully on its own. Git will wait
+until the process has stopped.
+
+Script Mode
+===========
+
+In this mode Git launches the script command each time it wants to
+communicate with the helper. There is no welcome message and no
+protocol version in this mode.
+
+The instruction and associated arguments are passed as arguments when
+launching the script command and if needed further information is
+passed between Git and the command through stdin and stdout.
+
+Capabilities/Instructions
+=========================
+
+The following instructions are currently supported by Git:
+
+- init
+- get_git_obj
+- get_raw_obj
+- get_direct
+- put_raw_obj
+- have
+
+The plan is to also support 'put_git_obj' and 'put_direct' soon, for
+consistency with the 'get_*' instructions.
+
+ - 'init'
+
+All the process and script commands must accept the 'init'
+instruction. It should be the first instruction sent to a command. It
+should not be advertised in the capability exchange. Any argument
+should be ignored.
+
+In process mode, after receiving the 'init' instruction and a flush
+packet, the helper should just send a status packet and then a flush
+packet. See the 'get_*' instructions below for examples of status
+packets.
+
+In script mode the command should print on stdout the capabilities
+that it supports if any. This is the only time in script mode when a
+capability exchange happens.
+
+For example a script command could use the following shell code
+snippet to handle the 'init' instruction:
+
+------------------------
+case "$1" in
+init)
+	echo "capability=get_git_obj"
+	echo "capability=put_raw_obj"
+	echo "capability=have"
+	;;
+------------------------
+
+ - 'get_git_obj <sha1>' and 'get_raw_obj <sha1>'
+
+These instructions should have a hexadecimal <sha1> argument to tell
+which object the helper should send to Git.
+
+In process mode the sha1 argument should be followed by a flush packet
+like this:
+
+------------------------
+packet: git> command=get_git_obj
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+------------------------
+
+After reading that, the helper should send the requested object to Git in a
+packet series followed by a flush packet. If the helper does not experience
+problems then the helper must send a "success" status like the following:
+
+------------------------
+packet: git< status=success
+packet: git< 0000
+------------------------
+
+In case the helper cannot or does not want to send the requested
+object as well as any other object for the lifetime of the Git
+process, then it is expected to respond with an "abort" status at any
+point in the protocol:
+
+------------------------
+packet: git< status=abort
+packet: git< 0000
+------------------------
+
+Git neither stops nor restarts the helper in case a
+"notfound"/"error"/"abort" status is set. An "error" status means a
+possibly more transient error than an abort. In response to a "get_*"
+instruction, the helper should send a "notfound" status when the
+requested object cannot be found.
+
+If the helper dies during the communication or does not adhere to the
+protocol then Git will stop and restart it with the next instruction.
+
+In script mode the helper should just send the requested object to Git
+by writing it to stdout and should then exit. The exit code should
+signal to Git whether a problem occurred.
+
+The only difference between 'get_git_obj' and 'get_raw_obj' is that in
+case of 'get_git_obj' the requested object should be sent as a Git
+object (that is in the same format as loose object files). In case of
+'get_raw_obj' the object should be sent in its raw format (that is the
+same output as `git cat-file <type> <sha1>`).
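+
+In script mode, 'get_raw_obj' could be handled along these lines. This
+is only a sketch: the store layout ("<store>/<type>/<sha1>" holding
+the raw content), the ODB_STORE variable and the function name are
+assumptions made for the illustration:

```shell
# Hypothetical store layout: $store/<type>/<sha1> holds the raw content.
store=${ODB_STORE:-./odb-store}

# Write the raw content of object $1 to stdout.
# Returns non-zero if the object is not in the store.
get_raw_obj () {
	for f in "$store"/*/"$1"
	do
		if [ -f "$f" ]
		then
			cat "$f"
			return
		fi
	done
	return 1
}
```
+
+A script command would call 'get_raw_obj "$2"' when "$1" is
+"get_raw_obj" and exit with the function's return status.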
+
+ - 'get_direct <sha1>'
+
+This instruction is similar to the other 'get_*' instructions except
+that no object should be sent from the helper to Git. Instead the
+helper should directly write the requested object into a loose object
+file in the ".git/objects" directory.
+
+After the helper has sent the "status=success" packet and the
+following flush packet in process mode, or after it has exited in the
+script mode, Git will look up the requested sha1 again in its loose
+object files and pack files.
+
+ - 'put_raw_obj <sha1> <size> <type>'
+
+This instruction should be followed by three arguments to tell which
+object the helper will receive from Git: <sha1>, <size> and
+<type>. The hexadecimal <sha1> argument describes the object that will
+be sent from Git to the helper. The <type> is the object type ("blob",
+"tree", "commit" or "tag") of this object. The <size> is the size in
+bytes of the (decompressed) object content.
+
+In process mode the last argument (the type) should be followed by a
+flush packet.
+
+After reading that, the helper should read the announced object from
+Git in a packet series followed by a flush packet.
+
+If the helper does not experience problems when receiving and storing
+or processing the object, then the helper must send a "success" status
+as described for the 'get_*' instructions.
+
+In script mode the helper should just receive the announced object
+from its standard input. After receiving and processing the object,
+the helper should exit and its exit code should signal to Git whether
+a problem occurred.
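+
+A script-mode 'put_raw_obj' handler could look roughly like the
+following. The store layout ("<store>/<type>/<sha1>"), the ODB_STORE
+variable and the function name are hypothetical, and the sketch only
+checks the announced size, not the sha1:

```shell
# Hypothetical store layout: $store/<type>/<sha1> holds the raw content.
store=${ODB_STORE:-./odb-store}

# Usage: put_raw_obj <sha1> <size> <type>, object content on stdin.
# Returns non-zero if the object could not be stored.
put_raw_obj () {
	sha1=$1 size=$2 type=$3
	tmp="$store/tmp.$$"
	mkdir -p "$store/$type" || return 1
	cat >"$tmp" || return 1
	# wc may pad its output; the arithmetic expansion normalizes it.
	actual=$(( $(wc -c <"$tmp") ))
	if [ "$actual" -ne "$size" ]
	then
		rm -f "$tmp"
		return 1
	fi
	mv "$tmp" "$store/$type/$sha1"
}
```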
+
+ - 'have'
+
+In process mode this instruction should be followed by a flush
+packet. After receiving this packet the helper should send the sha1,
+size and type, in this order, of all the objects it can provide to Git
+(through a 'get_*' instruction). There should be a space character
+between the sha1 and the size and between the size and the type, and
+then a new line character after the type.
+
+If many packets are needed to send back all this information, the
+split between packets should be made after the new line characters.
+
+If the helper does not experience problems, then it must send a
+"success" status as described for the 'get_*' instructions.
+
+In script mode the helper should send to its standard output the sha1,
+size and type, in this order, of all the objects it can provide to
+Git. There should also be a space character between the sha1 and the
+size and between the size and the type, and then a new line character
+after the type.
+
+After sending this, the script helper should exit and its exit code
+should signal to Git whether a problem occurred.
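+
+The script-mode 'have' output could be produced as sketched below,
+assuming a hypothetical store where each object's raw content lives
+in "<store>/<type>/<sha1>" (the ODB_STORE variable and this layout
+are assumptions, not part of the protocol):

```shell
# Hypothetical store layout: $store/<type>/<sha1> holds the raw content.
store=${ODB_STORE:-./odb-store}

# Print "<sha1> <size> <type>" for every object in the store,
# one object per line.
have () {
	for f in "$store"/*/*
	do
		[ -f "$f" ] || continue
		dir=${f%/*}
		printf '%s %s %s\n' "${f##*/}" "$(( $(wc -c <"$f") ))" "${dir##*/}"
	done
}
```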
+
+Order of instructions
+=====================
+
+For 'get_*' instructions the regular code to find objects is
+called before the odb helpers.
+
+For 'put_*' instructions the regular code to store the objects is
+called after the odb helpers.
+
+For now this order is not configurable.
+
+Object caching
+==============
+
+If a helper returns the object data as requested by get_git_obj or
+get_raw_obj, then Git will itself store the object locally in its
+regular object store, so it is redundant for the helper to also store
+or try to store the object in the regular object store.
+
+Admittedly, this seems to defeat the goal of enabling specialized object
+handlers to handle large or other "unusual" objects that git normally
+doesn't deal well with. So in the long run there should be a way to
+make this configurable.
+
+Selecting objects
+=================
+
+To select objects that should be handled by an external odb, one can
+use the git attributes system. For now this will only work with blobs
+and this will only work along with the 'put_raw_obj' instruction.
+
+For example if one has an external odb called "magic" and has
+registered an associated process command helper that supports the
+'put_raw_obj' instruction, then one can tell Git that all the .jpg
+files should be handled by the "magic" odb using a .gitattributes file
+that contains:
+
+------------------------
+*.jpg           odb=magic
+------------------------
+
-- 
2.14.1.576.g3f707d88cd

