
External ODBs
^^^^^^^^^^^^^

The External ODB mechanism makes it possible for Git objects, only
blobs for now though, to be stored in an "external object database"
(External ODB).

An External ODB can be any object store as long as there is a helper
program called an "odb helper" that can communicate with Git to
transfer objects to/from the external odb and to retrieve information
about available objects in the external odb.

Purpose
=======

The purpose of this mechanism is to make it possible to handle Git
objects, especially blobs, in much more flexible ways.

Currently Git can store its objects only in the form of loose objects
in separate files or packed objects in a pack file. These existing
object stores cannot be easily optimized for many different kinds of
content.

So the current stores are not flexible enough for some important use
cases like handling really big binary files or handling a really big
number of files that are fetched only as needed. And it is not
realistic to expect that Git could fully natively handle many such
use cases. Git would need to natively implement different internal
stores, which would be a huge burden and which could lead to
re-implementing things like HTTP servers, Docker registries or artifact
stores that already exist outside Git.

Furthermore many improvements that are dependent on specific setups
could be implemented in the way Git objects are managed if it were
possible to customize how the Git objects are handled. For example a
restartable clone using the bundle mechanism has often been requested,
but implementing that would go against the strict rules under which
the Git objects are currently handled.

What Git needs is a mechanism that makes it possible to customize in a
lot of different ways how the Git objects are handled. This mechanism
should, however, avoid as much as possible interfering with the usual
way in which Git handles its objects.

Helpers
=======

ODB helpers are commands that have to be registered using either the
"odb.<odbname>.subprocessCommand" or the "odb.<odbname>.scriptCommand"
config variables.

Registering such a command tells Git that an external odb called
<odbname> exists and that the registered command should be used to
communicate with it.

The communication happens through instructions that are sent by Git
and that the commands should answer. If it makes sense, Git can send
the same instruction to many commands in the order in which they are
configured.

There are two kinds of commands. Commands registered using the
"odb.<odbname>.subprocessCommand" config variable are called "process
commands" and the associated mode is called "process mode". Commands
registered using the "odb.<odbname>.scriptCommand" config variable
are called "script commands" and the associated mode is called "script
mode".

Early on, Git commands send an 'init' instruction to the registered
commands. A capability negotiation will take place during this
request/response exchange which will let Git and the helpers know how
they can further collaborate. The attribute system can also be used to
tell Git which objects should be handled by which helper.

Process Mode
============

In process mode the command is started as a single process invocation
that should last for the entire life of the single Git command that
started it.

A protocol based on the pkt-line packet format (see
technical/protocol-common.txt), running over standard input and
standard output, is used for communication between Git and the helper
command.

After the process command is started, Git sends a welcome message
("git-read-object-client"), a list of supported protocol version
numbers, and a flush packet. Git expects to read a welcome response
message ("git-read-object-server"), exactly one protocol version
number from the previously sent list, and a flush packet. All further
communication will be based on the selected version.

The remaining protocol description below documents "version=1". Please
note that "version=42" in the example below does not exist and is only
there to illustrate how the protocol would look with more than one
version.

After the version negotiation Git sends a list of all capabilities
that it supports and a flush packet. Git expects to read a list of
desired capabilities, which must be a subset of the supported
capabilities list, and a flush packet as response:

------------------------
packet: git> git-read-object-client
packet: git> version=1
packet: git> version=42
packet: git> 0000
packet: git< git-read-object-server
packet: git< version=1
packet: git< 0000
packet: git> capability=get_raw_obj
packet: git> capability=have
packet: git> capability=put_raw_obj
packet: git> capability=not-yet-invented
packet: git> 0000
packet: git< capability=get_raw_obj
packet: git< 0000
------------------------
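
The framing used in the exchange above can be sketched in shell. This
is only a minimal illustration, assuming the pkt-line conventions from
technical/protocol-common.txt: four hexadecimal digits giving the
total packet length, a trailing LF on text lines, and "0000" as the
flush packet. The function names `packet_txt_write`, `packet_flush`
and `send_welcome_response` are hypothetical and not part of Git.

```shell
# Write one text pkt-line: four hex digits holding the total length
# (4-byte header + payload + trailing LF), then the payload and the LF.
# Assumes an ASCII payload, so that ${#1} equals its length in bytes.
packet_txt_write () {
	printf '%04x%s\n' "$(( ${#1} + 5 ))" "$1"
}

# Write a flush packet.
packet_flush () {
	printf '0000'
}

# Helper side of the version negotiation shown above.
send_welcome_response () {
	packet_txt_write "git-read-object-server" &&
	packet_txt_write "version=1" &&
	packet_flush
}
```

With these definitions, `packet_txt_write "version=1"` writes
"000eversion=1" followed by an LF.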

Afterwards Git sends a list of "key=value" pairs terminated with a
flush packet. The list will contain at least the instruction (based on
the supported capabilities) and the arguments for the
instruction. Please note that the process must not send any response
before it has received the final flush packet.

In general any response from the helper should end with a status
packet. See the documentation of the 'get_*' instructions below for
examples of status packets.

After the helper has processed an instruction, it is expected to wait
for the next "key=value" list containing another instruction.
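
Reading such a "key=value" list up to its flush packet could look
like the following sketch. It assumes text packets framed as described
in technical/protocol-common.txt and reads byte by byte with `dd` so
that nothing beyond the current packet is consumed. The function names
`packet_txt_read` and `read_request` are hypothetical.

```shell
# Read one pkt-line from stdin and print its payload.
# Returns 1 on a flush packet (or any packet without payload) and
# 2 on EOF or a truncated length header.
packet_txt_read () {
	len_hex=$(dd bs=1 count=4 2>/dev/null)
	[ "${#len_hex}" -eq 4 ] || return 2
	len=$(( 0x$len_hex ))
	[ "$len" -gt 4 ] || return 1
	dd bs=1 count="$(( len - 4 ))" 2>/dev/null
}

# Print the "key=value" lines of one request, one per line,
# stopping at the flush packet that terminates the list.
read_request () {
	while line=$(packet_txt_read)
	do
		printf '%s\n' "$line"
	done
}
```

A helper would only start answering once `read_request` has returned,
which matches the rule that no response may be sent before the final
flush packet has been received.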

On exit Git will close the pipe to the helper. The helper is then
expected to detect EOF and exit gracefully on its own. Git will wait
until the process has stopped.

Script Mode
===========

In this mode Git launches the script command each time it wants to
communicate with the helper. There is no welcome message and no
protocol version in this mode.

The instruction and associated arguments are passed as arguments when
launching the script command and if needed further information is
passed between Git and the command through stdin and stdout.

Capabilities/Instructions
=========================

The following instructions are currently supported by Git:

- init
- get_git_obj
- get_raw_obj
- get_direct
- put_raw_obj
- have

The plan is to also support 'put_git_obj' and 'put_direct' soon, for
consistency with the 'get_*' instructions.

 - 'init'

All the process and script commands must accept the 'init'
instruction. It should be the first instruction sent to a command. It
should not be advertised in the capability exchange. Any argument
should be ignored.

In process mode, after receiving the 'init' instruction and a flush
packet, the helper should just send a status packet and then a flush
packet. See the 'get_*' instructions below for examples of status
packets.

In script mode the command should print on stdout the capabilities
that it supports if any. This is the only time in script mode when a
capability exchange happens.

For example a script command could use the following shell code
snippet to handle the 'init' instruction:

------------------------
case "$1" in
init)
	# Advertise the capabilities supported by this helper.
	echo "capability=get_git_obj"
	echo "capability=put_raw_obj"
	echo "capability=have"
	;;
esac
------------------------

 - 'get_git_obj <sha1>' and 'get_raw_obj <sha1>'

These instructions should have a hexadecimal <sha1> argument to tell
which object the helper should send to Git.

In process mode the sha1 argument should be followed by a flush packet
like this:

------------------------
packet: git> command=get_git_obj
packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
packet: git> 0000
------------------------

After reading that, the helper should send the requested object to Git in a
packet series followed by a flush packet. If the helper does not experience
problems, then the helper must send a "success" status like the following:

------------------------
packet: git< status=success
packet: git< 0000
------------------------

If the helper cannot or does not want to send the requested object, or
any other object, for the lifetime of the Git process, then it is
expected to respond with an "abort" status at any point in the
protocol:

------------------------
packet: git< status=abort
packet: git< 0000
------------------------

Git neither stops nor restarts the helper in case a
"notfound"/"error"/"abort" status is set. An "error" status means a
possibly more transient error than an abort. In response to a "get_*"
instruction, the helper should send a "notfound" status if the
requested object cannot be found.

If the helper dies during the communication or does not adhere to the
protocol then Git will stop and restart it with the next instruction.
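
The status packets described above could be produced by a small shell
function such as the one below. This is a sketch under the same
pkt-line assumptions as the examples in this document (length header
counting a trailing LF, "0000" as flush packet); the function name
`send_status` is hypothetical.

```shell
# Send a status packet followed by a flush packet.
# Only the four status values named in this document are accepted.
send_status () {
	case "$1" in
	success|notfound|error|abort)
		msg="status=$1"
		# Total length: 4-byte header + message + trailing LF.
		printf '%04x%s\n0000' "$(( ${#msg} + 5 ))" "$msg"
		;;
	*)
		echo "unknown status: $1" >&2
		return 1
		;;
	esac
}
```

For example `send_status success` writes "0013status=success", an LF
and then "0000".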

In script mode the helper should just send the requested object to Git
by writing it to stdout and should then exit. The exit code should
signal to Git if a problem occurred or not.

The only difference between 'get_git_obj' and 'get_raw_obj' is that in
case of 'get_git_obj' the requested object should be sent as a Git
object (that is in the same format as loose object files). In case of
'get_raw_obj' the object should be sent in its raw format (that is the
same output as `git cat-file <type> <sha1>`).
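
In script mode a 'get_raw_obj' handler can be very small. The sketch
below assumes a purely hypothetical store layout, a directory named by
an `ODB_STORE` variable that holds one file per object, named after
its sha1 and containing the raw object content. Neither `ODB_STORE`
nor `odb_get_raw_obj` is defined by Git.

```shell
# Hypothetical store: one file per object, named after its sha1.
ODB_STORE=${ODB_STORE:-/tmp/external-odb}

# Handle 'get_raw_obj <sha1>': write the raw object content to stdout.
# A non-zero return value is meant to become the helper's exit code.
odb_get_raw_obj () {
	[ -f "$ODB_STORE/$1" ] || return 1
	cat "$ODB_STORE/$1"
}
```

A script command would typically call `odb_get_raw_obj "$2"` from a
`get_raw_obj)` branch of its `case "$1"` statement and exit with the
resulting status.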

 - 'get_direct <sha1>'

This instruction is similar to the other 'get_*' instructions except
that no object should be sent from the helper to Git. Instead the
helper should directly write the requested object into a loose object
file in the ".git/objects" directory.

After the helper has sent the "status=success" packet and the
following flush packet in process mode, or after it has exited in
script mode, Git will look up the requested sha1 again in its loose
object files and pack files.
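
A helper implementing 'get_direct' needs to know where the loose
object file for a given sha1 lives. Loose objects are stored under a
subdirectory named after the first two hexadecimal digits of the sha1,
in a file named after the remaining 38 digits. The sketch below only
computes that path; the function name `loose_object_path` is
hypothetical, and writing the object itself could for example be left
to `git hash-object -w`.

```shell
# Print the loose object file path for a sha1 inside a Git directory.
# $1 is the 40-character hexadecimal sha1, $2 the Git directory
# (".git" if omitted).
loose_object_path () {
	rest=${1#??}
	printf '%s/objects/%s/%s\n' "${2:-.git}" "${1%"$rest"}" "$rest"
}
```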

 - 'put_raw_obj <sha1> <size> <type>'

This instruction should be followed by three arguments to tell which
object the helper will receive from Git: <sha1>, <size> and
<type>. The hexadecimal <sha1> argument describes the object that will
be sent from Git to the helper. The <type> is the object type ("blob",
"tree", "commit" or "tag") of this object. The <size> is the size in
bytes of the (decompressed) object content.

In process mode the last argument (the type) should be followed by a
flush packet.

After reading that, the helper should read the announced object from
Git in a packet series followed by a flush packet.

If the helper does not experience problems when receiving and storing
or processing the object, then the helper must send a "success" status
as described for the 'get_*' instructions.

In script mode the helper should just receive the announced object
from its standard input. After receiving and processing the object,
the helper should exit and its exit code should signal to Git if a
problem occurred or not.
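
A script mode 'put_raw_obj' handler could be sketched as follows. It
assumes the same hypothetical store layout as a directory named by an
`ODB_STORE` variable with one file per object, and it uses the
announced <size> as a basic sanity check. The <type> argument is
accepted but not recorded in this sketch. `ODB_STORE` and
`odb_put_raw_obj` are illustrative names only.

```shell
# Hypothetical store: one file per object, named after its sha1.
ODB_STORE=${ODB_STORE:-/tmp/external-odb}

# Handle 'put_raw_obj <sha1> <size> <type>': read the object from stdin
# and store it, failing if the byte count differs from the announced size.
odb_put_raw_obj () {
	sha1=$1 size=$2
	mkdir -p "$ODB_STORE" &&
	cat >"$ODB_STORE/$sha1.tmp" || return 1
	# The arithmetic expansion strips padding that some wc versions add.
	actual=$(( $(wc -c <"$ODB_STORE/$sha1.tmp") ))
	if [ "$actual" -ne "$size" ]
	then
		rm -f "$ODB_STORE/$sha1.tmp"
		return 1
	fi
	mv "$ODB_STORE/$sha1.tmp" "$ODB_STORE/$sha1"
}
```

Writing to a temporary file first means that a partially received
object never appears under its final name.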

 - 'have'

In process mode this instruction should be followed by a flush
packet. After receiving this packet the helper should send the sha1,
size and type, in this order, of all the objects it can provide to Git
(through a 'get_*' instruction). There should be a space character
between the sha1 and the size and between the size and the type, and
then a new line character after the type.

If many packets are needed to send back all this information, the
split between packets should be made after the new line characters.

If the helper does not experience problems, then it must send a
"success" status as described for the 'get_*' instructions.

In script mode the helper should send to its standard output the sha1,
size and type, in this order, of all the objects it can provide to
Git. There should also be a space character between the sha1 and the
size and between the size and the type, and then a new line character
after the type.

After sending this, the script helper should exit and its exit code
should signal to Git if a problem occurred or not.
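
The output format of 'have' in script mode can be illustrated with
the following sketch. It assumes a hypothetical store directory named
by an `ODB_STORE` variable, holding one file per object named after
its sha1, and it reports every object as a blob since only blobs are
supported for now. `ODB_STORE` and `odb_have` are illustrative names.

```shell
# Hypothetical store: one file per object, named after its sha1.
ODB_STORE=${ODB_STORE:-/tmp/external-odb}

# Handle 'have': print "<sha1> <size> <type>" for every stored object,
# one object per line.
odb_have () {
	for file in "$ODB_STORE"/*
	do
		[ -f "$file" ] || continue
		printf '%s %s blob\n' "${file##*/}" "$(( $(wc -c <"$file") ))"
	done
}
```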

Order of instructions
=====================

For 'get_*' instructions the regular code to find objects is
called before the odb helpers.

For 'put_*' instructions the regular code to store the objects is
called after the odb helpers.

For now this order is not configurable.

Object caching
==============

If a helper returns the object data as requested by get_git_obj or
get_raw_obj, then Git will itself store the object locally in its
regular object store, so it is redundant for the helper to also store
or try to store the object in the regular object store.

This seems to defeat the goal of enabling specialized object
handlers to handle large or other "unusual" objects that Git normally
doesn't deal well with. So in the long run there should be a way to
make this configurable.

Selecting objects
=================

To select objects that should be handled by an external odb, one can
use the git attributes system. For now this will only work with blobs
and this will only work along with the 'put_raw_obj' instruction.

For example if one has an external odb called "magic" and has
registered an associated process command helper that supports the
'put_raw_obj' instruction, then one can tell Git that all the .jpg
files should be handled by the "magic" odb using a .gitattributes file
that contains:

------------------------
*.jpg           odb=magic
------------------------

