Re: [PATCH 15/40] external-odb: add script mode support

From: Christian Couder <christian.couder@gmail.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Ben Peart <Ben.Peart@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	Mike Hommey <mh@glandium.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	Eric Wong <e@80x24.org>,
	Christian Couder <chriscool@tuxfamily.org>,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH 15/40] external-odb: add script mode support
Date: Mon, 19 Mar 2018 14:15:15 +0100
Message-ID: <CAP8UFD0DNKVwz4D+s61+QMvtwcA3nomy0wnnfbAnSA4prnBbxg@mail.gmail.com>
In-Reply-To: <ebf67bcc-3e17-3fda-9f56-dd152e7bf3af@jeffhostetler.com>

On Thu, Jan 4, 2018 at 8:55 PM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>
> On 1/3/2018 11:33 AM, Christian Couder wrote:
>>
>> diff --git a/odb-helper.c b/odb-helper.c
>> index 4b70b287af..c1a3443dc7 100644
>> --- a/odb-helper.c
>> +++ b/odb-helper.c
>> @@ -21,13 +21,124 @@ struct odb_helper_cmd {
>>         struct child_process child;
>>   };
>>   +/*
>> + * Callers are responsible to ensure that the result of vaddf(fmt, ap)
>> + * is properly shell-quoted.
>> + */
>> +static void prepare_helper_command(struct argv_array *argv, const char *cmd,
>> +                                  const char *fmt, va_list ap)
>> +{
>> +       struct strbuf buf = STRBUF_INIT;
>> +
>> +       strbuf_addstr(&buf, cmd);
>> +       strbuf_addch(&buf, ' ');
>> +       strbuf_vaddf(&buf, fmt, ap);
>
> I do find this a bit odd that you're putting the cmd, a space,
> and the printf results into a single arg in the argv, rather than
> directly loading up the argv.
>
> Is there an issue with the whitespace between the cmd and the
> printf result being in the same arg -- especially if there are
> quoting issues in the fmt string as you mention in the comment
> above?  (not sure, just asking)

This was discussed with Junio here:

https://public-inbox.org/git/xmqqmvggbl6m.fsf@gitster.mtv.corp.google.com/

I agree that I should take another look at it though. I will do that
in the next version after the one I will send really soon now.
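
To make the quoting concern concrete, here is a minimal standalone sketch of
what "properly shell-quoted" can mean when the command, a space, and the
formatted arguments end up in one string that a shell interprets. It assumes
the helper command is run through the shell, as the comment in the patch
suggests. The names sh_quote() and build_helper_command() are hypothetical and
are not part of the patch or of Git.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd copy of `arg` wrapped in single
 * quotes for a POSIX shell. Each embedded ' is written as '\'' (close the
 * quote, emit an escaped quote, reopen the quote).
 */
static char *sh_quote(const char *arg)
{
	size_t len = 2; /* opening and closing quote */
	const char *p;
	char *out, *q;

	for (p = arg; *p; p++)
		len += (*p == '\'') ? 4 : 1;

	out = malloc(len + 1);
	if (!out)
		return NULL;

	q = out;
	*q++ = '\'';
	for (p = arg; *p; p++) {
		if (*p == '\'') {
			memcpy(q, "'\\''", 4);
			q += 4;
		} else {
			*q++ = *p;
		}
	}
	*q++ = '\'';
	*q = '\0';
	return out;
}

/*
 * Hypothetical: build "<cmd> <instruction> <quoted arg>" as one string.
 * `cmd` is deliberately left unquoted so that it can itself be a shell
 * snippet with options; only the argument is quoted.
 */
static char *build_helper_command(const char *cmd, const char *instruction,
				  const char *arg)
{
	char *quoted = sh_quote(arg);
	size_t len;
	char *out;

	if (!quoted)
		return NULL;
	len = strlen(cmd) + 1 + strlen(instruction) + 1 + strlen(quoted) + 1;
	out = malloc(len);
	if (out)
		snprintf(out, len, "%s %s %s", cmd, instruction, quoted);
	free(quoted);
	return out;
}
```

With a hex object name as the argument, quoting is a no-op in practice, which
is why the single-string approach works for "get_git_obj"; the sketch only
shows what a caller would have to do for arbitrary arguments.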

>> +static int parse_object_line(struct odb_helper_object *o, const char *line)
>> +{
>
> Is there a reason to order the fields this way?  In the test
> at the bottom of this commit, you take cat-file output and
> re-order the columns with awk.   I'm just wondering if we kept
> cat-file ordering in the format here, we could simplify things.

Yeah, keeping the cat-file ordering might simplify the shell script
without making the C code more complex. I don't remember if the current
ordering was in the original version from Peff or if there was a reason
to do it this way. I will take a look at that in the next version after
the one I will send really soon now.
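
For reference, a minimal standalone sketch of a parser that accepts the
default "git cat-file --batch-check" ordering, "<sha1> <type> <size>", instead
of the "<sha1> <size> <type>" ordering used by parse_object_line() in the
patch. The struct and function names here are hypothetical, and the type is
kept as a plain string rather than going through type_from_string().

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define HEX_LEN 40
#define TYPE_MAX 16

/* Hypothetical stand-in for struct odb_helper_object. */
struct have_line {
	char hex[HEX_LEN + 1];
	char type[TYPE_MAX];
	unsigned long size;
};

/* Parse "<40 hex digits> <type> <size>"; return 0 on success, -1 on error. */
static int parse_have_line_catfile_order(struct have_line *out,
					 const char *line)
{
	const char *type_end;
	char *end;
	size_t i, type_len;

	/* A short line hits its NUL terminator here and is rejected. */
	for (i = 0; i < HEX_LEN; i++)
		if (!isxdigit((unsigned char)line[i]))
			return -1;
	memcpy(out->hex, line, HEX_LEN);
	out->hex[HEX_LEN] = '\0';

	line += HEX_LEN;
	if (*line++ != ' ')
		return -1;

	type_end = strchr(line, ' ');
	if (!type_end)
		return -1;
	type_len = (size_t)(type_end - line);
	if (type_len == 0 || type_len >= TYPE_MAX)
		return -1;
	memcpy(out->type, line, type_len);
	out->type[type_len] = '\0';

	line = type_end + 1;
	if (!isdigit((unsigned char)*line))
		return -1;
	out->size = strtoul(line, &end, 10);
	if (*end != '\0')
		return -1;
	return 0;
}
```

With this ordering, a test helper could pass cat-file output through without
reordering the columns with awk.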

>> +       char *end;
>> +       if (get_sha1_hex(line, o->sha1) < 0)
>> +               return -1;
>> +
>> +       line += 40;
>> +       if (*line++ != ' ')
>> +               return -1;
>> +
>> +       o->size = strtoul(line, &end, 10);
>> +       if (line == end || *end++ != ' ')
>> +               return -1;
>> +
>> +       o->type = type_from_string(end);
>> +       return 0;
>> +}
>> +
>> +static int add_have_entry(struct odb_helper *o, const char *line)
>> +{
>> +       ALLOC_GROW(o->have, o->have_nr+1, o->have_alloc);
>
> I didn't see where o->have is initially allocated.  The default is
> to start with 64 and then grow by 3/2 as needed.  If we are getting
> lots of objects here, we'll have lots of reallocs slowing things down.
> It would be better to seed it somewhere to a large value.

Yeah, but using this is optional: it depends on the cap_have
capability. And I think that if there is a really huge number of
objects in the external odb, it might be better to just not use
cap_have.

Another possibility would be for the helper to send the number of
objects it has first, so that we could alloc the right number before
receiving the haves, but this could require the helper to be more
complex.
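
A rough standalone sketch of that second possibility, assuming a hypothetical
protocol where the helper prints the number of objects on the first line and
then one "have" line per object. Nothing like this exists in the patch; the
struct, the function name, and the wire format are all made up for
illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table holding only the hex object names. */
struct have_table {
	char (*hex)[41];
	size_t nr, alloc;
};

/*
 * Read "<count>\n" followed by exactly <count> lines that each start with
 * a 40-character hex object name. The table is allocated once, up front.
 * Returns 0 on success, -1 on malformed input or allocation failure.
 */
static int load_haves_with_count(struct have_table *t, FILE *in)
{
	char line[256];
	unsigned long count;
	char *end;

	t->hex = NULL;
	t->nr = t->alloc = 0;

	if (!fgets(line, sizeof(line), in))
		return -1;
	count = strtoul(line, &end, 10);
	if (end == line || (*end != '\n' && *end != '\0'))
		return -1;
	if (count > SIZE_MAX / sizeof(*t->hex))
		return -1;

	if (count) {
		t->hex = malloc(count * sizeof(*t->hex));
		if (!t->hex)
			return -1;
	}
	t->alloc = count;

	while (fgets(line, sizeof(line), in)) {
		/* The helper sent more lines than it announced. */
		if (t->nr == t->alloc)
			goto fail;
		if (strlen(line) < 40)
			goto fail;
		memcpy(t->hex[t->nr], line, 40);
		t->hex[t->nr][40] = '\0';
		t->nr++;
	}
	/* The helper sent fewer lines than it announced. */
	if (t->nr != t->alloc)
		goto fail;
	return 0;

fail:
	free(t->hex);
	t->hex = NULL;
	t->nr = t->alloc = 0;
	return -1;
}
```

The cost is visible even in the sketch: the helper has to know its object
count before printing anything, and the reader has to handle a count that
does not match what follows.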

>> +       if (parse_object_line(&o->have[o->have_nr], line) < 0) {
>> +               warning("bad 'have' input from odb helper '%s': %s",
>> +                       o->name, line);
>> +               return 1;
>> +       }
>> +       o->have_nr++;
>> +       return 0;
>> +}
>> +
>> +static int odb_helper_object_cmp(const void *va, const void *vb)
>> +{
>> +       const struct odb_helper_object *a = va, *b = vb;
>> +       return hashcmp(a->sha1, b->sha1);
>> +}
>> +
>>   static void odb_helper_load_have(struct odb_helper *o)
>>   {
>> +       struct odb_helper_cmd cmd;
>> +       FILE *fh;
>> +       struct strbuf line = STRBUF_INIT;
>> +
>>         if (o->have_valid)
>>                 return;
>>         o->have_valid = 1;
>>   -     /* TODO */
>> +       if (odb_helper_start(o, &cmd, "have") < 0)
>> +               return;
>> +
>> +       fh = xfdopen(cmd.child.out, "r");
>> +       while (strbuf_getline(&line, fh) != EOF)
>> +               if (add_have_entry(o, line.buf))
>> +                       break;
>> +
>> +       strbuf_release(&line);
>> +       fclose(fh);
>> +       odb_helper_finish(o, &cmd);
>> +
>> +       qsort(o->have, o->have_nr, sizeof(*o->have), odb_helper_object_cmp);
>>   }
>
> Help me understand this.  I originally thought that the "have"
> command would ask about one or more specific OIDs, but after a
> couple of readings it looks like the "have" command is getting the
> *complete* list of objects known to this external ODB and building
> a sorted array of them.  And then we do this for each external ODB
> configured.
>
> If this is the case, I'm concerned that this will have scale problems.
> "git cat-file..." shows that even my little git.git repo has 360K
> objects.  And "time git cat-file..." takes over 1.1 seconds.

Yeah, but originally external odbs were not supposed to contain a huge
number of objects. And now that cap_have is optional, it should
probably not be used if indeed the external odb contains a huge number
of objects.
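
To illustrate what the "have" table buys once it is loaded, here is a small
standalone sketch of the sort-then-search pattern: the entries are sorted once
by object name, after which each lookup is a binary search. Plain memcmp() and
bsearch() stand in for hashcmp() and for whatever lookup odb_helper_lookup()
really uses; the struct and function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct odb_helper_object. */
struct have_entry {
	unsigned char sha1[20];
	unsigned long size;
};

/* Order entries by raw object name, like odb_helper_object_cmp(). */
static int have_entry_cmp(const void *va, const void *vb)
{
	const struct have_entry *a = va, *b = vb;
	return memcmp(a->sha1, b->sha1, 20);
}

/* Sort the whole table once after all "have" lines were read. */
static void sort_haves(struct have_entry *have, size_t nr)
{
	qsort(have, nr, sizeof(*have), have_entry_cmp);
}

/* Binary search in the sorted table; returns NULL when not found. */
static struct have_entry *lookup_have(struct have_entry *have, size_t nr,
				      const unsigned char *sha1)
{
	struct have_entry key;

	memcpy(key.sha1, sha1, 20);
	key.size = 0;
	return bsearch(&key, have, nr, sizeof(*have), have_entry_cmp);
}
```

Lookups are cheap after the sort, but the table still has to be filled with
every object the helper knows about, which is the scaling cost discussed
above.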

> I'm wondering if there is a better/different way to do this.
> (Sorry if you've already covered this in earlier versions of this
> patch series and I missed it.)
>
> I'm wondering about having "struct odb_helper" maintain a long-running
> connection to the sub-command, which would let git ask for the object
> and either get it back (as you have below) or get a not-found error.
> The sub-command would then wait for another get-object request over
> stdin.
>
> I say this because whatever operation I'm doing (like a checkout or
> log or blame) is only going to need a small percentage of the total
> set of objects.  I think it would be more efficient to try/retry to
> fault them in as-needed via one or more external helpers than to build
> these tables in advance.

This is possible when using the "process mode" and not using cap_have.
I reused Ben Peart's work to make that possible by the way.

>>   static const unsigned char *have_sha1_access(size_t index, void *table)
>> @@ -53,6 +164,111 @@ int odb_helper_has_object(struct odb_helper *o, const unsigned char *sha1)
>>         return !!odb_helper_lookup(o, sha1);
>>   }
>> +int odb_helper_get_object(struct odb_helper *o, const unsigned char *sha1,
>> +                           int fd)
>> +{
>> +       struct odb_helper_object *obj;
>> +       struct odb_helper_cmd cmd;
>> +       unsigned long total_got;
>> +       git_zstream stream;
>> +       int zret = Z_STREAM_END;
>> +       git_SHA_CTX hash;
>> +       unsigned char real_sha1[20];
>> +       struct strbuf header = STRBUF_INIT;
>> +       unsigned long hdr_size;
>> +
>> +       obj = odb_helper_lookup(o, sha1);
>> +       if (!obj)
>> +               return -1;
>> +
>> +       if (odb_helper_start(o, &cmd, "get_git_obj %s", sha1_to_hex(sha1)) < 0)
>> +               return -1;
>> +
>> +       memset(&stream, 0, sizeof(stream));
>> +       git_inflate_init(&stream);
>> +       git_SHA1_Init(&hash);
>> +       total_got = 0;
>> +
>> +       for (;;) {
>> +               unsigned char buf[4096];
>> +               int r;
>> +
>> +               r = xread(cmd.child.out, buf, sizeof(buf));
>> +               if (r < 0) {
>> +                       error("unable to read from odb helper '%s': %s",
>> +                             o->name, strerror(errno));
>> +                       close(cmd.child.out);
>> +                       odb_helper_finish(o, &cmd);
>> +                       git_inflate_end(&stream);
>> +                       return -1;
>> +               }
>> +               if (r == 0)
>> +                       break;
>> +
>> +               write_or_die(fd, buf, r);
>> +
>> +               stream.next_in = buf;
>> +               stream.avail_in = r;
>> +               do {
>> +                       unsigned char inflated[4096];
>> +                       unsigned long got;
>> +
>> +                       stream.next_out = inflated;
>> +                       stream.avail_out = sizeof(inflated);
>> +                       zret = git_inflate(&stream, Z_SYNC_FLUSH);
>> +                       got = sizeof(inflated) - stream.avail_out;
>> +
>> +                       git_SHA1_Update(&hash, inflated, got);
>> +                       /* skip header when counting size */
>> +                       if (!total_got) {
>> +                               const unsigned char *p = memchr(inflated, '\0', got);
>> +                               if (p) {
>> +                                       unsigned long hdr_last = p - inflated + 1;
>> +                                       strbuf_add(&header, inflated, hdr_last);
>> +                                       got -= hdr_last;
>> +                               } else {
>> +                                       strbuf_add(&header, inflated, got);
>> +                                       got = 0;
>> +                               }
>> +                       }
>> +                       total_got += got;
>> +               } while (stream.avail_in && zret == Z_OK);
>> +       }
>> +
>> +       close(cmd.child.out);
>> +       git_inflate_end(&stream);
>> +       git_SHA1_Final(real_sha1, &hash);
>> +       if (odb_helper_finish(o, &cmd))
>> +               return -1;
>> +       if (zret != Z_STREAM_END) {
>> +               warning("bad zlib data from odb helper '%s' for %s",
>> +                       o->name, sha1_to_hex(sha1));
>> +               return -1;
>> +       }
>> +       if (total_got != obj->size) {
>> +               warning("size mismatch from odb helper '%s' for %s (%lu != %lu)",
>> +                       o->name, sha1_to_hex(sha1), total_got, obj->size);
>> +               return -1;
>> +       }
>> +       if (hashcmp(real_sha1, sha1)) {
>> +               warning("sha1 mismatch from odb helper '%s' for %s (got %s)",
>> +                       o->name, sha1_to_hex(sha1), sha1_to_hex(real_sha1));
>> +               return -1;
>> +       }
>> +       if (parse_sha1_header(header.buf, &hdr_size) < 0) {
>> +               warning("could not parse header from odb helper '%s' for %s",
>> +                       o->name, sha1_to_hex(sha1));
>> +               return -1;
>> +       }
>> +       if (total_got != hdr_size) {
>> +               warning("size mismatch from odb helper '%s' for %s (%lu != %lu)",
>> +                       o->name, sha1_to_hex(sha1), total_got, hdr_size);
>> +               return -1;
>> +       }
>
> Does "strbuf header" need to be released before all of the returns?

Yeah, thanks for noticing. This is fixed in the version I will send
really soon now.
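
One common way to avoid this kind of leak is a single cleanup label that every
exit path goes through. The standalone sketch below shows the pattern with a
plain malloc'd buffer standing in for the strbuf; it is not the actual fix,
and check_object() and the live_buffers counter are hypothetical names that
exist only to make the example checkable.

```c
#include <stdlib.h>
#include <string.h>

/* Demonstration only: number of buffers currently allocated. */
static int live_buffers;

/*
 * Parse a "<type> <size>" header and compare the size with the expected
 * one. Every return after the allocation goes through `cleanup`, so the
 * buffer is released on success and on each error path alike.
 */
static int check_object(const char *header, unsigned long expected_size)
{
	char *copy, *end;
	const char *sp;
	unsigned long size;
	int ret = -1;

	copy = malloc(strlen(header) + 1);
	if (!copy)
		return -1;
	strcpy(copy, header);
	live_buffers++;

	sp = strchr(copy, ' ');
	if (!sp)
		goto cleanup;
	size = strtoul(sp + 1, &end, 10);
	if (end == sp + 1 || *end != '\0')
		goto cleanup;
	if (size != expected_size)
		goto cleanup;
	ret = 0;

cleanup:
	free(copy);
	live_buffers--;
	return ret;
}
```

In odb_helper_get_object() the equivalent would be one strbuf_release(&header)
placed before a shared return, rather than one call before each "return -1".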

>> +
>> +       return 0;
>> +}
>
> If I understand this function, it is receiving an object from the odb
> helper sub-command and creating a loose object for it in the local odb.
> The object is compressed on the wire and this function unzips it to
> verify the object is completely received and without errors.
>
> But this function does not keep the resulting object around in memory
> (which might not fit anyway).  I'm assuming that the higher-layer that
> needs the object will just be told to read the new loose object.

Yeah, that's how it works. I think I took most of the code and the way
this function works from the work Ben Peart had sent to the mailing
list.
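
The size check in that function depends on counting only the payload bytes of
the inflated stream, that is, everything after the NUL that ends the
"<type> <size>" header. Here is a small standalone sketch of that accounting
over successive chunks. It tracks the end of the header with an explicit flag,
which is a choice made for this sketch and not how the patch does it, and all
names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical running state across inflated chunks. */
struct payload_counter {
	int header_done;       /* set once the header's NUL has been seen */
	unsigned long payload; /* bytes counted after the header */
};

/* Account for one inflated chunk, skipping header bytes. */
static void count_chunk(struct payload_counter *c,
			const unsigned char *buf, size_t len)
{
	if (!c->header_done) {
		const unsigned char *nul = memchr(buf, '\0', len);

		if (!nul)
			return; /* the whole chunk is still header */
		c->header_done = 1;
		len -= (size_t)(nul - buf) + 1;
	}
	c->payload += len;
}
```

For an object "blob 5\0hello" delivered as the chunks "blob 5\0he" and "llo",
the counter ends at 5, which is the value that would then be compared with the
size announced by the helper and with the size parsed from the header.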

Thanks,
Christian.
