From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.5 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RCVD_IN_SORBS_SPAM, RP_MATCHES_RCVD shortcircuit=no autolearn=no autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 01F3320401 for ; Sat, 24 Jun 2017 17:22:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754923AbdFXRWq (ORCPT ); Sat, 24 Jun 2017 13:22:46 -0400 Received: from mail-wr0-f194.google.com ([209.85.128.194]:32795 "EHLO mail-wr0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751455AbdFXRWp (ORCPT ); Sat, 24 Jun 2017 13:22:45 -0400 Received: by mail-wr0-f194.google.com with SMTP id x23so20384224wrb.0 for ; Sat, 24 Jun 2017 10:22:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=I1YdlXxQyVV9KlBaBkLLPi5700nZG+5oOo5pVzPqMBg=; b=LWtWfH3IpCIsuAoDGraUlvt4kRFYdFp0JlIMlqN9TWlWtzFip9947cbtUbfip9SJtl h9Ysn8aXhanLTUVHI8EhvpbaHvk2VbIQ6WQBpPq/LtYhA8prisG3SOAgfIZPepbTjIaD U5lqwacuzFRgSx1vRFdvDSPHMOgxNkcik92aW+vX6H3vkVesL4NKH45L6ohSls+Ou/CW 46Rylk3AVpU2pSPhH3EypjIoeYWE/OJeW7WZ4fhQVmUlsD0b+EfWo/ZnSKLXPy4yWhYv EVUorVb+AuU6Kc/fXH7VRCGp27FHvBuPjYoTFA2BvvvbKECBu3IHhFyZt0R3V42CIJvo A6Ng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=I1YdlXxQyVV9KlBaBkLLPi5700nZG+5oOo5pVzPqMBg=; b=G+DwdRbNmclbJMqN8anrtg90O6aB8Bxx3wlawg8GeTXV4m6D4W/YPJNkTewoHqf5nb Ar2GH8sUGZY4N97WxPY5A9LqeXwHuQHn4Efc8K1EWU3ZUzZyBo8QFvGKGdAHU+GRRjpd Xazz4bA1H6EiPiU5j9EUfnCq9kRiaNvir+QVC+MFwlnaGffYPXunK5LCSozKG2dLRtkg H/87ldjkRFNXoZEbrJjSoZj5T4+t4sW8IzD5j1hdPJOJUFhbEwv4dXIzqvt4WWnuvclo W2xqb2nAi52jHt2A8q+T6A9xIfAJJdtySXLI/glcSyN4UzCym6beU/rp8vUWnRXc3cqC 5pdw== X-Gm-Message-State: AKS2vOzITku7jpADeonR5Lq0TDv/XiQYvgQF0kPVFxvdxSuVJKwEALdS bGe55xIdZc0/jClYYy4= X-Received: by 10.28.132.210 with SMTP id g201mr8876148wmd.26.1498324963760; Sat, 24 Jun 2017 10:22:43 -0700 (PDT) Received: from slxbook4.fritz.box (p5DDB763B.dip0.t-ipconnect.de. [93.219.118.59]) by smtp.gmail.com with ESMTPSA id a99sm3786808wrc.64.2017.06.24.10.22.41 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sat, 24 Jun 2017 10:22:42 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [PATCH v5 5/5] convert: add "status=delayed" to filter process protocol From: Lars Schneider In-Reply-To: <20170624141941.usy2pyhid3jrf3ku@sigill.intra.peff.net> Date: Sat, 24 Jun 2017 19:22:40 +0200 Cc: git@vger.kernel.org, gitster@pobox.com, tboegi@web.de, e@80x24.org, ttaylorr@github.com, peartben@gmail.com Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170601082203.50397-1-larsxschneider@gmail.com> <20170601082203.50397-6-larsxschneider@gmail.com> <20170624141941.usy2pyhid3jrf3ku@sigill.intra.peff.net> To: Jeff King X-Mailer: Apple Mail (2.3124) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org > On 24 Jun 2017, at 16:19, Jeff King wrote: >=20 > On Thu, Jun 01, 2017 at 10:22:03AM +0200, Lars Schneider wrote: >=20 >> Some `clean` / `smudge` filters might require a significant amount of >> time to process a single blob (e.g. the Git LFS smudge filter might >> perform network requests). During this process the Git checkout >> operation is blocked and Git needs to wait until the filter is done = to >> continue with the checkout. >>=20 >> Teach the filter process protocol (introduced in edcc858) to accept = the >> status "delayed" as response to a filter request. Upon this response = Git >> continues with the checkout operation. After the checkout operation = Git >> calls "finish_delayed_checkout" which queries the filter for = remaining >> blobs. If the filter is still working on the completion, then the = filter >> is expected to block. If the filter has completed all remaining blobs >> then an empty response is expected. >>=20 >> Git has a multiple code paths that checkout a blob. Support delayed >> checkouts only in `clone` (in unpack-trees.c) and `checkout` = operations. >=20 > It might be worth giving a reason in this last paragraph. I think the > reason is "because it's more complicated for the caller, as they have = to > be OK with out-of-order processing and remembering to go back and = handle > the delayed cases". Correct! However, my real reason was that these code paths process all files of the tree. Therefore the "out-of-order" processing can be effective.=20 How about this: Git has a multiple code paths that checkout a blob. Support delayed checkouts only in `clone` (in unpack-trees.c) and `checkout` operations for now. The optimization is most effective in these code paths as all=20= files of the tree are processed. > Speaking of which, _are_ we OK with out-of-order processing in things > like checkout? Certainly we care about deleting items before checking > out new ones (so getting rid of "foo/bar" to make room for "foo" and > vice versa). I guess it's OK as long as the delayed items are always > filters that check out new items. Junio noticed that too but thinks we should be OK: "[...] We checkout removals first to make room so that creation of a path X can succeed if an existing path X/Y that used to want to see X as a directory can succeed [...]" = http://public-inbox.org/git/xmqqvavotych.fsf@gitster.mtv.corp.google.com/ >> ... >=20 >> +After Git received the pathnames, it will request the corresponding >> +blobs again. These requests contain a pathname and an empty content >> +section. The filter is expected to respond with the smudged content >> +in the usual way as explained above. >> +------------------------ >> +packet: git> command=3Dsmudge >> +packet: git> pathname=3Dpath/testfile.dat >> +packet: git> 0000 >> +packet: git> 0000 # empty content! >> +packet: git< status=3Dsuccess >> +packet: git< 0000 >> +packet: git< SMUDGED_CONTENT >> +packet: git< 0000 >> +packet: git< 0000 # empty list, keep "status=3Dsuccess" = unchanged! >> +------------------------ >=20 > Out of curiosity, what should happen if we re-ask for a pathname that > wasn't mentioned in list_available_blobs? I guess it would depend on = the > filter implementation, but probably most would just block (since we > wouldn't have asked for can-delay). The other option is returning an > error, I suppose, but blocking and filling the request sounds > friendlier. I agree that is up to the filter. I would expect the blocking strategy. The filter cannot delay it anymore as we have not send the "can-delay" flag. > ... >> diff --git a/cache.h b/cache.h >> index ae4c45d379..523ead1d95 100644 >> --- a/cache.h >> +++ b/cache.h >> @@ -1543,16 +1543,20 @@ extern int ident_cmp(const struct ident_split = *, const struct ident_split *); >> struct checkout { >> struct index_state *istate; >> const char *base_dir; >> + struct delayed_checkout *delayed_checkout; >> int base_dir_len; >=20 > Should base_dir_len and base_dir be kept together? >=20 > I suspect you ordered it this way because... >=20 >> unsigned force:1, >> quiet:1, >> not_new:1, >> refresh_cache:1; >> }; >> -#define CHECKOUT_INIT { NULL, "" } >> +#define CHECKOUT_INIT { NULL, "", NULL } >> + >=20 > ...you wanted to add the NULL to the initializer here. But it's not > necessary. Once one element of a struct is initialized, all the > remaining members are initialized according to the usual static rules > (so NULL for pointers). We're already using that to zero out > base_dir_len and the flags (the "" is necessary because we want a real > empty string, not NULL). >=20 > Since you want NULL for your new member, you can just leave the > initializer as-is and it will do the right thing. That was indeed my reasoning. I am no C expert and I wasn't sure about the initialization. The "Once one element of a struct is initialized" thing is news to me :-) Thanks! I'll keep the base_dir* together then! >> @@ -533,7 +534,8 @@ static int start_multi_file_filter_fn(struct = subprocess_entry *subprocess) >> if (err) >> goto done; >>=20 >> - err =3D packet_writel(process->in, "capability=3Dclean", = "capability=3Dsmudge", NULL); >> + err =3D packet_writel(process->in, >> + "capability=3Dclean", "capability=3Dsmudge", = "capability=3Ddelay", NULL); >=20 > Why do we need to tell the filter we know about delay? Shouldn't it = just > need to tell us that it knows about delay, and then we choose whether = to > ask for can-delay for particular entries? Because in the protocol I defined that the filter needs to answer with a strict subset of this list [1]. I thought that this would make the = protocol more future proof/backward compatible. Because the filter is not allowed = to answer with something that Git does not understand. [1] = https://github.com/git/git/blob/5402b1352f5181247405fbff1887008a0cb3b04a/D= ocumentation/gitattributes.txt#L408-L411 >=20 >> @@ -647,6 +653,14 @@ static int apply_multi_file_filter(const char = *path, const char *src, size_t len >> if (err) >> goto done; >>=20 >> + if (CAP_DELAY & entry->supported_capabilities && >> + dco && dco->state =3D=3D CE_CAN_DELAY) { >=20 > In a complicated conditional with bit operations, we usually put the = bit > operation in its own parentheses so it's more obvious that it wasn't > supposed to be "&&". Like: >=20 > if ((CAP_DELAY & entry->supported_capabilities) && > dco && dco->state =3D=3D CE_CAN_DELAY)) Agreed! > The operator precedence is such that it works without them, so this is > just a style question (I'd also usually put the flags field before the > flag itself, but that's really getting into aesthetics). You mean (entry & CAP_DELAY) instead of (CAP_DELAY & entry)? > The dco check here is basically asking if can-delay is enabled for the > whole checkout. Would we ever need to mark a specific entry as > not-delayable? Probably not from the start, but I wondered if = something > might get marked as a failure and need to be retried without delay or > similar. It's probably not worth going down that road until some = caller > needs it. Agreed. >=20 >> @@ -662,14 +676,78 @@ static int apply_multi_file_filter(const char = *path, const char *src, size_t len >> if (err) >> goto done; >>=20 >> - err =3D strcmp(filter_status.buf, "success"); >> + if (can_delay && !strcmp(filter_status.buf, "delayed")) { >> + dco->state =3D CE_DELAYED; >> + string_list_insert(&dco->filters, cmd); >> + string_list_insert(&dco->paths, path); >=20 > These string_list_insert()s have bad worst-case performance, because > they keep the list sorted. So the worst case is reverse-sorted input, > where each entry goes at the front of the existing array and we have = to > memmove() the whole array on each insert, and we do O(n^2) byte = copies. >=20 > The best case is when the input is already sorted. And I think in > general that's how we'll iterate over paths to checkout. The cmds are > more likely to be random, but we're also not likely to have more than = a > handful of them. So this is probably an OK data structure to use, at > least for now. Agreed! Up until now I tested the delayed technique only with my small = test=20 case. It will become really interesting when GitLFS implements this = feature. Then I can test it with our repos that have +20k filtered files. At this = point I will be able to identify the spots that require more optimization. >> + for (;;) { >> + const char* pre =3D "pathname=3D"; >> + const int pre_len =3D strlen(pre); >> + line =3D packet_read_line(process->out, NULL); >> + if (!line) >> + break; >> + err =3D strlen(line) <=3D pre_len || strncmp(line, pre, = pre_len); >> + if (err) >> + goto done; >> + string_list_insert(delayed_paths, = xstrdup(line+pre_len)); >> + } >=20 > I think the prefix matching can be done more simply (and slightly more > efficiently) with: >=20 > const char *path; > ... > err =3D !skip_prefix(line, "pathname=3D", &path); > if (err) > goto done; > string_list_insert(delayed_paths, xstrdup(path)); >=20 > I notice we loop here, though. Would we want to be resilient to seeing > other keys besides pathname? Something like: >=20 > while ((line =3D packet_read_line(process->out, NULL)) { > const char *path; > if (skip_prefix(line, "pathname=3D", &path)) > string_list_insert(delayed_paths, xstrdup(path); > else > ; /* ignore unknown keys */ > } >=20 > I can imagine some future where the filter passes back additional > information about the delayed item. It would be nice if older versions > of Git would just quietly ignore that if they don't know how to handle > it. I agree with all the above. skip_prefix is neat! Although it's so dominant in the source code I never noticed it. >> +int finish_delayed_checkout(struct checkout *state) >> +{ >> + int errs =3D 0; >> + struct string_list_item *filter, *path; >> + struct delayed_checkout *dco =3D state->delayed_checkout; >> + >> + if (!state->delayed_checkout) { >> + return errs; >> + } >=20 > Style: we'd usually omit the braces here. Sure! I'll remove it. (This is the only rule of the Git style guide that don't agree with :-)=20 >=20 >> + if (available_paths.nr <=3D 0) { >> + /* Filter responded with no entries. = That means >> + the filter is done and we can remove = the >> + filter from the list (see >> + "string_list_remove_empty_items" call = below). >> + */ >=20 > This is true because we know the filter would block until it had at > least one item to return. >=20 > It does seem to put a lot of faith in the filter. What happens if we = hit > this conditional and we know that we still have entries in dco->paths = to > be handled? We check this later as you noticed below. >> + /* In dco->paths we store a list of all delayed = paths. >> + The filter just send us a list of available = paths. >> + Remove them from the list. >> + */ >> + filter_string_list(&dco->paths, 0, >> + &remove_available_paths, = &available_paths); >=20 > This is O(m log n), which is fine. Since both lists are sorted, we = could > technically do it in an O(m+n) merge, but I don't think we have a > function to do that easily. >=20 > Sorry to harp on run-times, but it's one place I think it would be = easy > to subtly screw this up, so I'm trying to pay close attention. No worries at all! Thanks for thinking it through on that level. As I wrote above - I plan to really profile it under "real life = conditions" as soon as we have full GitLFS support. >> ... >=20 >> + /* At this point we should not have any delayed paths anymore. = */ >> + errs |=3D dco->paths.nr; >=20 > And I guess this answers my question above about a filter with no > entries. When we bail from the loop, we'll hit this point and = complain. >=20 > We probably should call string_list_clear(&dco->paths) here to avoid > leaking the paths on error (possibly we should also print them? Or = maybe > they'd be warned about elsewhere). Agreed, I will clear the list. I'll also print the items because as far as I can see they are not printed elsewhere. Git just says: fatal: unable to checkout working tree warning: Clone succeeded, but checkout failed. You can inspect what was checked out with 'git status' and retry the checkout with 'git checkout -f HEAD' How about this? errs |=3D dco->paths.nr; for_each_string_list_item(path, &dco->paths) { warning("%s was not processed properly.", path->string); } string_list_clear(&dco->paths, 0); The output would be: warning: test-delay10.a was not processed properly. warning: test-delay10.b was not processed properly. warning: test-delay11.a was not processed properly. warning: test-delay20.a was not processed properly. fatal: unable to checkout working tree warning: Clone succeeded, but checkout failed. You can inspect what was checked out with 'git status' and retry the checkout with 'git checkout -f HEAD' I contemplated about the warning text. "$FILE was not filtered properly." is technical more correct but maybe it would confuse the user? Plus, at this point I don't really know what filter should have given me the files. That's why I can't print it in=20 the warning message. >=20 >> @@ -179,11 +254,36 @@ static int write_entry(struct cache_entry *ce, >> /* >> * Convert from git internal format to working tree = format >> */ >> - if (ce_mode_s_ifmt =3D=3D S_IFREG && >> - convert_to_working_tree(ce->name, new, size, &buf)) = { >> - free(new); >> - new =3D strbuf_detach(&buf, &newsize); >> - size =3D newsize; >> + if (ce_mode_s_ifmt =3D=3D S_IFREG) { >> + struct delayed_checkout *dco =3D = state->delayed_checkout; >> + if (dco && dco->state !=3D CE_NO_DELAY) { >> + /* Do not send the blob in case of a = retry. */ >> + if (dco->state =3D=3D CE_RETRY) { >> + new =3D NULL; >> + size =3D 0; >> + } >> + ret =3D async_convert_to_working_tree( >> + ce->name, new, size, &buf, dco); >> + if (ret && dco->state =3D=3D CE_DELAYED) = { >> + free(new); >> + /* Reset the state of the next = blob */ >> + dco->state =3D CE_CAN_DELAY; >> + goto finish; >> + } >=20 > Hmm. This "reset the state" bit at the end surprised me. I guess it's > not wrong, but it goes against the mental model I had formed above. ;) >=20 > We really are using dco->state as a per-entry state flag. It just > happens to be in a persistent shared struct. I don't think it's wrong, > it was mostly just surprising. I don't know if it's worth trying to > simplify, but I think you could do it by: >=20 > 1. Passing back the "was delayed" state from async_convert... in the > return value or via a separate out-parameter. In the beginning I had it implemented that way. But that meant that I had to pass two variables through the entire convert stack: async_convert_to_working_tree -> convert_to_working_tree_internal --> apply_filter ---> apply_multi_file_filter >=20 > 2. Setting dco->state to CE_RETRY at the top of finish_delayed... so > that it's clear that it's about what phase of the conversation > we're in. I could do that. However, I thought it is safer to set the state = *before* every checkout operation in case convert.c messes with this field (it should not in this phase). > But I'm OK with it as-is, too. I'll try 2. >> diff --git a/t/t0021/rot13-filter.pl b/t/t0021/rot13-filter.pl >> index 5e43faeec1..f0dc0aff4a 100644 >> --- a/t/t0021/rot13-filter.pl >> +++ b/t/t0021/rot13-filter.pl >=20 > I'll admit to skimming the perl changes here, but I didn't see = anything > obviously wrong. Thanks a lot for the review, Lars