From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-3.2 required=3.0 tests=AWL,BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from out1.vger.email (out1.vger.email [IPv6:2620:137:e000::1:20]) by dcvr.yhbt.net (Postfix) with ESMTP id 568801F54E for ; Fri, 5 Aug 2022 16:36:57 +0000 (UTC) Authentication-Results: dcvr.yhbt.net; dkim=pass (2048-bit key; unprotected) header.d=github.com header.i=@github.com header.b="LbUgfAi6"; dkim-atps=neutral Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238518AbiHEQgx (ORCPT ); Fri, 5 Aug 2022 12:36:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37932 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232185AbiHEQgw (ORCPT ); Fri, 5 Aug 2022 12:36:52 -0400 Received: from mail-pl1-x62c.google.com (mail-pl1-x62c.google.com [IPv6:2607:f8b0:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C45D617AA6 for ; Fri, 5 Aug 2022 09:36:50 -0700 (PDT) Received: by mail-pl1-x62c.google.com with SMTP id iw1so3084509plb.6 for ; Fri, 05 Aug 2022 09:36:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com; s=google; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc; bh=u6tWR9mCoaRw0f9SlGoDAahKwr5mFV4eT2sf777E9VI=; b=LbUgfAi6L5LZdinRJnLSpV8KCXpEI0JEqeoPLaIklcDFyflIncM3V+5r40y5ssF3It oenGsKWEj+jFVvlezw4SFjmni/fRvqKMycLcivlKeF/f2b7txKqIhC8qMoQSjVwUCyP/ YS+9fN7YAg7d/dFjgeDXel5XxTMdzQBAgflnTnQflTBlmPYXH5WbloUo0FYk5dxfyYtf W08arkSvActZIaTtgklfj0RD750HyQ8fS/RyUbSkMUdnm2naBBM1+ZAf4Z8QVy97biFS TyRgOg2VJuublMk9TbdAU1NdhnKCXPDsXCYCAOr5mAED5RvRWDDhIyPaLsJ6SE466BzV tv7A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc; bh=u6tWR9mCoaRw0f9SlGoDAahKwr5mFV4eT2sf777E9VI=; b=Rnojst58iTL0vYxpyU9f2NdGdie5FeEBIOE+rWSZFiIvHHOjcWzW/XHmUsHbCBX9VG 5448H2LXXnO+x44UN6QyMq9FaZhhKQAvsK1G4P7+O6S8KvzwqiH+gJCICm8BqW58f8dA xc7fWXIg6ADGEInyYYqUqrO4VU/rrpMiw6nZDWLea31yTkv8JhNYMT1IJoK2q4Zlixza R5aLtDDKKKxcEv97q6et+nhgmSG3OgA42aIxC0G9kiM2SH/9OMzuhWh1iTyiLG7srwGb iQEuP+FiohsH+5EWukQOC6CLbNyPa6VVjFTY0vDlTpVm21EN+s7nAUDp+xVFz9FpEDsw LVjg== X-Gm-Message-State: ACgBeo0yZoV0iF3jRKrYv2wJohRsMIHayoOqJZ6NZQ4OgrvEXLp8KWgH E400CPRAyF/tS6zSOwwOSfhg X-Google-Smtp-Source: AA6agR6Fo1uYPUMUMDYanEIsiGWYruKf/1Yu5wBk64Tyn97BKqAa882ND8nQwFGWlFMJG1pHRMWaqg== X-Received: by 2002:a17:902:c943:b0:16f:15a0:d497 with SMTP id i3-20020a170902c94300b0016f15a0d497mr7591381pla.80.1659717410228; Fri, 05 Aug 2022 09:36:50 -0700 (PDT) Received: from [192.168.0.104] (cpe-172-249-73-112.socal.res.rr.com. [172.249.73.112]) by smtp.gmail.com with ESMTPSA id t22-20020a17090aba9600b001f5513f6fb9sm3257375pjr.14.2022.08.05.09.36.48 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 05 Aug 2022 09:36:49 -0700 (PDT) Message-ID: <3825ef9a-4c71-21ed-6452-bbd322ca839c@github.com> Date: Fri, 5 Aug 2022 09:36:47 -0700 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.12.0 Subject: Re: [PATCH 4/4] unpack-trees: handle missing sparse directories Content-Language: en-US To: Junio C Hamano , Victoria Dye via GitGitGadget Cc: git@vger.kernel.org, derrickstolee@github.com, shaoxuan.yuan02@gmail.com, newren@gmail.com References: <016971a67112efe2d15fe7908e86c5d2631f8e66.1659645967.git.gitgitgadget@gmail.com> From: Victoria Dye In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Junio C Hamano wrote: > "Victoria Dye via GitGitGadget" writes: > >> From: Victoria Dye >> >> If a sparse directory does not exist in the index, unpack it at the >> directory level rather than recursing into it an unpacking its contents >> file-by-file. This helps keep the sparse index as collapsed as possible in >> cases such as 'git reset --hard' restoring a sparse directory. > > My reading hiccuped at "does not exist in". After reading the above > twice, I think the situation is that there is a directory A/B in > which a file C sits, and A/ is marked as sparse and the paragraph is > talking about directory A/B. Because the index has A/ as a tree in > index, A/B does not exist in the index. > > When we _somehow_ need to ensure that A/B exists in the index for > _unknown_ reason, we could flatten the index fully and have A/B/C as > a blob (without A or A/B in the index proper, even though they may > appear in cache-tree), but if we stop at removing A and adding A/B > still as a tree without showing A/B/C in the index, we may gain > efficiency, under the assumption that we do not have to access A/B/C > and its siblings. > > Did I read the intention correctly? I suspect future readers of the > commit that would result from this patch would also wonder what the > "somehow" and "unknown" above are. > If I'm reading this correctly, it's not quite what I meant - the situation this patch addresses is when _nothing_ in the tree rooted at 'A/' (files, subdirectories) exists in the index, but 'unpack_trees()' is merging tree(s) where 'A/' *does* exist. I originally wanted to write "If a sparse directory *is deleted from* the index", but that doesn't cover the case where 'A/' never existed in the index and we're merging in tree(s) that introduce it. Maybe it would be clearer to describe it with a different perspective: "If 'unpack_callback()' is merging a new tree into a sparse index, merge the tree as a sparse directory rather than traversing its contents" or something like that? I'll try to come up with a better way of explaining this and update in V2. >> A directory is determined to be truly non-existent in the index (rather than >> the parent of existing index entries), if 1) its path is outside the sparse >> cone and 2) there are no children of the directory in the index. This check >> is performed by 'missing_dir_is_sparse()' in 'unpack_single_entry()'. If the >> directory is a missing sparse dir, 'unpack_single_entry()' will proceed >> with unpacking it. This determination is also propagated back up to >> 'unpack_callback()' via 'is_missing_sparse_dir' to prevent further tree >> traversal into the unpacked directory. > > Makes sense. > >> diff --git a/unpack-trees.c b/unpack-trees.c >> index 8a454e03bff..aa62cef20fe 100644 >> --- a/unpack-trees.c >> +++ b/unpack-trees.c >> @@ -1069,6 +1069,53 @@ static struct cache_entry *create_ce_entry(const struct traverse_info *info, >> return ce; >> } >> >> +/* >> + * Determine whether the path specified corresponds to a sparse directory >> + * completely missing from the index. This function is assumed to only be >> + * called when the named path isn't already in the index. >> + */ >> +static int missing_dir_is_sparse(const struct traverse_info *info, >> + const struct name_entry *p) >> +{ >> + int res, pos; >> + struct strbuf dirpath = STRBUF_INIT; >> + struct unpack_trees_options *o = info->data; >> + >> + /* >> + * First, check whether the path is in the sparse cone. If it is, >> + * then this directory shouldn't be sparse. >> + */ >> + strbuf_add(&dirpath, info->traverse_path, info->pathlen); >> + strbuf_add(&dirpath, p->path, p->pathlen); >> + strbuf_addch(&dirpath, '/'); >> + if (path_in_cone_mode_sparse_checkout(dirpath.buf, o->src_index)) { >> + res = 0; >> + goto cleanup; >> + } > > OK. > >> + /* >> + * Given that the directory is not inside the sparse cone, it could be >> + * (partially) expanded in the index. If child entries exist, the path >> + * is not a missing sparse directory. >> + */ >> + pos = index_name_pos_sparse(o->src_index, dirpath.buf, dirpath.len); >> + if (pos >= 0) >> + BUG("cache entry '%s%s' shouldn't exist in the index", >> + info->traverse_path, p->path); > > So, we fed 'p' to this function, and we didn't expect it to exist in > the index if it is outside the sparse cone. > Correct; if an index entry with the name 'p' existed, 'src[0]' would not be NULL in 'unpack_callback()'/'unpack_single_entry()' and this function wouldn't have been called. >> + pos = -pos - 1; > > This is the location that a cache entry for the dirpath.buf would > sit in the index if it were a sparse entry. > >> + if (pos >= o->src_index->cache_nr) { >> + res = 1; >> + goto cleanup; >> + } >> + res = strncmp(o->src_index->cache[pos]->name, dirpath.buf, dirpath.len); > > So, we found where dirpath.buf would be inserted. If we (can) look > at the cache entry that is currently sitting at that position, and > find that it is a path inside it, then res becomes zero. IOW, we > found that the sparse directory is missing in the index, but there > is a path that is in the directory recorded in the index. > > The current entry, on the other hand, is outside the dirpath.buf, > then res becomes non-zero. We are asked to yield true (i.e. > nonzero) if and only if the given directory and all paths in that > directory are missing from the index, so that is what happens here. > Sounds OK. > > And the "out of bounds" check just means that the entries that sit > near the end of the index sort strictly before our dirpath.buf, so > they cannot be inside our directory, which is also the case where we > are expected to yield "true". > > OK. > >> + >> +cleanup: >> + strbuf_release(&dirpath); >> + return res; >> +} >> + >> /* >> * Note that traverse_by_cache_tree() duplicates some logic in this function >> * without actually calling it. If you change the logic here you may need to >> @@ -1078,21 +1125,40 @@ static int unpack_single_entry(int n, unsigned long mask, >> unsigned long dirmask, >> struct cache_entry **src, >> const struct name_entry *names, >> - const struct traverse_info *info) >> + const struct traverse_info *info, >> + int *is_missing_sparse_dir) >> { >> int i; >> struct unpack_trees_options *o = info->data; >> unsigned long conflicts = info->df_conflicts | dirmask; >> + const struct name_entry *p = names; >> >> - if (mask == dirmask && !src[0]) >> - return 0; >> + *is_missing_sparse_dir = 0; >> + if (mask == dirmask && !src[0]) { >> + /* >> + * If the directory is completely missing from the index but >> + * would otherwise be a sparse directory, we should unpack it. >> + * If not, we'll return and continue recursively traversing the >> + * tree. >> + */ >> + if (!o->src_index->sparse_index) >> + return 0; >> + >> + /* Find first entry with a real name (we could use "mask" too) */ >> + while (!p->mode) >> + p++; >> + >> + *is_missing_sparse_dir = missing_dir_is_sparse(info, p); >> + if (!*is_missing_sparse_dir) >> + return 0; > > Interesting. Nobody checked if the name in 'p' is a directory up to > this point, but missing_dir_is_sparse() does not care? The only > thing we know is that !src[0], so the name represented by 'p' does > not exist in the index. IIRC, the function only checked if p names > a path inside or outside the sparse cone and if there are or are not > paths that would be inside the path if p _were_ a directory but I > didn't read it carefully what it did when the caller fed a path to a > non-directory to the function. As long as it will say "no" to such > a situation, I won't complain, but I have to wonder how the logic > around here tells apart a case where a sparse directory is > completely hidden (which missing_dir_is_sparse() gives us) from a > case where there is absolutely nothing, not a directory and not a > blob, at the path in the index. Or perhaps it works correctly > without having to tell them apart? I dunno. > I wrote 'missing_dir_is_sparse()' in an attempt keep some complex logic out of the already-complicated 'unpack_single_entry()', but as part of that it relies on information already established by its caller. We know 'p' is a directory because 'missing_dir_is_sparse()' is called inside a 'mask == dirmask' condition. 'mask' is a representation of which trees being traversed have an entry with the given name and 'dirmask' is a representation of which of those entries are directories, so the only way 'mask == dirmask' and 'p' is *not* a directory is if the currently-traversed entries in all of the trees do not exist. *That* won't happen because 'unpack_callback()' won't be invoked at all by 'traverse_trees()' if 'mask' is 0. Given that it requires jumping through multiple function invocations and callbacks to figure that out, I can add some assertions or 'return 0' conditions at the beginning of 'missing_dir_is_sparse()' to codify its assumptions. Even if the assertions are slightly redundant now, they'll make the code clearer and make the function safer for reuse. > By the way, there is a comment before the function that reminds us > that traverse_by_cache_tree() may need a matching change whenever a > change is made to this function. Does this require such a matching > change? In this case, I *think* a parallel change there is unnecessary. 'traverse_by_cache_tree()' iterates on the contents of the index, so its equivalent of 'src[0]' always exists when it merges individual entries. And, to trigger that function, the cache tree needs to identically match the tree(s) being merged; if a subtree doesn't exist in either the index or any of the merging trees, there's nothing to merge. In any case, thanks for the reminder - I didn't notice the comment while implementing this, so it could have just as easily needed fixing.