From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_BLOCKED, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id 4067F1F86C for ; Tue, 24 Nov 2020 22:20:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730780AbgKXUTs (ORCPT ); Tue, 24 Nov 2020 15:19:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730773AbgKXUTr (ORCPT ); Tue, 24 Nov 2020 15:19:47 -0500 Received: from mail-ot1-x341.google.com (mail-ot1-x341.google.com [IPv6:2607:f8b0:4864:20::341]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1D4FBC0613D6 for ; Tue, 24 Nov 2020 12:19:47 -0800 (PST) Received: by mail-ot1-x341.google.com with SMTP id o3so44693ota.8 for ; Tue, 24 Nov 2020 12:19:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=EGZ8JXXfK1Nv9a12ZCxOXO9K/uuoZ2cgvRcRfJwY60A=; b=KmkIEdPXr/Jq5ZoBloUcaj6i5x0wdCQyEFn11pFooBx0W0TWaNm4H2aYcSUjtXS9ap ldZVHMlvsLu3RrQW9G3ls2RFSszAHlm8ctRzxpPHpbQudBstTx7rnhheYh17at9Nqjvo gCiySKtsTMjt2FPnkXz1hu3P42swiv+s3WVZwsAQDB/f5QsOwiherZRWailNnwOEgEYp fN/k1P/lTEIXBCVOKztMTZjTIuEfjZR6ekgLKFqXwairJKYGpcaV3REof/tsh+pBszJC MgEvCVfkdrDhuBdL/qJQ8HxkFKq3hY/TvTXEKT7Ua7C/nDwB3nfJD2M5zY9vbaNU2Y96 HTYw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=EGZ8JXXfK1Nv9a12ZCxOXO9K/uuoZ2cgvRcRfJwY60A=; b=PYQeAunI/fu3dX8wrZywKBK6wdY/gkyl5oyfSnLRz7B18UmC6OlHFFuWZk67sSaaCh /0qqR12bx8P9M6i6loxOCOm8qv9LQ5j0md1t610ga5UX54b0ko2zhDKSOMnL8VDFk+2X tRfRV/xvI0BFVgfQXYCn51ZpmLZZ8D8eB9L3sLvU1+kGrRZfrDThi/pnQdNcvoMqKr1A z3CIcLXXSRpYHhkYnUXiwaSEsuN9YHgpkNAA5l/aVm1BYfUeKzKggQ/+JLVUxVJnzUie p2re1rUcAxOGA0zXyxpfk3SitEaOdVTnFMy6DSPLCdq+UIbD2dKQYIvyPy7GOaHkoUSE YjDw== X-Gm-Message-State: AOAM531F5JYCcl7pKvC6ueUFwfHqxTnQDedOSm5sGX5v0c5+tAqgr/Hh 18Eti3UUP1tddeltf97DJTAi8oP8l7vzhWVMwG8= X-Google-Smtp-Source: ABdhPJwLGcIMCE73KYuyvuQn/CWTjww2NiBvKqZE+2MmTT3WObqzzYZ+zv8zVbWSJbK7OSmMsGrW8okWrv6QM3nDziM= X-Received: by 2002:a05:6830:1002:: with SMTP id a2mr269376otp.316.1606249186339; Tue, 24 Nov 2020 12:19:46 -0800 (PST) MIME-Version: 1.0 References: <20201102204344.342633-16-newren@gmail.com> <20201112201524.3438029-1-jonathantanmy@google.com> In-Reply-To: From: Elijah Newren Date: Tue, 24 Nov 2020 12:19:35 -0800 Message-ID: Subject: Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go To: Jonathan Tan Cc: Git Mailing List Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Thu, Nov 12, 2020 at 2:30 PM Elijah Newren wrote: > > On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan wrote: > > > > > + /* > > > + * Now we've got an OID for last_directory in dir_info. We need to > > > + * add it to info->versions for it to be part of the computation of > > > + * its parent directories' OID. But first, we have to find out what > > > + * its' parent name was and whether that matches the previous > > > + * info->offsets or we need to set up a new one. > > > + */ > > > + prev_dir = info->offsets.nr == 0 ? NULL : > > > + info->offsets.items[info->offsets.nr-1].string; > > > + if (new_directory_name != prev_dir) { > > > + uintptr_t c = info->versions.nr; > > > + string_list_append(&info->offsets, > > > + new_directory_name)->util = (void*)c; > > > + } > > > > Because of the possible jump of 2 or more directories that I mentioned > > earlier, there may be gaps in the offsets. So it makes sense that we > > sometimes need to insert an intermediate one. > > > > I wonder if the code would be clearer if we had explicit "begin tree" > > and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE > > and LOFS_END_TREE). Here we have "end tree" (because of the way the > > entries were sorted) but not "begin tree". If we had "begin tree", we > > probably would be able to create the necessary offsets in a loop at that > > stage, and the reasoning about the contents of the offsets would not be > > so complicated. > > > > If we really only want one side (i.e. you don't want to introduce a > > synthetic entry just to mark the end or the beginning), then my personal > > experience is that having the "begin" side is easier to understand, as > > the state is more natural and easier to reason about. (Unlike here, > > where there could be gaps in the offsets and the reader has to > > understand that the gaps will be filled just in time.) But that may just > > be my own experience. > > Interesting, I'll take a look into it. > So, I've been going through making all the changes you and Derrick suggested or highlighted...but I don't see how to tackle this one. Perhaps I'm missing something. Using your example of LOFS_BEGIN_TREE and LOFS_END_TREE from list-objects-filter.c, I note that you handle it as part of traverse_trees(), and thus you have a very natural "I'm going to process this tree" point and "I'm done processing this tree" point. There is no equivalent mapping to merge-ort that I can figure out. merge-ort does use traverse_trees() in collect_merge_info(), and fills opt->priv->paths with all full pathnames (both files and directories) found in any of the three trees. But I cannot process files/directories at that time; rename detection needs traverse_trees() to be finished to have all paths so far. Further, the list of pathnames from traverse_trees is not necessarily complete; additional paths could be added by any of * Directory/file conflicts (need to move the file to a different location) * Directory/submodule conflicts (need to move something to a different location) * Add/add conflicts of files of different types (e.g. symlink/regular file; can't just content merge them with conflict markers) * Directory rename detection (can move new files or even directories on one side of history into a new directory on other side) Thus, after traverse_trees() ends, my rename detection stuff can add paths (including new directories), then process_entries() can add paths -- and remove some when the resolution is to delete. And the code here in question runs as part of the process_entries() loop. Now, we'd still be able to create synthetic BEGIN_TREE markers if we operated in lexicographic ordering, but process_entries() *must* operate in _reverse_ lexicographic ordering because: 1) subtrees need to be written out before trees are; hashes of those subtrees are used in the parent tree 2) it's the only sane way to handle directory/file conflicts; I need to know if all entries under the directory resolve to nothing; if not, the directory is still in the way when it comes time to process the file. Granted, I could do some tricky calculations based on the reverse lexicographic ordering of fullpaths (and their containing directories) to figure out where trees begin and end -- but that takes us to exactly what I *did* do. It was precisely this step that you thought should be made simpler, but I'm not seeing how to avoid it. For now, I'll keep the code as-is, but add more comments to both the data structure and the code. If I've missed something about how I could make use of your BEGIN_TREE idea, let me know and I'll look at it again.