git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: William Sprent via GitGitGadget <gitgitgadget@gmail.com>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"William Sprent" <williams@unity3d.com>
Subject: Re: [PATCH v3] fast-export: fix surprising behavior with --first-parent
Date: Tue, 21 Dec 2021 10:47:22 -0800	[thread overview]
Message-ID: <CABPp-BF7Rg3NH3UBnyzD6c1vuFh+FNrNc8eRvUffqmaLppUPkw@mail.gmail.com> (raw)
In-Reply-To: <pull.1084.v3.git.1639671789279.gitgitgadget@gmail.com>

On Thu, Dec 16, 2021 at 8:23 AM William Sprent via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: William Sprent <williams@unity3d.com>
>
> The revision traversal machinery typically processes and returns all
> children before any parent.  fast-export needs to operate in the
> reverse fashion, handling parents before any of their children in
> order to build up the history starting from the root commit(s).  This
> would be a clear case where we could just use the revision traversal
> machinery's "reverse" option to achieve this desired affect.
>
> However, this wasn't what the code did.  It added its own array for
> queuing.  The obvious hand-rolled solution would be to just push all
> the commits into the array and then traverse afterwards, but it didn't
> quite do that either.  It instead attempted to process anything it
> could as soon as it could, and once it could, check whether it could
> process anything that had been queued.  As far as I can tell, this was
> an effort to save a little memory in the case of multiple root commits
> since it could process some commits before queueing all of them.  This
> involved some helper functions named has_unshown_parent() and
> handle_tail().  For typical invocations of fast-export, this
> alternative essentially amounted to a hand-rolled method of reversing
> the commits -- it was a bunch of work to duplicate the revision
> traversal machinery's "reverse" option.
>
> This hand-rolled reversing mechanism is actually somewhat difficult to
> reason about.  It takes some time to figure out how it ensures in
> normal cases that it will actually process all traversed commits
> (rather than just dropping some and not printing anything for them).
>
> And it turns out there are some cases where the code does drop commits
> without handling them, and not even printing an error or warning for
> the user.  Due to the has_unshown_parent() checks, some commits could
> be left in the array at the end of the "while...get_revision()" loop
> which would be unprocessed.  This could be triggered for example with
>     git fast-export main -- --first-parent
> or non-sensical traversal rules such as
>     git fast-export main -- --grep=Merge --invert-grep
>
> While most traversals that don't include all parents should likely
> trigger errors in fast-export (or at least require being used in
> combination with --reference-excluded-parents), the --first-parent
> traversal is at least reasonable and it'd be nice if it didn't just drop
> commits. It'd also be nice for future readers of the code to have a
> simpler "reverse traversal" mechanism. Use the "reverse" option of the
> revision traversal machinery to achieve both.
>
> Even for the non-sensical traversal flags like the --grep one above,
> this would be an improvement. For example, in that case, the code
> previously would have silently truncated history to only those commits
> that do not have an ancestor containing "Merge" in their commit message.
> After this code change, that case would include all commits without
> "Merge" in their commit message -- but any commit that previously had a
> "Merge"-mentioning parent would lose that parent
> (likely resulting in many new root commits). While the new behavior is
> still odd, it is at least understandable given that
> --reference-excluded-parents is not the default.
>
> Signed-off-by: William Sprent <williams@unity3d.com>
> ---
>     fast-export: fix surprising behavior with --first-parent
>
>     Hi,
>
>     Here's a revised version of my previously submitted patch regarding
>     fast-export and how it reacts to the --first-parent flag.
>
>     Changes since v2:
>
>      * Changed commit message to include the detailed description of the
>        problem area as suggested by Elijah. I went back and forth with my
>        self about whether the message needs some "lead in", but it ended up
>        getting long without adding much.

If it feels weird quoting someone else so extensively and just using
their words for the commit message, we've used a "Commit-message-by:"
trailer in the past; that could be useful here.

>      * Cleaned up test such that the comparison with the --reverse output is
>        just below where it is generated.
>      * In test, moved all usages of --no-ff to just after 'git merge' .
>      * In test, replaced long 'git rev-list' incantation with simpler 'git
>        log' format.
>
>     Changes since v1:
>
>      * Moved revs.reverse assignment down to similar assignments below.
>      * Removed braces around single statement while loop.
>      * The test now only changes directory inside a sub-shell.
>      * Applied stylistic feedback on test such as: making redirections be on
>        the form >FILE etc.
>
>     There were questions raised about whether it makes sense at all for
>     fast-export to simply accept all git rev-list arguments whether they
>     have an effect or not - in particular for a flag like --reverse. I think
>     I agree that it is questionable behavior, or at least questionably
>     documented, but I also think it is out of scope for this change.
>
>     I did consider teaching fast-export to complain if given --reverse, but
>     it seemed inconsistent to me as it will gladly accept every other
>     rev-list argument (for example, "git fast-export HEAD --show-signature
>     --graph" works just fine).
>
>     Original Message: I've noticed that git fast-export exhibits some odd
>     behavior when passed the --first-parent flag. In the repository I was
>     working on, it would only output the initial commit before exiting.
>     Moreover, adding the --reverse flag causes it to behave differently and
>     instead output all commits in the first parent line that only have one
>     parent. My expectation is more or less that git fast-export should
>     output the same set of commits as git rev-list would output given the
>     same arguments, which matches how it acts when given revision ranges.
>
>     It seems like this behavior comes from the fact that has_unshown_parents
>     will never be false for any merge commits encountered when fast-export
>     is called with --first-parent. This causes the while loop to follow the
>     pattern of pushing all commits into the "commits" queue until the
>     initial commit is encountered, at which point it calls handle_tail which
>     falls through on the first merge commit, causing most of the commits to
>     be unhandled.
>
>     My impression is that this logic only serves to ensure that parents are
>     processed before their children, so in my patch I've opted to fix the
>     issue by delegating that responsibility to revision.c by adding the
>     reverse flag before performing the revision walk. From what I can see,
>     this should be equivalent to what the previous logic is trying to
>     achieve, but I can also see that there could be risk in these changes.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1084%2Fwilliams-unity%2Ffast-export%2Ffirst-parent-issues-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1084/williams-unity/fast-export/first-parent-issues-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/1084
>
> Range-diff vs v2:
>
>  1:  af552f37dbe ! 1:  312255260b1 fast-export: fix surprising behavior with --first-parent
>      @@ Metadata
>        ## Commit message ##
>           fast-export: fix surprising behavior with --first-parent
>
>      -    When invoking git-fast-export with the --first-parent flag on a branch
>      -    with merges, fast-export would early-out on processing the first merge
>      -    on the branch. If combined with --reverse, fast-export would instead
>      -    output all single parent commits on the branch.
>      +    The revision traversal machinery typically processes and returns all
>      +    children before any parent.  fast-export needs to operate in the
>      +    reverse fashion, handling parents before any of their children in
>      +    order to build up the history starting from the root commit(s).  This
>      +    would be a clear case where we could just use the revision traversal
>      +    machinery's "reverse" option to achieve this desired affect.
>
>      -    This commit makes fast-export output the same commits as rev-list
>      -    --first-parent, and makes --reverse not have an effect on which commits
>      -    are output.
>      +    However, this wasn't what the code did.  It added its own array for
>      +    queuing.  The obvious hand-rolled solution would be to just push all
>      +    the commits into the array and then traverse afterwards, but it didn't
>      +    quite do that either.  It instead attempted to process anything it
>      +    could as soon as it could, and once it could, check whether it could
>      +    process anything that had been queued.  As far as I can tell, this was
>      +    an effort to save a little memory in the case of multiple root commits
>      +    since it could process some commits before queueing all of them.  This
>      +    involved some helper functions named has_unshown_parent() and
>      +    handle_tail().  For typical invocations of fast-export, this
>      +    alternative essentially amounted to a hand-rolled method of reversing
>      +    the commits -- it was a bunch of work to duplicate the revision
>      +    traversal machinery's "reverse" option.
>
>      -    The fix involves removing logic within fast-export which was responsible
>      -    for ensuring that parents are processed before their children, which was
>      -    what was exiting early due to missing second parents. This is replaced
>      -    by setting 'reverse = 1' before revision walking, which, in conjuction
>      -    with topo_order, allows for delegating the ordering of commits to
>      -    revision.c. The reverse flag is set after parsing rev-list arguments to
>      -    avoid having it disabled.
>      +    This hand-rolled reversing mechanism is actually somewhat difficult to
>      +    reason about.  It takes some time to figure out how it ensures in
>      +    normal cases that it will actually process all traversed commits
>      +    (rather than just dropping some and not printing anything for them).
>      +
>      +    And it turns out there are some cases where the code does drop commits
>      +    without handling them, and not even printing an error or warning for
>      +    the user.  Due to the has_unshown_parent() checks, some commits could
>      +    be left in the array at the end of the "while...get_revision()" loop
>      +    which would be unprocessed.  This could be triggered for example with
>      +        git fast-export main -- --first-parent
>      +    or non-sensical traversal rules such as
>      +        git fast-export main -- --grep=Merge --invert-grep
>      +
>      +    While most traversals that don't include all parents should likely
>      +    trigger errors in fast-export (or at least require being used in
>      +    combination with --reference-excluded-parents), the --first-parent
>      +    traversal is at least reasonable and it'd be nice if it didn't just drop
>      +    commits. It'd also be nice for future readers of the code to have a
>      +    simpler "reverse traversal" mechanism. Use the "reverse" option of the
>      +    revision traversal machinery to achieve both.
>      +
>      +    Even for the non-sensical traversal flags like the --grep one above,
>      +    this would be an improvement. For example, in that case, the code
>      +    previously would have silently truncated history to only those commits
>      +    that do not have an ancestor containing "Merge" in their commit message.
>      +    After this code change, that case would include all commits without
>      +    "Merge" in their commit message -- but any commit that previously had a
>      +    "Merge"-mentioning parent would lose that parent
>      +    (likely resulting in many new root commits). While the new behavior is
>      +    still odd, it is at least understandable given that
>      +    --reference-excluded-parents is not the default.
>
>           Signed-off-by: William Sprent <williams@unity3d.com>
>
>      @@ t/t9350-fast-export.sh: test_expect_success 'merge commit gets exported with --i
>       +         git checkout -b topic1 &&
>       +         test_commit B &&
>       +         git checkout main &&
>      -+         git merge topic1 --no-ff &&
>      ++         git merge --no-ff topic1 &&
>       +
>       +         git checkout -b topic2 &&
>       +         test_commit C &&
>       +         git checkout main &&
>      -+         git merge topic2 --no-ff &&
>      ++         git merge --no-ff topic2 &&
>       +
>       +         test_commit D &&
>       +
>      -+         git rev-list --format="%ad%B" --first-parent --topo-order --no-commit-header main >expected &&
>      -+
>       +         git fast-export main -- --first-parent >first-parent-export &&
>       +         git fast-export main -- --first-parent --reverse >first-parent-reverse-export &&
>      ++         test_cmp first-parent-export first-parent-reverse-export &&
>       +
>       +         git init import &&
>       +         git -C import fast-import <first-parent-export &&
>       +
>      -+         git -C import rev-list --format="%ad%B" --topo-order --all --no-commit-header >actual &&
>      -+         git -C import rev-list --all >tmp &&
>      -+
>      -+         test_line_count = 4 tmp &&
>      ++         git log --format="%ad %s" --first-parent main >expected &&
>      ++         git -C import log --format="%ad %s" --all >actual &&
>       +         test_cmp expected actual &&
>      -+         test_cmp first-parent-export first-parent-reverse-export
>      ++         test_line_count = 4 actual
>       + )
>       +'
>       +
>
>
>  builtin/fast-export.c  | 40 ++++------------------------------------
>  t/t9350-fast-export.sh | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 36 insertions(+), 36 deletions(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 8e2caf72819..cb1d6473f12 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -107,18 +107,6 @@ static int parse_opt_reencode_mode(const struct option *opt,
>
>  static struct decoration idnums;
>  static uint32_t last_idnum;
> -
> -static int has_unshown_parent(struct commit *commit)
> -{
> -       struct commit_list *parent;
> -
> -       for (parent = commit->parents; parent; parent = parent->next)
> -               if (!(parent->item->object.flags & SHOWN) &&
> -                   !(parent->item->object.flags & UNINTERESTING))
> -                       return 1;
> -       return 0;
> -}
> -
>  struct anonymized_entry {
>         struct hashmap_entry hash;
>         const char *anon;
> @@ -752,20 +740,6 @@ static char *anonymize_tag(void *data)
>         return strbuf_detach(&out, NULL);
>  }
>
> -static void handle_tail(struct object_array *commits, struct rev_info *revs,
> -                       struct string_list *paths_of_changed_objects)
> -{
> -       struct commit *commit;
> -       while (commits->nr) {
> -               commit = (struct commit *)object_array_pop(commits);
> -               if (has_unshown_parent(commit)) {
> -                       /* Queue again, to be handled later */
> -                       add_object_array(&commit->object, NULL, commits);
> -                       return;
> -               }
> -               handle_commit(commit, revs, paths_of_changed_objects);
> -       }
> -}
>
>  static void handle_tag(const char *name, struct tag *tag)
>  {
> @@ -1185,7 +1159,6 @@ static int parse_opt_anonymize_map(const struct option *opt,
>  int cmd_fast_export(int argc, const char **argv, const char *prefix)
>  {
>         struct rev_info revs;
> -       struct object_array commits = OBJECT_ARRAY_INIT;
>         struct commit *commit;
>         char *export_filename = NULL,
>              *import_filename = NULL,
> @@ -1283,18 +1256,13 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>
>         if (prepare_revision_walk(&revs))
>                 die("revision walk setup failed");
> +
> +       revs.reverse = 1;
>         revs.diffopt.format_callback = show_filemodify;
>         revs.diffopt.format_callback_data = &paths_of_changed_objects;
>         revs.diffopt.flags.recursive = 1;
> -       while ((commit = get_revision(&revs))) {
> -               if (has_unshown_parent(commit)) {
> -                       add_object_array(&commit->object, NULL, &commits);
> -               }
> -               else {
> -                       handle_commit(commit, &revs, &paths_of_changed_objects);
> -                       handle_tail(&commits, &revs, &paths_of_changed_objects);
> -               }
> -       }
> +       while ((commit = get_revision(&revs)))
> +               handle_commit(commit, &revs, &paths_of_changed_objects);
>
>         handle_tags_and_duplicates(&extra_refs);
>         handle_tags_and_duplicates(&tag_refs);
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index 409b48e2442..7b7a18dd2c1 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -750,4 +750,36 @@ test_expect_success 'merge commit gets exported with --import-marks' '
>         )
>  '
>
> +
> +test_expect_success 'fast-export --first-parent outputs all revisions output by revision walk' '
> +       git init first-parent &&
> +       (
> +               cd first-parent &&
> +               test_commit A &&
> +               git checkout -b topic1 &&
> +               test_commit B &&
> +               git checkout main &&
> +               git merge --no-ff topic1 &&
> +
> +               git checkout -b topic2 &&
> +               test_commit C &&
> +               git checkout main &&
> +               git merge --no-ff topic2 &&
> +
> +               test_commit D &&
> +
> +               git fast-export main -- --first-parent >first-parent-export &&
> +               git fast-export main -- --first-parent --reverse >first-parent-reverse-export &&
> +               test_cmp first-parent-export first-parent-reverse-export &&
> +
> +               git init import &&
> +               git -C import fast-import <first-parent-export &&
> +
> +               git log --format="%ad %s" --first-parent main >expected &&
> +               git -C import log --format="%ad %s" --all >actual &&
> +               test_cmp expected actual &&
> +               test_line_count = 4 actual
> +       )
> +'
> +
>  test_done
>
> base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc

This version looks good to me:

Reviewed-by: Elijah Newren <newren@gmail.com>

  reply	other threads:[~2021-12-21 18:47 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-23 11:28 [PATCH] fast-export: fix surprising behavior with --first-parent William Sprent via GitGitGadget
2021-11-23 13:07 ` Ævar Arnfjörð Bjarmason
2021-11-24 11:57   ` William Sprent
2021-11-23 19:14 ` Elijah Newren
2021-11-24 13:05   ` William Sprent
2021-11-24  0:41 ` Junio C Hamano
2021-11-24 13:05   ` William Sprent
2021-12-09  8:13 ` [PATCH v2] " William Sprent via GitGitGadget
2021-12-10  3:48   ` Elijah Newren
2021-12-10 21:55     ` Junio C Hamano
2021-12-10 22:02       ` Elijah Newren
2021-12-13 15:09     ` William Sprent
2021-12-14  0:31       ` Elijah Newren
2021-12-14 13:11         ` William Sprent
2021-12-16 16:23   ` [PATCH v3] " William Sprent via GitGitGadget
2021-12-21 18:47     ` Elijah Newren [this message]
2021-12-21 20:50       ` Junio C Hamano
2021-12-22  8:38         ` William Sprent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BF7Rg3NH3UBnyzD6c1vuFh+FNrNc8eRvUffqmaLppUPkw@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=williams@unity3d.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).