git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
Date: Wed, 11 Nov 2020 10:16:40 -0800	[thread overview]
Message-ID: <CABPp-BFRPuxLYz_n6jbr=j7Gu1GhsV95nKPE1=HxUcsvimrz0g@mail.gmail.com> (raw)
In-Reply-To: <94bf9b69-5d13-c914-fb1a-bce912018a63@gmail.com>

On Wed, Nov 11, 2020 at 7:26 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > +static void setup_path_info(struct merge_options *opt,
> > +                         struct string_list_item *result,
> > +                         const char *current_dir_name,
> > +                         int current_dir_name_len,
> > +                         char *fullpath, /* we'll take over ownership */
> > +                         struct name_entry *names,
> > +                         struct name_entry *merged_version,
> > +                         unsigned is_null,     /* boolean */
> > +                         unsigned df_conflict, /* boolean */
> > +                         unsigned filemask,
> > +                         unsigned dirmask,
> > +                         int resolved          /* boolean */)
> > +{
> > +     struct conflict_info *path_info;
>
> In addition to my concerns below about 'conflict_info' versus
> 'merged_info', I was doubly confused that 'result' in the parameter
> list is given a variable named 'pi' for "path info" and result->util
> eventually is equal to this path_info. What if we renamed 'result'
> to 'pi' for "path info" here, then operated on 'pi->util' in this
> method?

result->util (or pi->util if you rename) is void *, making it hard to
operate on; you'd have to typecast at every usage.  Since it is used a
*lot*, it makes sense to have a typed pointer, and then just set
result->util to a copy of that value at the end.  That is what
path_info is for.

>
> > +     path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> > +                                       sizeof(struct conflict_info));
>
> Hm. I'm happy to have a `struct merged_info *` pointing to a
> `struct conflict_info`, but the opposite seems very dangerous.

Yeah, this is perhaps the scariest bit, and if it were a side data
structure rather than the fundamental main one that was central to the
algorithm, then safety would trump performance concerns.  But since it
is the main data structure and likely the biggest (once you count the
various copies for each relevant path), then it might be worth the
extra care needed to shave off the extra memory.  Maybe we can still
tweak things to get some safety back without killing performance so
let me consider each of your suggestions/questions.

If I define it as a merged_info*, the compiler will only let me modify
fields within the merged_info portion of the struct.  Should I
typecast every line that touches the bits in the resolved==0 path
where I need to set fields within the conflict_info portion?
Alternatively, would a code flow like the following make you happier?

    struct conflict_info *ci = NULL;
    struct merge_info *mi = xcalloc(...);
    result->util = mi;
    /* Operate on mi */
    ...
    if (resolved)
      return;
   ci = mi;
   /* Operate on ci */
   ...

In either case, the returned item has potentially different sizes, so
the caller will still have to take care so I'm not sure how much extra
this structure within setup_path_info() buys us.

> Perhaps we should always use sizeof(struct conflict_info)?

We could do that; it'd certainly waste memory as I expect many entries
to be unmodified (on one or both sides of history).  But I'd way
rather go this route than splitting or re-arranging this data
structure.

> We can use path_info->merged.clean to detect whether the rest of
> the data is worth looking at. (Or, in your case, whether or not
> it is allocated.)

ci->merged.clean is used to determine whether to look at the rest of
the data, yes -- and that's an enforced assumption throughout the code
(as alluded to by the comment in the merge_options_internal data
structure that "paths" maps pathanemes to merge_info and conflict_info
types).  However, that is not quite the same as using the clean bit to
determine if more data is allocated; something can be allocated as a
conflict_info rather than a merged_info due to both sides making
modifying the same path, but then a threeway content merge comes back
clean and ci->merged.clean is updated from 0 to 1.  The extra data
remains allocated, but nothing in the algorithm ever needs to use
anything outside the merged bits for that path again.  (Actually, let
me state that more forcefully: nothing is *allowed* to look outside
the merged bits for that path once the clean bit is updated to 1).

> I imagine that in a large repo we will need many of these structs,
> but very few of them will actually need to be conflicts, so using
> 'struct conflict_info' always will lead to memory bloat. But in
> that case, would we not be better off with an array instead of a
> scattering of data across the heap?

Not sure what you're trying to solve here.  Putting them in an array
would mean copying every single one of them every time the array is
resized.  It would also make insertion or deletion very expensive.
And it'd prevent O(1) lookup.  It'd be a horrible data structure all
around.  Maybe you're assuming you know exactly how many entries you
need and what they are before the merge algorithm starts?  I don't.
In fact, I can't even give a good magnitude approximation of how many
it'll be before a merge starts.  (Even if you assume it's a case where
you have an index loaded and that index is related to the merge being
done, the number can be and often is much smaller than the number of
entries in the index.  And just to cover the extremes, in unusual
cases the number might be much larger than the number of index entries
if the merge base and side being merged in has far more paths).

This was the whole point of the strmap API[1] I recently added --
provide a hashmap specialized for the case where the key is a string.
That way I get fast lookup, and relatively fast resize as the hash
only contains pointers to the values, not a copy of the values.

Is your concern that allocating many small structs is more expensive
than allocating a huge block of them?  If so, yes that matters, but
see the mem_pool related patches of the strmap API[1].

[1] https://lore.kernel.org/git/pull.835.v5.git.git.1604622298.gitgitgadget@gmail.com/

> Perhaps 'struct conflict_info' shouldn't contain a 'struct merged_info'
> and instead be just the "extra" data. Then we could have a contiguous
> array of 'struct merged_info' values for most of the paths, but heap
> pointers for 'struct conflict_info' as necessary.
>
> It's also true that I haven't fully formed a mental model for how these
> are used in your algorithm, so I'll keep reading.

I don't understand how contiguous arrays are practical or desirable
(I'm close to saying they're not possible, but one could employ some
extremes to get them, as mentioned above).

I could possibly have two strmaps; one mapping paths to a merge_info,
and another (with fewer entries) mapping paths to a conflict_info.
Seems like a royal pain, and would make for some pretty ugly code (I
have other places that had to use two strmaps and I've hated it every
time -- but those were cases of strmaps that were used much, much less
than the "paths" one).  Might also slightly hurt perf

> > +     path_info->merged.directory_name = current_dir_name;
> > +     path_info->merged.basename_offset = current_dir_name_len;
> > +     path_info->merged.clean = !!resolved;
> > +     if (resolved) {
> > +             path_info->merged.result.mode = merged_version->mode;
> > +             oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> > +             path_info->merged.is_null = !!is_null;
> > +     } else {
> > +             int i;
> > +
> > +             for (i = 0; i < 3; i++) {
> > +                     path_info->pathnames[i] = fullpath;
> > +                     path_info->stages[i].mode = names[i].mode;
> > +                     oidcpy(&path_info->stages[i].oid, &names[i].oid);
> > +             }
> > +             path_info->filemask = filemask;
> > +             path_info->dirmask = dirmask;
> > +             path_info->df_conflict = !!df_conflict;
> > +     }
> > +     strmap_put(&opt->priv->paths, fullpath, path_info);
> > +     result->string = fullpath;
> > +     result->util = path_info;
>
> This is set in all cases, so should we use it everywhere? Naturally,
> there might be a cost to the extra pointer indirection, so maybe we
> create a 'struct conflict_info *util' to operate on during this
> method, but set 'result->util = util' right after allocating so we
> know how it should behave?

result->util is void*, so it's not just an extra pointer indirection,
it's also the need to cast it to the appropriate type every time you
want to use it.  It's easier to have that done via another copy of the
pointer with the correct type, which is the reason for path_info.  So,
essentially, I did use util everywhere, it's just that I spelled it as
"path_info".  If I had named "path_info" "util" as you suggest,
wouldn't everyone be annoyed that I used a lame name that didn't name
the variable's purpose?

Perhaps I should just add a comment saying that path_util is a typed
alias/copy of result->util when I define it?

> > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> >        */
> >       struct merge_options *opt = info->data;
> >       struct merge_options_internal *opti = opt->priv;
> > -     struct conflict_info *ci;
> > +     struct string_list_item pi;  /* Path Info */
> > +     struct conflict_info *ci; /* pi.util when there's a conflict */

Perhaps here I should mention that ci is just a typed copy of pi.util
(since pi.util is a void*).

> ...
>
> > +     setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> > +                     names, NULL, 0, df_conflict, filemask, dirmask, 0);
> > +     ci = pi.util;
>
> Here is the use of 'pi' that I was talking about earlier.

...although, to be fair, I don't actually have all that many uses of
ci (at least not anymore) in this function.  So maybe typecasting
pi.util each of the three-or-so times it is used isn't so bad?

  reply	other threads:[~2020-11-11 18:16 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
2020-11-06 22:05   ` Jonathan Tan
2020-11-06 22:45     ` Elijah Newren
2020-11-09 20:55       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
2020-11-11 13:52   ` Derrick Stolee
2020-11-11 16:22     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
2020-11-11 13:54   ` Derrick Stolee
2020-11-11 16:47     ` Elijah Newren
2020-11-11 16:51       ` Derrick Stolee
2020-11-11 17:03         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
2020-11-11 13:58   ` Derrick Stolee
2020-11-11 17:07     ` Elijah Newren
2020-11-11 17:10       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
2020-11-06 22:19   ` Jonathan Tan
2020-11-06 23:10     ` Elijah Newren
2020-11-09 20:59       ` Jonathan Tan
2020-11-11 14:38   ` Derrick Stolee
2020-11-11 17:02     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
2020-11-11 14:51   ` Derrick Stolee
2020-11-11 17:13     ` Elijah Newren
2020-11-11 17:21       ` Eric Sunshine
2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
2020-11-06 22:52   ` Jonathan Tan
2020-11-06 23:41     ` Elijah Newren
2020-11-09 22:04       ` Jonathan Tan
2020-11-09 23:05         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
2020-11-06 22:58   ` Jonathan Tan
2020-11-07  0:26     ` Elijah Newren
2020-11-09 22:09       ` Jonathan Tan
2020-11-09 23:08         ` Elijah Newren
2020-11-11 15:26   ` Derrick Stolee
2020-11-11 18:16     ` Elijah Newren [this message]
2020-11-11 22:06       ` Elijah Newren
2020-11-12 18:23         ` Derrick Stolee
2020-11-12 18:39       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
2020-11-11 15:31   ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
2020-11-11 19:51   ` Jonathan Tan
2020-11-12  1:48     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
2020-11-11 16:09   ` Derrick Stolee
2020-11-11 18:58     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
2020-11-11 20:01   ` Jonathan Tan
2020-11-11 20:24     ` Elijah Newren
2020-11-12 20:39       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
2020-11-11 20:47   ` Jonathan Tan
2020-11-11 21:21     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
2020-11-12 20:15   ` Jonathan Tan
2020-11-12 22:30     ` Elijah Newren
2020-11-24 20:19       ` Elijah Newren
2020-11-25  2:07         ` Jonathan Tan
2020-11-26 18:13           ` Elijah Newren
2020-11-30 18:41             ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
2020-11-03 16:36   ` Elijah Newren
2020-11-07  6:06     ` Elijah Newren
2020-11-07 15:02       ` Derrick Stolee
2020-11-07 19:39         ` Elijah Newren
2020-11-09 12:30           ` Derrick Stolee
2020-11-09 17:13             ` Elijah Newren
2020-11-09 19:51               ` Derrick Stolee
2020-11-09 22:44                 ` Elijah Newren
2020-11-11 17:08 ` Derrick Stolee
2020-11-11 18:35   ` Elijah Newren
2020-11-11 20:48     ` Derrick Stolee
2020-11-11 21:18       ` Elijah Newren
  -- strict thread matches above, loose matches on Subject: below --
2020-11-29  7:43 [PATCH " Elijah Newren via GitGitGadget
2020-12-04 20:47 ` [PATCH v2 " Elijah Newren via GitGitGadget
2020-12-04 20:47   ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BFRPuxLYz_n6jbr=j7Gu1GhsV95nKPE1=HxUcsvimrz0g@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).