git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: newren@gmail.com
Cc: git@vger.kernel.org, Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
Date: Thu, 12 Nov 2020 12:15:24 -0800	[thread overview]
Message-ID: <20201112201524.3438029-1-jonathantanmy@google.com> (raw)
In-Reply-To: <20201102204344.342633-16-newren@gmail.com>

Firstly, from [1]:

> Thanks for the reviews!  I was hoping to see some comments on patch
> 15, as it's possibly the gnarliest.  It's a relatively straightforward
> algorithm, just lots of bookkeeping.

I was planning to send this out yesterday, but couldn't finish it. :-P
Indeed, a lot of things to think about.

[1] https://lore.kernel.org/git/CABPp-BFgQX6Ash03x7z+RfE3ytbw3x0DzDSBrGddgMr_soODoA@mail.gmail.com/

[snip commit message]

Thanks for the thorough explanation.

> @@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
>  
>  struct directory_versions {
>  	struct string_list versions;
> +	struct string_list offsets;

Looking below (and at the explanation in the commit message), this is a
mapping from full paths to integers casted to the pointer type.

> +	const char *last_directory;
> +	unsigned last_directory_len;

Is this just the last entry in "versions"?

>  static void write_tree(struct object_id *result_oid,
> @@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
>  		/* nothing to record */
>  		return;
>  
> +	/*
> +	 * Note: write_completed_directories() already added
> +	 * entries for directories to dir_metadata->versions,
> +	 * so no need to handle ci->filemask == 0 again.
> +	 */
> +	if (!ci->merged.clean && !ci->filemask)
> +		return;
> +
>  	basename = path + ci->merged.basename_offset;
>  	assert(strchr(basename, '/') == NULL);
>  	string_list_append(&dir_metadata->versions,
>  			   basename)->util = &ci->merged.result;
>  }

Conceptually, I can see how the algorithm below inserts directories, but
I don't understand the significance of "!ci->merged.clean" in the change
above.

> +static void write_completed_directories(struct merge_options *opt,
> +					const char *new_directory_name,
> +					struct directory_versions *info)
> +{
> +	const char *prev_dir;
> +	struct merged_info *dir_info = NULL;
> +	unsigned int offset;
> +	int wrote_a_new_tree = 0;
> +
> +	if (new_directory_name == info->last_directory)
> +		return;

Pointer equality is OK here presumably because of the string interning
of directory names.

I'm starting to think that it might be too difficult to keep track of
where strings are interned. Maybe there should be a data structure
containing all interned strings, and make the path a struct or something
like that (to clearly indicate that the string inside comes from the
interned string data structure).

> +	/*
> +	 * If we are just starting (last_directory is NULL), or last_directory
> +	 * is a prefix of the current directory, then we can just update
> +	 * last_directory and record the offset where we started this directory.
> +	 */
> +	if (info->last_directory == NULL ||
> +	    !strncmp(new_directory_name, info->last_directory,
> +		     info->last_directory_len)) {

Git has starts_with() for prefix checking. (May not be as optimized as
this one, though.)

> +		uintptr_t offset = info->versions.nr;
> +
> +		info->last_directory = new_directory_name;
> +		info->last_directory_len = strlen(info->last_directory);
> +		string_list_append(&info->offsets,
> +				   info->last_directory)->util = (void*)offset;
> +		return;
> +	}

Due to the way this is sorted, there might be a jump of 2 or more
directories. (The commit message also provides such an example - from ""
to "src/moduleB", without going through "src".)

> +	/*
> +	 * At this point, ne (next entry) is within a different directory
> +	 * than the last entry, so we need to create a tree object for all
> +	 * the entries in info->versions that are under info->last_directory.
> +	 */

There's no "ne" below.

> +	dir_info = strmap_get(&opt->priv->paths, info->last_directory);
> +	assert(dir_info);
> +	offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
> +	if (offset == info->versions.nr) {
> +		dir_info->is_null = 1;
> +	} else {
> +		dir_info->result.mode = S_IFDIR;
> +		write_tree(&dir_info->result.oid, &info->versions, offset);
> +		wrote_a_new_tree = 1;
> +	}

I was trying to figure out the cases in which offset would be
info->versions.nr - if such a case existed, and if yes, would it be
incorrect to skip creating such a tree because presumably this offset
exists in info->offsets for a reason. Do you know in which situation
offset would equal info->versions.nr?

> +	/*
> +	 * We've now used several entries from info->versions and one entry
> +	 * from info->offsets, so we get rid of those values.
> +	 */
> +	info->offsets.nr--;
> +	info->versions.nr = offset;

OK.

> +	/*
> +	 * Now we've got an OID for last_directory in dir_info.  We need to
> +	 * add it to info->versions for it to be part of the computation of
> +	 * its parent directories' OID.  But first, we have to find out what
> +	 * its' parent name was and whether that matches the previous
> +	 * info->offsets or we need to set up a new one.
> +	 */
> +	prev_dir = info->offsets.nr == 0 ? NULL :
> +		   info->offsets.items[info->offsets.nr-1].string;
> +	if (new_directory_name != prev_dir) {
> +		uintptr_t c = info->versions.nr;
> +		string_list_append(&info->offsets,
> +				   new_directory_name)->util = (void*)c;
> +	}

Because of the possible jump of 2 or more directories that I mentioned
earlier, there may be gaps in the offsets. So it makes sense that we
sometimes need to insert an intermediate one.

I wonder if the code would be clearer if we had explicit "begin tree"
and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
and LOFS_END_TREE). Here we have "end tree" (because of the way the
entries were sorted) but not "begin tree". If we had "begin tree", we
probably would be able to create the necessary offsets in a loop at that
stage, and the reasoning about the contents of the offsets would not be
so complicated.

If we really only want one side (i.e. you don't want to introduce a
synthetic entry just to mark the end or the beginning), then my personal
experience is that having the "begin" side is easier to understand, as
the state is more natural and easier to reason about. (Unlike here,
where there could be gaps in the offsets and the reader has to
understand that the gaps will be filled just in time.) But that may just
be my own experience.

> +	/*
> +	 * Okay, finally record OID for last_directory in info->versions,
> +	 * and update last_directory.
> +	 */
> +	if (wrote_a_new_tree) {
> +		const char *dir_name = strrchr(info->last_directory, '/');
> +		dir_name = dir_name ? dir_name+1 : info->last_directory;
> +		string_list_append(&info->versions, dir_name)->util = dir_info;
> +	}
> +	info->last_directory = new_directory_name;
> +	info->last_directory_len = strlen(info->last_directory);
> +}

OK - several entries in info->versions were deleted earlier (through
info->versions.nr = offset), and we add one here to represent the tree
containing all those deleted versions.

> @@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
>  		 */
>  		struct conflict_info *ci = entry->util;
>  
> +		write_completed_directories(opt, ci->merged.directory_name,
> +					    &dir_metadata);
>  		if (ci->merged.clean)
>  			record_entry_for_tree(&dir_metadata, path, ci);
>  		else
>  			process_entry(opt, path, ci, &dir_metadata);
>  	}

Trying to make sense of this: we pass in the directory name of the
current entry so that if the last directory is *not* a prefix of the
current directory (so we either went up a directory or went sideways),
then we write a tree (unless offset == info->versions.nr, which as I
stated above, I still don't fully understand - I thought we would always
have to write a tree). So maybe the name of the function should be
"write_completed_directory" (and document it as "write a tree if
???"), since we write at most one.

In this kind of algorithm (where intermediate accumulated results are
being written), there needs to be a last write after the loop that
writes whatever's left in the accumulation buffer. I do see it below
("write_tree"), so that's good.

> -	/*
> -	 * TODO: We can't actually write a tree yet, because dir_metadata just
> -	 * contains all basenames of all files throughout the tree with their
> -	 * mode and hash.  Not only is that a nonsensical tree, it will have
> -	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
> -	 */
> -	die("Not yet implemented; need to process subtrees separately");
> +	if (dir_metadata.offsets.nr != 1 ||
> +	    (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
> +		printf("dir_metadata.offsets.nr = %d (should be 1)\n",
> +		       dir_metadata.offsets.nr);
> +		printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
> +		       (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
> +		fflush(stdout);
> +		BUG("dir_metadata accounting completely off; shouldn't happen");
> +	}

Sanity check, OK.

[snip rest]

In summary, I think that the code would be easier to understand (for
everyone) if there were both BEGIN_TREE and END_TREE entries. And for me
personally, once the offset == info->versions.nr part is clarified
(perhaps there is something obvious that I'm missing).

  reply	other threads:[~2020-11-12 20:15 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
2020-11-06 22:05   ` Jonathan Tan
2020-11-06 22:45     ` Elijah Newren
2020-11-09 20:55       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
2020-11-11 13:52   ` Derrick Stolee
2020-11-11 16:22     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
2020-11-11 13:54   ` Derrick Stolee
2020-11-11 16:47     ` Elijah Newren
2020-11-11 16:51       ` Derrick Stolee
2020-11-11 17:03         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
2020-11-11 13:58   ` Derrick Stolee
2020-11-11 17:07     ` Elijah Newren
2020-11-11 17:10       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
2020-11-06 22:19   ` Jonathan Tan
2020-11-06 23:10     ` Elijah Newren
2020-11-09 20:59       ` Jonathan Tan
2020-11-11 14:38   ` Derrick Stolee
2020-11-11 17:02     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
2020-11-11 14:51   ` Derrick Stolee
2020-11-11 17:13     ` Elijah Newren
2020-11-11 17:21       ` Eric Sunshine
2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
2020-11-06 22:52   ` Jonathan Tan
2020-11-06 23:41     ` Elijah Newren
2020-11-09 22:04       ` Jonathan Tan
2020-11-09 23:05         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
2020-11-06 22:58   ` Jonathan Tan
2020-11-07  0:26     ` Elijah Newren
2020-11-09 22:09       ` Jonathan Tan
2020-11-09 23:08         ` Elijah Newren
2020-11-11 15:26   ` Derrick Stolee
2020-11-11 18:16     ` Elijah Newren
2020-11-11 22:06       ` Elijah Newren
2020-11-12 18:23         ` Derrick Stolee
2020-11-12 18:39       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
2020-11-11 15:31   ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
2020-11-11 19:51   ` Jonathan Tan
2020-11-12  1:48     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
2020-11-11 16:09   ` Derrick Stolee
2020-11-11 18:58     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
2020-11-11 20:01   ` Jonathan Tan
2020-11-11 20:24     ` Elijah Newren
2020-11-12 20:39       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
2020-11-11 20:47   ` Jonathan Tan
2020-11-11 21:21     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
2020-11-12 20:15   ` Jonathan Tan [this message]
2020-11-12 22:30     ` Elijah Newren
2020-11-24 20:19       ` Elijah Newren
2020-11-25  2:07         ` Jonathan Tan
2020-11-26 18:13           ` Elijah Newren
2020-11-30 18:41             ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
2020-11-03 16:36   ` Elijah Newren
2020-11-07  6:06     ` Elijah Newren
2020-11-07 15:02       ` Derrick Stolee
2020-11-07 19:39         ` Elijah Newren
2020-11-09 12:30           ` Derrick Stolee
2020-11-09 17:13             ` Elijah Newren
2020-11-09 19:51               ` Derrick Stolee
2020-11-09 22:44                 ` Elijah Newren
2020-11-11 17:08 ` Derrick Stolee
2020-11-11 18:35   ` Elijah Newren
2020-11-11 20:48     ` Derrick Stolee
2020-11-11 21:18       ` Elijah Newren
  -- strict thread matches above, loose matches on Subject: below --
2020-11-29  7:43 [PATCH " Elijah Newren via GitGitGadget
2020-12-04 20:47 ` [PATCH v2 " Elijah Newren via GitGitGadget
2020-12-04 20:48   ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201112201524.3438029-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).