git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
@ 2019-09-18  1:56 Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 1/3] diff: use the given repo at diff_populate_filespec() Matheus Tavares
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Matheus Tavares @ 2019-09-18  1:56 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder

Make git-grep --recurse-submodules stop adding subrepos to the in-memory
alternates list and, instead, pass a reference to the subrepo struct
down to the threads.

This series is based on ew/hashmap (it uses the added container_of
macro).

There're some possibly needed changes I'm still unsure about. All
feedback will be highly appreciated:

- textconv cache is written to the_repository's object database even for
  submodules. Should it perhaps be written to submodules' odb instead?

- Considering the following call chain: grep_source_load_driver() >
  userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
  prepare_attr_stack() > bootstrap_attr_stack():

  * The last function tries to read the attributes from the
    .gitattributes and .git/info/attributes files of the_repository.
    However, for paths inside the submodule, shouldn't it try to read
    these files from the submodule?

  * This function will also call: read_attr() > read_attr_from_index() >
    read_blob_data_from_index() which might, in turn, call
    read_object_file(). Shouldn't we pass the subrepo to it so that it
    can call repo_read_object_file()? (Again, for paths inside the
    submodule, read_object_file() won't be able to find the object as
    we won't be adding to alternates anymore.)

Matheus Tavares (3):
  diff: use the given repo at diff_populate_filespec()
  object: allow parse_object_or_die() to handle any repo
  grep: don't add submodules to the alternates list

 builtin/grep.c  | 82 +++++++++++++++++++++++++++++++++++--------------
 builtin/prune.c |  4 +--
 bundle.c        | 13 +++++---
 diff.c          |  2 +-
 grep.c          | 26 ++++++++--------
 object.c        |  5 +--
 object.h        |  4 ++-
 pack-bitmap.c   |  5 +--
 reachable.c     |  4 +--
 upload-pack.c   |  2 +-
 10 files changed, 94 insertions(+), 53 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/3] diff: use the given repo at diff_populate_filespec()
  2019-09-18  1:56 [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Matheus Tavares
@ 2019-09-18  1:56 ` Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 2/3] object: allow parse_object_or_die() to handle any repo Matheus Tavares
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Matheus Tavares @ 2019-09-18  1:56 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, brian m. carlson, Junio C Hamano, Eric Wong

diff_populate_filespec() accepts an struct repository "r" as parameter
but implicitly uses the_repository when calling read_object_file(). To
be more consistent, replace this call with
repo_read_object_file(r, ...).

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 diff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diff.c b/diff.c
index 663b5d01f8..c743e4ce50 100644
--- a/diff.c
+++ b/diff.c
@@ -4044,7 +4044,7 @@ int diff_populate_filespec(struct repository *r,
 				return 0;
 			}
 		}
-		s->data = read_object_file(&s->oid, &type, &s->size);
+		s->data = repo_read_object_file(r, &s->oid, &type, &s->size);
 		if (!s->data)
 			die("unable to read %s", oid_to_hex(&s->oid));
 		s->should_free = 1;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/3] object: allow parse_object_or_die() to handle any repo
  2019-09-18  1:56 [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 1/3] diff: use the given repo at diff_populate_filespec() Matheus Tavares
@ 2019-09-18  1:56 ` Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 3/3] grep: don't add submodules to the alternates list Matheus Tavares
  2019-09-18 19:55 ` [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Junio C Hamano
  3 siblings, 0 replies; 13+ messages in thread
From: Matheus Tavares @ 2019-09-18  1:56 UTC (permalink / raw)
  To: git
  Cc: Jonathan Nieder, Brandon Williams, Junio C Hamano, Jeff King,
	brian m. carlson

parse_object_or_die() was added before parse_object() learned to handle
arbitrary repositories. Now that it did, let's also bring this behavior
improvement to the former, adding a struct repository argument. Also
adjust the callers and, while we are here, avoid using `oid_to_hex(oid)`
for the third parameter as it's the default when NULL is given.

This will be used in the following patch, to allow git-grep to parse
submodule's objects.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 builtin/grep.c  |  4 ++--
 builtin/prune.c |  4 ++--
 bundle.c        | 13 ++++++++-----
 object.c        |  5 +++--
 object.h        |  4 +++-
 pack-bitmap.c   |  5 +++--
 reachable.c     |  4 ++--
 upload-pack.c   |  2 +-
 8 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 2699001fbd..73ef00c426 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -455,7 +455,7 @@ static int grep_submodule(struct grep_opt *opt,
 		unsigned long size;
 		struct strbuf base = STRBUF_INIT;
 
-		object = parse_object_or_die(oid, oid_to_hex(oid));
+		object = parse_object_or_die(the_repository, oid, NULL);
 
 		grep_read_lock();
 		data = read_object_with_reference(&subrepo,
@@ -1037,7 +1037,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			break;
 		}
 
-		object = parse_object_or_die(&oid, arg);
+		object = parse_object_or_die(the_repository, &oid, arg);
 		if (!seen_dashdash)
 			verify_non_filename(prefix, arg);
 		add_object_array_with_path(object, arg, &list, oc.mode, oc.path);
diff --git a/builtin/prune.c b/builtin/prune.c
index 2b76872ad2..996defe284 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -154,8 +154,8 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 		const char *name = *argv++;
 
 		if (!get_oid(name, &oid)) {
-			struct object *object = parse_object_or_die(&oid,
-								    name);
+			struct object *object =
+				parse_object_or_die(the_repository, &oid, name);
 			add_pending_object(&revs, object, "");
 		}
 		else
diff --git a/bundle.c b/bundle.c
index b5d21cd80f..67552786a5 100644
--- a/bundle.c
+++ b/bundle.c
@@ -315,14 +315,16 @@ static int compute_and_write_prerequisites(int bundle_fd,
 		if (buf.len > 0 && buf.buf[0] == '-') {
 			write_or_die(bundle_fd, buf.buf, buf.len);
 			if (!get_oid_hex(buf.buf + 1, &oid)) {
-				struct object *object = parse_object_or_die(&oid,
-									    buf.buf);
+				struct object *object =
+					parse_object_or_die(the_repository,
+							    &oid, buf.buf);
 				object->flags |= UNINTERESTING;
 				add_pending_object(revs, object, buf.buf);
 			}
 		} else if (!get_oid_hex(buf.buf, &oid)) {
-			struct object *object = parse_object_or_die(&oid,
-								    buf.buf);
+			struct object *object =
+				parse_object_or_die(the_repository,
+						    &oid, buf.buf);
 			object->flags |= SHOWN;
 		}
 	}
@@ -406,7 +408,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 				 * end up triggering "empty bundle"
 				 * error.
 				 */
-				obj = parse_object_or_die(&oid, e->name);
+				obj = parse_object_or_die(the_repository, &oid,
+							  e->name);
 				obj->flags |= SHOWN;
 				add_pending_object(revs, obj, e->name);
 			}
diff --git a/object.c b/object.c
index 07bdd5b26e..54c4861f4c 100644
--- a/object.c
+++ b/object.c
@@ -237,10 +237,11 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id
 	return obj;
 }
 
-struct object *parse_object_or_die(const struct object_id *oid,
+struct object *parse_object_or_die(struct repository *r,
+				   const struct object_id *oid,
 				   const char *name)
 {
-	struct object *o = parse_object(the_repository, oid);
+	struct object *o = parse_object(r, oid);
 	if (o)
 		return o;
 
diff --git a/object.h b/object.h
index 0120892bbd..d8f02f53cd 100644
--- a/object.h
+++ b/object.h
@@ -134,7 +134,9 @@ struct object *parse_object(struct repository *r, const struct object_id *oid);
  * "name" parameter is not NULL, it is included in the error message
  * (otherwise, the hex object ID is given).
  */
-struct object *parse_object_or_die(const struct object_id *oid, const char *name);
+struct object *parse_object_or_die(struct repository *r,
+				   const struct object_id *oid,
+				   const char *name);
 
 /* Given the result of read_sha1_file(), returns the object after
  * parsing it.  eaten_p indicates if the object has a borrowed copy
diff --git a/pack-bitmap.c b/pack-bitmap.c
index ed2befaac6..383b2351db 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -699,7 +699,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 		struct object *object = revs->pending.objects[i].item;
 
 		if (object->type == OBJ_NONE)
-			parse_object_or_die(&object->oid, NULL);
+			parse_object_or_die(the_repository, &object->oid, NULL);
 
 		while (object->type == OBJ_TAG) {
 			struct tag *tag = (struct tag *) object;
@@ -711,7 +711,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 
 			if (!tag->tagged)
 				die("bad tag");
-			object = parse_object_or_die(&tag->tagged->oid, NULL);
+			object = parse_object_or_die(the_repository,
+						     &tag->tagged->oid, NULL);
 		}
 
 		if (object->flags & UNINTERESTING)
diff --git a/reachable.c b/reachable.c
index 8f50235b28..7c5f7a0ba0 100644
--- a/reachable.c
+++ b/reachable.c
@@ -37,7 +37,7 @@ static int add_one_ref(const char *path, const struct object_id *oid,
 		return 0;
 	}
 
-	object = parse_object_or_die(oid, path);
+	object = parse_object_or_die(the_repository, oid, path);
 	add_pending_object(revs, object, "");
 
 	return 0;
@@ -86,7 +86,7 @@ static void add_recent_object(const struct object_id *oid,
 	switch (type) {
 	case OBJ_TAG:
 	case OBJ_COMMIT:
-		obj = parse_object_or_die(oid, NULL);
+		obj = parse_object_or_die(the_repository, oid, NULL);
 		break;
 	case OBJ_TREE:
 		obj = (struct object *)lookup_tree(the_repository, oid);
diff --git a/upload-pack.c b/upload-pack.c
index 222cd3ad89..648ff3355b 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1220,7 +1220,7 @@ static int parse_want_ref(struct packet_writer *writer, const char *line,
 		item = string_list_append(wanted_refs, arg);
 		item->util = oiddup(&oid);
 
-		o = parse_object_or_die(&oid, arg);
+		o = parse_object_or_die(the_repository, &oid, arg);
 		if (!(o->flags & WANTED)) {
 			o->flags |= WANTED;
 			add_object_array(o, NULL, want_obj);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 3/3] grep: don't add submodules to the alternates list
  2019-09-18  1:56 [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 1/3] diff: use the given repo at diff_populate_filespec() Matheus Tavares
  2019-09-18  1:56 ` [RFC PATCH 2/3] object: allow parse_object_or_die() to handle any repo Matheus Tavares
@ 2019-09-18  1:56 ` Matheus Tavares
  2019-09-21 21:58   ` Brandon Williams
  2019-09-18 19:55 ` [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Junio C Hamano
  3 siblings, 1 reply; 13+ messages in thread
From: Matheus Tavares @ 2019-09-18  1:56 UTC (permalink / raw)
  To: git
  Cc: Jonathan Nieder, brian m. carlson,
	Nguyễn Thái Ngọc Duy, Brandon Williams,
	Junio C Hamano

When grepping with --recurse-submodules, the object directory of the
submodule is added to the in-memory alternates list. This makes git need
to watch out for more packfiles which might bring bad consequences for
memory and performance.

Now that raw_object_store is a member of the struct repository, it's
possible to use the submodules' instances directly, without the need to
add its odbs to the alternates list. So let's instead pass the subrepo
down to the threads and replace function calls which use
the_repository by default for their more general counterparts. Also,
submodule cleanup must be refactored as calling repo_clear() at the end
of grep_submodules(), might free the subrepo's resources before threads
have finished working on it. To do that, let's keep track of the
workers progress and only call repo_clear() when it's safe to do so.

This change will also help future patches to re-enable threads in
non-worktree grep as less mutual exclusions will be needed. E.g.
grep_submodule()'s call to parse_object_or_die() won't conflict with
other calls to the same function by worker threads (as they should be
referencing different 'struct repository's).

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 builtin/grep.c | 80 ++++++++++++++++++++++++++++++++++++--------------
 grep.c         | 26 ++++++++--------
 2 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 73ef00c426..0bb58c456b 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -43,6 +43,15 @@ static pthread_t *threads;
  */
 struct work_item {
 	struct grep_source source;
+
+	/*
+	 * Each worker thread is initialized with a 'struct grep_opt *opt' where
+	 * opt->repo points to the_repository. But a work item may refeer to a
+	 * subrepo. So the repository relative to each task is also passed to
+	 * the assigned thread and its opt->repo is updated when retrieving the
+	 * task.
+	 */
+	struct repository *repo;
 	char done;
 	struct strbuf out;
 };
@@ -65,6 +74,13 @@ static int todo_done;
 /* Has all work items been added? */
 static int all_work_added;
 
+/* Tracks the work progress in subrepos to correctly free them in the end */
+struct progress_in_subrepo {
+	struct repository subrepo;
+	int all_work_added;
+	int work_counter;
+};
+
 /* This lock protects all the variables above. */
 static pthread_mutex_t grep_mutex;
 
@@ -93,6 +109,8 @@ static int skip_first_line;
 
 static void add_work(struct grep_opt *opt, const struct grep_source *gs)
 {
+	struct progress_in_subrepo *p;
+
 	grep_lock();
 
 	while ((todo_end+1) % ARRAY_SIZE(todo) == todo_done) {
@@ -104,9 +122,15 @@ static void add_work(struct grep_opt *opt, const struct grep_source *gs)
 		grep_source_load_driver(&todo[todo_end].source,
 					opt->repo->index);
 	todo[todo_end].done = 0;
+	todo[todo_end].repo = opt->repo;
 	strbuf_reset(&todo[todo_end].out);
 	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
 
+	if (opt->repo != the_repository) {
+		p = container_of(opt->repo, struct progress_in_subrepo, subrepo);
+		p->work_counter++;
+	}
+
 	pthread_cond_signal(&cond_add);
 	grep_unlock();
 }
@@ -132,6 +156,7 @@ static struct work_item *get_work(void)
 
 static void work_done(struct work_item *w)
 {
+	struct progress_in_subrepo *p;
 	int old_done;
 
 	grep_lock();
@@ -156,6 +181,17 @@ static void work_done(struct work_item *w)
 
 			write_or_die(1, p, len);
 		}
+
+		if (w->repo != the_repository) {
+			p = container_of(w->repo, struct progress_in_subrepo,
+					 subrepo);
+			p->work_counter--;
+			if (p->work_counter == 0 && p->all_work_added) {
+				repo_clear(&p->subrepo);
+				free(p);
+			}
+		}
+
 		grep_source_clear(&w->source);
 	}
 
@@ -179,6 +215,7 @@ static void *run(void *arg)
 			break;
 
 		opt->output_priv = w;
+		opt->repo = w->repo;
 		hit |= grep_source(opt, &w->source);
 		grep_source_clear_data(&w->source);
 		work_done(w);
@@ -295,12 +332,14 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
 	return st;
 }
 
-static void *lock_and_read_oid_file(const struct object_id *oid, enum object_type *type, unsigned long *size)
+static void *lock_and_read_oid_file(struct repository *r,
+				    const struct object_id *oid,
+				    enum object_type *type, unsigned long *size)
 {
 	void *data;
 
 	grep_read_lock();
-	data = read_object_file(oid, type, size);
+	data = repo_read_object_file(r, oid, type, size);
 	grep_read_unlock();
 	return data;
 }
@@ -405,7 +444,8 @@ static int grep_submodule(struct grep_opt *opt,
 			  const struct object_id *oid,
 			  const char *filename, const char *path, int cached)
 {
-	struct repository subrepo;
+	struct progress_in_subrepo *p = xcalloc(1, sizeof(*p));
+	struct repository *subrepo = &p->subrepo;
 	struct repository *superproject = opt->repo;
 	const struct submodule *sub = submodule_from_path(superproject,
 							  &null_oid, path);
@@ -422,31 +462,21 @@ static int grep_submodule(struct grep_opt *opt,
 
 	if (!is_submodule_active(superproject, path)) {
 		grep_read_unlock();
+		free(p);
 		return 0;
 	}
 
-	if (repo_submodule_init(&subrepo, superproject, sub)) {
+	if (repo_submodule_init(subrepo, superproject, sub)) {
 		grep_read_unlock();
+		free(p);
 		return 0;
 	}
 
-	repo_read_gitmodules(&subrepo);
-
-	/*
-	 * NEEDSWORK: This adds the submodule's object directory to the list of
-	 * alternates for the single in-memory object store.  This has some bad
-	 * consequences for memory (processed objects will never be freed) and
-	 * performance (this increases the number of pack files git has to pay
-	 * attention to, to the sum of the number of pack files in all the
-	 * repositories processed so far).  This can be removed once the object
-	 * store is no longer global and instead is a member of the repository
-	 * object.
-	 */
-	add_to_alternates_memory(subrepo.objects->odb->path);
+	repo_read_gitmodules(subrepo);
 	grep_read_unlock();
 
 	memcpy(&subopt, opt, sizeof(subopt));
-	subopt.repo = &subrepo;
+	subopt.repo = subrepo;
 
 	if (oid) {
 		struct object *object;
@@ -455,10 +485,10 @@ static int grep_submodule(struct grep_opt *opt,
 		unsigned long size;
 		struct strbuf base = STRBUF_INIT;
 
-		object = parse_object_or_die(the_repository, oid, NULL);
+		object = parse_object_or_die(subrepo, oid, NULL);
 
 		grep_read_lock();
-		data = read_object_with_reference(&subrepo,
+		data = read_object_with_reference(subrepo,
 						  &object->oid, tree_type,
 						  &size, NULL);
 		grep_read_unlock();
@@ -478,7 +508,13 @@ static int grep_submodule(struct grep_opt *opt,
 		hit = grep_cache(&subopt, pathspec, cached);
 	}
 
-	repo_clear(&subrepo);
+	grep_lock();
+	p->all_work_added = 1;
+	if (p->work_counter == 0) {
+		repo_clear(&p->subrepo);
+		free(p);
+	}
+	grep_unlock();
 	return hit;
 }
 
@@ -587,7 +623,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			void *data;
 			unsigned long size;
 
-			data = lock_and_read_oid_file(&entry.oid, &type, &size);
+			data = lock_and_read_oid_file(repo, &entry.oid, &type, &size);
 			if (!data)
 				die(_("unable to read tree (%s)"),
 				    oid_to_hex(&entry.oid));
diff --git a/grep.c b/grep.c
index cd952ef5d3..0b3f38aaae 100644
--- a/grep.c
+++ b/grep.c
@@ -10,9 +10,8 @@
 #include "quote.h"
 #include "help.h"
 
-static int grep_source_load(struct grep_source *gs);
-static int grep_source_is_binary(struct grep_source *gs,
-				 struct index_state *istate);
+static int grep_source_load(struct repository *r, struct grep_source *gs);
+static int grep_source_is_binary(struct repository *r, struct grep_source *gs);
 
 static struct grep_opt grep_defaults;
 
@@ -1714,7 +1713,7 @@ static int fill_textconv_grep(struct repository *r,
 	size_t size;
 
 	if (!driver || !driver->textconv)
-		return grep_source_load(gs);
+		return grep_source_load(r, gs);
 
 	/*
 	 * The textconv interface is intimately tied to diff_filespecs, so we
@@ -1820,11 +1819,11 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 	if (!textconv) {
 		switch (opt->binary) {
 		case GREP_BINARY_DEFAULT:
-			if (grep_source_is_binary(gs, opt->repo->index))
+			if (grep_source_is_binary(opt->repo, gs))
 				binary_match_only = 1;
 			break;
 		case GREP_BINARY_NOMATCH:
-			if (grep_source_is_binary(gs, opt->repo->index))
+			if (grep_source_is_binary(opt->repo, gs))
 				return 0; /* Assume unmatch */
 			break;
 		case GREP_BINARY_TEXT:
@@ -2105,12 +2104,12 @@ void grep_source_clear_data(struct grep_source *gs)
 	}
 }
 
-static int grep_source_load_oid(struct grep_source *gs)
+static int grep_source_load_oid(struct repository *r, struct grep_source *gs)
 {
 	enum object_type type;
 
 	grep_read_lock();
-	gs->buf = read_object_file(gs->identifier, &type, &gs->size);
+	gs->buf = repo_read_object_file(r, gs->identifier, &type, &gs->size);
 	grep_read_unlock();
 
 	if (!gs->buf)
@@ -2154,7 +2153,7 @@ static int grep_source_load_file(struct grep_source *gs)
 	return 0;
 }
 
-static int grep_source_load(struct grep_source *gs)
+static int grep_source_load(struct repository *r, struct grep_source *gs)
 {
 	if (gs->buf)
 		return 0;
@@ -2163,7 +2162,7 @@ static int grep_source_load(struct grep_source *gs)
 	case GREP_SOURCE_FILE:
 		return grep_source_load_file(gs);
 	case GREP_SOURCE_OID:
-		return grep_source_load_oid(gs);
+		return grep_source_load_oid(r, gs);
 	case GREP_SOURCE_BUF:
 		return gs->buf ? 0 : -1;
 	}
@@ -2184,14 +2183,13 @@ void grep_source_load_driver(struct grep_source *gs,
 	grep_attr_unlock();
 }
 
-static int grep_source_is_binary(struct grep_source *gs,
-				 struct index_state *istate)
+static int grep_source_is_binary(struct repository *r, struct grep_source *gs)
 {
-	grep_source_load_driver(gs, istate);
+	grep_source_load_driver(gs, r->index);
 	if (gs->driver->binary != -1)
 		return gs->driver->binary;
 
-	if (!grep_source_load(gs))
+	if (!grep_source_load(r, gs))
 		return buffer_is_binary(gs->buf, gs->size);
 
 	return 0;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-18  1:56 [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Matheus Tavares
                   ` (2 preceding siblings ...)
  2019-09-18  1:56 ` [RFC PATCH 3/3] grep: don't add submodules to the alternates list Matheus Tavares
@ 2019-09-18 19:55 ` Junio C Hamano
  2019-09-19  5:18   ` Matheus Tavares Bernardino
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2019-09-18 19:55 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, Jonathan Nieder

Matheus Tavares <matheus.bernardino@usp.br> writes:

> Make git-grep --recurse-submodules stop adding subrepos to the in-memory
> alternates list and, instead, pass a reference to the subrepo struct
> down to the threads.

Nice.  This is done by updating all the codepaths used by grep to
use the lower-level helper functions that can take a repository
instance and/or an object store instance that is not the one tied to
the top-level repository?  Quite nice.

> - textconv cache is written to the_repository's object database even for
>   submodules. Should it perhaps be written to submodules' odb instead?

You mention "is written", but that is what happens upon a cache
miss.  Before the code notices a cache miss, it must be checking if
a cached result is available.  In which odb is it done?  Writing
that follow the miss should happen to the same one, or the cache is
not very effective.

Because you would want the cache to be effective, after running "git
grep --recurse-submodules" from the top-level, when you chdir down
to the submodule and say "git grep" to dig further, the answer to
your question is most likely "yes".

> - Considering the following call chain: grep_source_load_driver() >
>   userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
>   prepare_attr_stack() > bootstrap_attr_stack():
>
>   * The last function tries to read the attributes from the
>     .gitattributes and .git/info/attributes files of the_repository.
>     However, for paths inside the submodule, shouldn't it try to read
>     these files from the submodule?

Yes, I think all of those would have worked correctly if we forked a
separate "git grep" inside submodule repository, but in the rush to
"do everything in process", many things like these are not done
correctly.  As there is only one attribute cache IIUC, invalidating
the whole cache for the top-level and replacing it with the one for
a submodule, every time we cross the module boundary, would probably
have a negative effect on the performance, and I am not sure what
would happen if you run more than one threads working in different
repositories (i.e. top-level and submodules).

>   * This function will also call: read_attr() > read_attr_from_index() >
>     read_blob_data_from_index() which might, in turn, call
>     read_object_file(). Shouldn't we pass the subrepo to it so that it
>     can call repo_read_object_file()? (Again, for paths inside the
>     submodule, read_object_file() won't be able to find the object as
>     we won't be adding to alternates anymore.)

Ditto.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-18 19:55 ` [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Junio C Hamano
@ 2019-09-19  5:18   ` Matheus Tavares Bernardino
  2019-09-20 16:26     ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Matheus Tavares Bernardino @ 2019-09-19  5:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Nieder

On Wed, Sep 18, 2019 at 4:55 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
[...]
> > - textconv cache is written to the_repository's object database even for
> >   submodules. Should it perhaps be written to submodules' odb instead?
>
> You mention "is written", but that is what happens upon a cache
> miss.  Before the code notices a cache miss, it must be checking if
> a cached result is available.  In which odb is it done?  Writing
> that follow the miss should happen to the same one, or the cache is
> not very effective.

Right. Both writing and reading operations on the textconv cache
always happen in the_repository's odb.

> Because you would want the cache to be effective, after running "git
> grep --recurse-submodules" from the top-level, when you chdir down
> to the submodule and say "git grep" to dig further, the answer to
> your question is most likely "yes".

OK, makes sense. To keep this series simple, I think I'll try to work
on this in a following patchset.

> > - Considering the following call chain: grep_source_load_driver() >
> >   userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
> >   prepare_attr_stack() > bootstrap_attr_stack():
> >
> >   * The last function tries to read the attributes from the
> >     .gitattributes and .git/info/attributes files of the_repository.
> >     However, for paths inside the submodule, shouldn't it try to read
> >     these files from the submodule?
>
> Yes, I think all of those would have worked correctly if we forked a
> separate "git grep" inside submodule repository, but in the rush to
> "do everything in process", many things like these are not done
> correctly.  As there is only one attribute cache IIUC, invalidating
> the whole cache for the top-level and replacing it with the one for
> a submodule, every time we cross the module boundary, would probably
> have a negative effect on the performance, and I am not sure what
> would happen if you run more than one threads working in different
> repositories (i.e. top-level and submodules).

Hmm, I may have gotten a little confused here. Are you talking about
the attributes stack (which contains .gitattributes and
info/attributes)? If so, isn't this stack already rebuild for every
path? I mean, by the previous call chain it seems to me that at least
these two files are reread for every path.

> >   * This function will also call: read_attr() > read_attr_from_index() >
> >     read_blob_data_from_index() which might, in turn, call
> >     read_object_file(). Shouldn't we pass the subrepo to it so that it
> >     can call repo_read_object_file()? (Again, for paths inside the
> >     submodule, read_object_file() won't be able to find the object as
> >     we won't be adding to alternates anymore.)

Just as a side note, I noticed another problem with this: in a
submodule with a non-checked-out .gitattributes, `git grep --textconv`
will successfully retrieve the file from the index. But running `git
grep --recurse-submodules --textconv` from the superproject won't. The
problem is that read_blob_data_from_index() doesn't have the
repository struct and thus cannot strip the subrepo_prefix from the
path (so it won't find it in the subrepo's index). I'll try to fix
this in this series as well.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-19  5:18   ` Matheus Tavares Bernardino
@ 2019-09-20 16:26     ` Junio C Hamano
  2019-09-21 20:34       ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2019-09-20 16:26 UTC (permalink / raw)
  To: Matheus Tavares Bernardino; +Cc: git, Jonathan Nieder

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

> Hmm, I may have gotten a little confused here. Are you talking
> about the attributes stack (which contains .gitattributes and
> info/attributes)?  If so, isn't this stack already rebuild for
> every path? I mean, by the previous call chain it seems to me that
> at least these two files are reread for every path.

Yes, but for the switch that happens when coming out of a normal
directory and then descending into another normal directory is just
to pop the entries from the directory hierarchy we are getting out
of, and then pushing the entries from the new directory hierarchy.
We would not be discarding and rereading $GIT_DIR/info/ or the
.gitattribute file from the top-level of the working tree.

Descending into a submodule is fundamentally and completely
different.  None of the attributes defined in the superproject
should affect the paths in the submodule, as it is a totally
separate project, oblivious to the existence of enclosing the
superproject. 

Every time we look up attributes for a path in a project (either the
superproject or a submodule) that is different from the one we did a
look-up the last time, we'd not just need to pop a handful and push
a handful, but need to rebuild the stack completely from scratch,
no?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-20 16:26     ` Junio C Hamano
@ 2019-09-21 20:34       ` Matheus Tavares Bernardino
  2019-09-28  3:24         ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Matheus Tavares Bernardino @ 2019-09-21 20:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Nieder

On Fri, Sep 20, 2019 at 1:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> > Hmm, I may have gotten a little confused here. Are you talking
> > about the attributes stack (which contains .gitattributes and
> > info/attributes)?  If so, isn't this stack already rebuild for
> > every path? I mean, by the previous call chain it seems to me that
> > at least these two files are reread for every path.
>
> Yes, but for the switch that happens when coming out of a normal
> directory and then descending into another normal directory is just
> to pop the entries from the directory hierarchy we are getting out
> of, and then pushing the entries from the new directory hierarchy.
> We would not be discarding and rereading $GIT_DIR/info/ or the
> .gitattribute file from the top-level of the working tree.

Right, this would be the best way of doing it. However, I think this
is not how it's currently implemented. I if correctly understood the
code in this call chain:

grep_source_load_driver() >  userdiff_find_by_path() >
git_check_attr() > collect_some_attrs() > prepare_attr_stack() >
bootstrap_attr_stack()

it seems that the whole stack is being rebuild for every path (even
for paths descending in the same superproject or submodule). So
$GIT_DIR/info/ and .gitattributes are being discarded and reread every
time :(

> Descending into a submodule is fundamentally and completely
> different.  None of the attributes defined in the superproject
> should affect the paths in the submodule, as it is a totally
> separate project, oblivious to the existence of enclosing the
> superproject.

I think we currently have a bug here as the attributes from the
superproject *are* affecting the paths in the submodule. Here is a
small script I wrote to test this: https://gitlab.com/snippets/1896951

The cause of this problem is that boostrap_attr_stack() doesn't read
"<subrepo_prefix>/.gitattributes" but just ".gitattributes", always
getting the superproject's file not the suprepo's. Yet another problem
is that when this file is not present and we need to retrieve it from
the index, this function calls read_attr() > read_attr_from_index() >
read_blob_data_from_index(). The last one always reads from
the_repository's odb, so it won't ever find the subrepo's
".gitattributes".

And a third bug is that when reading attributes of paths inside
subrepo's directories, from index, we call read_attr_from_index() with
a path such as "<subrepo_prefix>/<subdir>/.gitattributes". However,
the subrepo_prefix should be stripped when looking in the subrepo's
index, otherwise there will be no matches.

To fix these three problems, I think we would need to pass on a struct
repository in these call chains. But this would require a very big
modification as there are many places that can lead to one of them...
And there're corner cases such as index_stream_convert_blob() which
would need to receive a struct repository but it always writes to
the_repo, which would be kind of inconsistent. Do you think this would
be a good solution or should I try something else?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] grep: don't add submodules to the alternates list
  2019-09-18  1:56 ` [RFC PATCH 3/3] grep: don't add submodules to the alternates list Matheus Tavares
@ 2019-09-21 21:58   ` Brandon Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Brandon Williams @ 2019-09-21 21:58 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: git, Jonathan Nieder, brian m. carlson,
	Nguyễn Thái Ngọc Duy, Junio C Hamano

On Tue, Sep 17, 2019 at 6:56 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> When grepping with --recurse-submodules, the object directory of the
> submodule is added to the in-memory alternates list. This makes git need
> to watch out for more packfiles which might bring bad consequences for
> memory and performance.
>
> Now that raw_object_store is a member of the struct repository, it's
> possible to use the submodules' instances directly, without the need to
> add its odbs to the alternates list. So let's instead pass the subrepo
> down to the threads and replace function calls which use
> the_repository by default for their more general counterparts. Also,
> submodule cleanup must be refactored as calling repo_clear() at the end
> of grep_submodules(), might free the subrepo's resources before threads
> have finished working on it. To do that, let's keep track of the
> workers progress and only call repo_clear() when it's safe to do so.

This is incredibly exciting. Glad to see the effort that were started years ago
to finally come to fruition.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-21 20:34       ` Matheus Tavares Bernardino
@ 2019-09-28  3:24         ` Junio C Hamano
  2019-09-28  4:20           ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2019-09-28  3:24 UTC (permalink / raw)
  To: Matheus Tavares Bernardino; +Cc: git, Jonathan Nieder

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

> Right, this would be the best way of doing it. However, I think this
> is not how it's currently implemented. I if correctly understood the
> code in this call chain:
>
> grep_source_load_driver() >  userdiff_find_by_path() >
> git_check_attr() > collect_some_attrs() > prepare_attr_stack() >
> bootstrap_attr_stack()
>
> it seems that the whole stack is being rebuild for every path (even
> for paths descending in the same superproject or submodule).

bootstrap is guarded with "if (*stack) return;" and prepare knows to
rewind to the common level and push down the new ones, no?

At least that is how I remember writing the first version of it.
Have we broken the design over time, I wonder?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates
  2019-09-28  3:24         ` Junio C Hamano
@ 2019-09-28  4:20           ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 13+ messages in thread
From: Matheus Tavares Bernardino @ 2019-09-28  4:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Nieder

On Sat, Sep 28, 2019 at 12:24 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> > Right, this would be the best way of doing it. However, I think this
> > is not how it's currently implemented. I if correctly understood the
> > code in this call chain:
> >
> > grep_source_load_driver() >  userdiff_find_by_path() >
> > git_check_attr() > collect_some_attrs() > prepare_attr_stack() >
> > bootstrap_attr_stack()
> >
> > it seems that the whole stack is being rebuild for every path (even
> > for paths descending in the same superproject or submodule).
>
> bootstrap is guarded with "if (*stack) return;" and prepare knows to
> rewind to the common level and push down the new ones, no?

Right, I've somehow missed this guard and the fact that 'check' is
static at userdiff_find_by_path() so the stack is persistent. Thanks
for pointing this out and sorry for the noise.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don'\''t add subrepos to in-memory alternates
  2021-09-27 12:08 [PATCH v3 6/8] grep: add repository to OID grep sources Ævar Arnfjörð Bjarmason
@ 2021-09-27 16:45 ` Matheus Tavares
  2021-09-27 17:30   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 13+ messages in thread
From: Matheus Tavares @ 2021-09-27 16:45 UTC (permalink / raw)
  To: avarab; +Cc: git, gitster, jonathantanmy, matheus.bernardino

On Mon, Sep 27, 2021 at 9:09 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Mon, Aug 16 2021, Jonathan Tan wrote:
>
> > Record the repository whenever an OID grep source is created, and teach
> > the worker threads to explicitly provide the repository when accessing
> > objects.
> > [...]
> > diff --git a/grep.h b/grep.h
> > index 480b3f5bba..128007db65 100644
> > --- a/grep.h
> > +++ b/grep.h
> > @@ -120,7 +120,20 @@ struct grep_opt {
> >       struct grep_pat *header_list;
> >       struct grep_pat **header_tail;
> >       struct grep_expr *pattern_expression;
> > +
> > +     /*
> > +      * NEEDSWORK: See if we can remove this field, because the repository
> > +      * should probably be per-source. That is, grep.c functions using this
> > +      * field should probably start using "repo" in "struct grep_source"
> > +      * instead.
> > +      *
> > +      * This is potentially the cause of at least one bug - "git grep"
> > +      * ignoring the textconv attributes from submodules. See [1] for more
> > +      * information.
> > +      * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
> > +      */
> >       struct repository *repo;
> > +
>
> I ran into this comment and read the linked E-Mail, and then the
> downthread
> https://lore.kernel.org/git/CAHd-oW6uG1fap-T4UF17bJmjoHAqWCDq9KbY+_8a3cEnnfATxg@mail.gmail.com/;
>
> Given Matheus's "I've somehow missed this guard and the..." there I'm
> not quite sure what/if we should be doing here & what this comment is
> recommending? I.e. do we still need to adjust the call chains as noted
> in the E-Mail the comment links to, or not?

I think we should still adjust the call chains, yes. The downthread
message you mentioned is kind of a tangent about performance, where
Junio helped me understand something I had previously missed in the
code, regarding the persistence of the attributes stack.

But the issue that started the thread was about a correctness problem:
the superproject textconv attributes are being used on submodules'
files when running `git grep` with `--recurse-submodules --textconv`.
The three cases to consider are:

- .gitattributes from the working tree
- .gitattributes from the index
- .git/info/attributes

On all these cases, the superproject attributes are being used on the
submodule. Additionally, if the superproject does not define any
attribute, the submodule attributes are being ignored in all cases
except by the first one (but that is only because the code sees the
.gitattributes file on the submodule as if it were a "regular"
subdirectory of the surperproject. So the submodule's .gitattribures
takes higher precedence when evaluating the attributes for files in
that directory).

Another issue is that the textconv cache is always saved to (and read
from) the superproject gitdir, even for submodules' files.

Here are some test cases that demonstrate these issues:

-- snipsnap --
diff --git a/t/t7814-grep-recurse-submodules.sh b/t/t7814-grep-recurse-submodules.sh
index 3172f5b936..d01a3bc5d8 100755
--- a/t/t7814-grep-recurse-submodules.sh
+++ b/t/t7814-grep-recurse-submodules.sh
@@ -441,4 +441,104 @@ test_expect_success 'grep --recurse-submodules with --cached ignores worktree mo
 	test_must_fail git grep --recurse-submodules --cached "A modified line in submodule" >actual 2>&1 &&
 	test_must_be_empty actual
 '
+
+test_expect_failure 'grep --textconv: superproject .gitattributes does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv: superproject .gitattributes (from index) does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >.gitattributes &&
+	git add .gitattributes &&
+	rm .gitattributes &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv: superproject .git/info/attributes does not affect submodules' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	super_attr="$(git rev-parse --path-format=relative --git-path info/attributes)" &&
+	test_when_finished rm -f "$super_attr" &&
+	echo "a diff=d2x" >"$super_attr" &&
+
+	cat >expect <<-\EOF &&
+	a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --textconv corectly reads submodule .gitattributes' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >submodule/.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv corectly reads submodule .gitattributes (from index)' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+	echo "a diff=d2x" >submodule/.gitattributes &&
+	git -C submodule add .gitattributes &&
+	rm submodule/.gitattributes &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep --textconv corectly reads submodule .git/info/attributes' '
+	reset_and_clean &&
+	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
+
+	# Workaround: we use --path-format=relative because the absolute path
+	# contains whitespaces and that seems to confuse test_when_finished
+	#
+	submodule_attr="submodule/$(git -C submodule rev-parse --path-format=relative --git-path info/attributes)" &&
+	test_when_finished rm -f "$submodule_attr" &&
+	echo "a diff=d2x" >"$submodule_attr" &&
+
+	cat >expect <<-\EOF &&
+	submodule/a:(1|2)x(3|4)
+	EOF
+	git grep --textconv --recurse-submodules x >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'grep saves textconv cache in the appropriated repository' '
+	reset_and_clean &&
+	test_config_global diff.d2x_cached.textconv "sed -e \"s/d/x/\"" &&
+	test_config_global diff.d2x_cached.cachetextconv true &&
+	echo "a diff=d2x_cached" >submodule/.gitattributes &&
+
+	# Note: we only read/write to the textconv cache when grepping from an
+	# OID as the working tree file might have modifications. That is why
+	# we use --cached here.
+	#
+	git grep --textconv --cached --recurse-submodules x &&
+	test_path_is_missing "$(git rev-parse --git-path refs/notes/textconv/d2x_cached)" &&
+	test_path_is_file "$(git -C submodule rev-parse --git-path refs/notes/textconv/d2x_cached)"
+'
+
 test_done
--

Junio seemed to agree that the behavior I described above is not the
correct one:

"None of the attributes defined in the superproject
should affect the paths in the submodule, as it is a totally
separate project, oblivious to the existence of enclosing the
superproject."

But he raised an important concern regarding how to fix this without
affecting [too much] performance:

"As there is only one attribute cache IIUC, invalidating
the whole cache for the top-level and replacing it with the one for
a submodule, every time we cross the module boundary, would probably
have a negative effect on the performance, and I am not sure what
would happen if you run more than one threads working in different
repositories (i.e. top-level and submodules)."

Maybe we would need a different attributes stack for each repository?

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/3] grep: don'\''t add subrepos to in-memory alternates
  2021-09-27 16:45 ` [RFC PATCH 0/3] grep: don'\''t add subrepos to in-memory alternates Matheus Tavares
@ 2021-09-27 17:30   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-27 17:30 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, gitster, jonathantanmy


On Mon, Sep 27 2021, Matheus Tavares wrote:

> On Mon, Sep 27, 2021 at 9:09 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>>
>> On Mon, Aug 16 2021, Jonathan Tan wrote:
>>
>> > Record the repository whenever an OID grep source is created, and teach
>> > the worker threads to explicitly provide the repository when accessing
>> > objects.
>> > [...]
>> > diff --git a/grep.h b/grep.h
>> > index 480b3f5bba..128007db65 100644
>> > --- a/grep.h
>> > +++ b/grep.h
>> > @@ -120,7 +120,20 @@ struct grep_opt {
>> >       struct grep_pat *header_list;
>> >       struct grep_pat **header_tail;
>> >       struct grep_expr *pattern_expression;
>> > +
>> > +     /*
>> > +      * NEEDSWORK: See if we can remove this field, because the repository
>> > +      * should probably be per-source. That is, grep.c functions using this
>> > +      * field should probably start using "repo" in "struct grep_source"
>> > +      * instead.
>> > +      *
>> > +      * This is potentially the cause of at least one bug - "git grep"
>> > +      * ignoring the textconv attributes from submodules. See [1] for more
>> > +      * information.
>> > +      * [1] https://lore.kernel.org/git/CAHd-oW5iEQarYVxEXoTG-ua2zdoybTrSjCBKtO0YT292fm0NQQ@mail.gmail.com/
>> > +      */
>> >       struct repository *repo;
>> > +
>>
>> I ran into this comment and read the linked E-Mail, and then the
>> downthread
>> https://lore.kernel.org/git/CAHd-oW6uG1fap-T4UF17bJmjoHAqWCDq9KbY+_8a3cEnnfATxg@mail.gmail.com/;
>>
>> Given Matheus's "I've somehow missed this guard and the..." there I'm
>> not quite sure what/if we should be doing here & what this comment is
>> recommending? I.e. do we still need to adjust the call chains as noted
>> in the E-Mail the comment links to, or not?
>
> I think we should still adjust the call chains, yes. The downthread
> message you mentioned is kind of a tangent about performance, where
> Junio helped me understand something I had previously missed in the
> code, regarding the persistence of the attributes stack.
>
> But the issue that started the thread was about a correctness problem:
> the superproject textconv attributes are being used on submodules'
> files when running `git grep` with `--recurse-submodules --textconv`.
> The three cases to consider are:
>
> - .gitattributes from the working tree
> - .gitattributes from the index
> - .git/info/attributes
>
> On all these cases, the superproject attributes are being used on the
> submodule. Additionally, if the superproject does not define any
> attribute, the submodule attributes are being ignored in all cases
> except by the first one (but that is only because the code sees the
> .gitattributes file on the submodule as if it were a "regular"
> subdirectory of the surperproject. So the submodule's .gitattribures
> takes higher precedence when evaluating the attributes for files in
> that directory).
>
> Another issue is that the textconv cache is always saved to (and read
> from) the superproject gitdir, even for submodules' files.
>
> Here are some test cases that demonstrate these issues:
>
> -- snipsnap --
> diff --git a/t/t7814-grep-recurse-submodules.sh b/t/t7814-grep-recurse-submodules.sh
> index 3172f5b936..d01a3bc5d8 100755
> --- a/t/t7814-grep-recurse-submodules.sh
> +++ b/t/t7814-grep-recurse-submodules.sh
> @@ -441,4 +441,104 @@ test_expect_success 'grep --recurse-submodules with --cached ignores worktree mo
>  	test_must_fail git grep --recurse-submodules --cached "A modified line in submodule" >actual 2>&1 &&
>  	test_must_be_empty actual
>  '
> +
> +test_expect_failure 'grep --textconv: superproject .gitattributes does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv: superproject .gitattributes (from index) does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >.gitattributes &&
> +	git add .gitattributes &&
> +	rm .gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv: superproject .git/info/attributes does not affect submodules' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	super_attr="$(git rev-parse --path-format=relative --git-path info/attributes)" &&
> +	test_when_finished rm -f "$super_attr" &&
> +	echo "a diff=d2x" >"$super_attr" &&
> +
> +	cat >expect <<-\EOF &&
> +	a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --textconv corectly reads submodule .gitattributes' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >submodule/.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv corectly reads submodule .gitattributes (from index)' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +	echo "a diff=d2x" >submodule/.gitattributes &&
> +	git -C submodule add .gitattributes &&
> +	rm submodule/.gitattributes &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep --textconv corectly reads submodule .git/info/attributes' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x.textconv "sed -e \"s/d/x/\"" &&
> +
> +	# Workaround: we use --path-format=relative because the absolute path
> +	# contains whitespaces and that seems to confuse test_when_finished
> +	#
> +	submodule_attr="submodule/$(git -C submodule rev-parse --path-format=relative --git-path info/attributes)" &&
> +	test_when_finished rm -f "$submodule_attr" &&
> +	echo "a diff=d2x" >"$submodule_attr" &&
> +
> +	cat >expect <<-\EOF &&
> +	submodule/a:(1|2)x(3|4)
> +	EOF
> +	git grep --textconv --recurse-submodules x >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_failure 'grep saves textconv cache in the appropriated repository' '
> +	reset_and_clean &&
> +	test_config_global diff.d2x_cached.textconv "sed -e \"s/d/x/\"" &&
> +	test_config_global diff.d2x_cached.cachetextconv true &&
> +	echo "a diff=d2x_cached" >submodule/.gitattributes &&
> +
> +	# Note: we only read/write to the textconv cache when grepping from an
> +	# OID as the working tree file might have modifications. That is why
> +	# we use --cached here.
> +	#
> +	git grep --textconv --cached --recurse-submodules x &&
> +	test_path_is_missing "$(git rev-parse --git-path refs/notes/textconv/d2x_cached)" &&
> +	test_path_is_file "$(git -C submodule rev-parse --git-path refs/notes/textconv/d2x_cached)"
> +'
> +
>  test_done

Thanks! I think it would be very good to have these tests in-tree along
with an updated comment pointing to them.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-09-27 17:46 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-18  1:56 [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Matheus Tavares
2019-09-18  1:56 ` [RFC PATCH 1/3] diff: use the given repo at diff_populate_filespec() Matheus Tavares
2019-09-18  1:56 ` [RFC PATCH 2/3] object: allow parse_object_or_die() to handle any repo Matheus Tavares
2019-09-18  1:56 ` [RFC PATCH 3/3] grep: don't add submodules to the alternates list Matheus Tavares
2019-09-21 21:58   ` Brandon Williams
2019-09-18 19:55 ` [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates Junio C Hamano
2019-09-19  5:18   ` Matheus Tavares Bernardino
2019-09-20 16:26     ` Junio C Hamano
2019-09-21 20:34       ` Matheus Tavares Bernardino
2019-09-28  3:24         ` Junio C Hamano
2019-09-28  4:20           ` Matheus Tavares Bernardino
  -- strict thread matches above, loose matches on Subject: below --
2021-09-27 12:08 [PATCH v3 6/8] grep: add repository to OID grep sources Ævar Arnfjörð Bjarmason
2021-09-27 16:45 ` [RFC PATCH 0/3] grep: don'\''t add subrepos to in-memory alternates Matheus Tavares
2021-09-27 17:30   ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).