git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/4] name-rev: improve memory usage
@ 2019-03-01 17:50 Alban Gruin
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 17:50 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Alban Gruin

rafasc reported on IRC that on a repository with a lot of branches,
tags, remotes, and commits, name-rev --stdin could use a massive amount
of memory (more than 2GB of RAM) to complete.

This patch series tries to improve name-rev’s memory usage.

There is some improvement that could be done, such as reference counting
the names attributed to commits.  Tell me if it could be worth to pursue
this way, or if name-rev’s internals would need a more thorough rewrite.

This is based on master (8104ec994e, "Git 2.21").

The tip of this series is tagged as "fix-name-rev-leak-rfc-v1" in
https://github.com/agrn/git.

Alban Gruin (4):
  name-rev: improve name_rev() memory usage
  commit-list: add a function to check if a commit is in a list
  name-rev: check if a commit should be named before naming it
  name-rev: avoid naming from a ref if it’s not a descendant of any
    commit

 builtin/name-rev.c | 124 +++++++++++++++++++++++++++++++++++----------
 commit.c           |  12 +++++
 commit.h           |   1 +
 3 files changed, 111 insertions(+), 26 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/4] name-rev: improve name_rev() memory usage
  2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
@ 2019-03-01 17:50 ` Alban Gruin
  2019-03-01 18:00   ` Eric Sunshine
                     ` (2 more replies)
  2019-03-01 17:50 ` [RFC PATCH 2/4] commit-list: add a function to check if a commit is in a list Alban Gruin
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 17:50 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Alban Gruin

name_rev() is a recursive function.  For each commit, it allocates the
name of its parents, and call itself.  A parent may not use a name for
multiple reasons, but in any case, the name will not be released.  On a
repository with a lot of branches, tags, remotes, and commits, it can
use more than 2GB of RAM.

To improve the situation, name_rev() now returns a boolean to its caller
indicating if it can release the name.  The caller may free the name if
the commit is too old, or if the new name is not better than the current
name.

There a condition that will always be false here when name_rev() calls
itself for the first parent, but it will become useful when name_rev()
will stop to name commits that are not mentionned in the stdin buffer.
If the current commit should not be named, its parents may have to be,
but they may not.  In this case, name_rev() will tell to its caller that
the current commit and its first parent has not used the name, and that
it can be released.  However, if the current commit has been named but
not its parent, or the reverse, the name will not be released.

Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
---
 builtin/name-rev.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index f1cb45c227..0719a9388d 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -77,7 +77,7 @@ static int is_better_name(struct rev_name *name,
 	return 0;
 }
 
-static void name_rev(struct commit *commit,
+static int name_rev(struct commit *commit,
 		const char *tip_name, timestamp_t taggerdate,
 		int generation, int distance, int from_tag,
 		int deref)
@@ -86,11 +86,12 @@ static void name_rev(struct commit *commit,
 	struct commit_list *parents;
 	int parent_number = 1;
 	char *to_free = NULL;
+	int free_alloc = 1;
 
 	parse_commit(commit);
 
 	if (commit->date < cutoff)
-		return;
+		return 1;
 
 	if (deref) {
 		tip_name = to_free = xstrfmt("%s^0", tip_name);
@@ -111,9 +112,10 @@ static void name_rev(struct commit *commit,
 		name->generation = generation;
 		name->distance = distance;
 		name->from_tag = from_tag;
+		free_alloc = 0;
 	} else {
 		free(to_free);
-		return;
+		return 1;
 	}
 
 	for (parents = commit->parents;
@@ -131,15 +133,18 @@ static void name_rev(struct commit *commit,
 				new_name = xstrfmt("%.*s^%d", (int)len, tip_name,
 						   parent_number);
 
-			name_rev(parents->item, new_name, taggerdate, 0,
-				 distance + MERGE_TRAVERSAL_WEIGHT,
-				 from_tag, 0);
+			if (name_rev(parents->item, new_name, taggerdate, 0,
+				      distance + MERGE_TRAVERSAL_WEIGHT,
+				      from_tag, 0))
+				free(new_name);
 		} else {
-			name_rev(parents->item, tip_name, taggerdate,
-				 generation + 1, distance + 1,
-				 from_tag, 0);
+			free_alloc &= name_rev(parents->item, tip_name, taggerdate,
+					       generation + 1, distance + 1,
+					       from_tag, 0);
 		}
 	}
+
+	return free_alloc;
 }
 
 static int subpath_matches(const char *path, const char *filter)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/4] commit-list: add a function to check if a commit is in a list
  2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
@ 2019-03-01 17:50 ` Alban Gruin
  2019-03-01 17:50 ` [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it Alban Gruin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 17:50 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Alban Gruin

To avoid naming some commits, name_rev() will need to check if a commit
is part of a commit list.

Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
---
 commit.c | 12 ++++++++++++
 commit.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/commit.c b/commit.c
index a5333c7ac6..fcb3e9245f 100644
--- a/commit.c
+++ b/commit.c
@@ -524,6 +524,18 @@ struct commit_list *commit_list_insert(struct commit *item, struct commit_list *
 	return new_list;
 }
 
+int commit_list_contains(const struct commit_list *l, struct commit *commit)
+{
+	const struct commit_list *item;
+
+	for (item = l; item != NULL; item = item->next) {
+		if (oideq(&item->item->object.oid, &commit->object.oid))
+			return 1;
+	}
+
+	return 0;
+}
+
 unsigned commit_list_count(const struct commit_list *l)
 {
 	unsigned c = 0;
diff --git a/commit.h b/commit.h
index 42728c2906..c9df613b0e 100644
--- a/commit.h
+++ b/commit.h
@@ -165,6 +165,7 @@ struct commit_list *commit_list_insert(struct commit *item,
 					struct commit_list **list);
 struct commit_list **commit_list_append(struct commit *commit,
 					struct commit_list **next);
+int commit_list_contains(const struct commit_list *l, struct commit *commit);
 unsigned commit_list_count(const struct commit_list *l);
 struct commit_list *commit_list_insert_by_date(struct commit *item,
 				    struct commit_list **list);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it
  2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
  2019-03-01 17:50 ` [RFC PATCH 2/4] commit-list: add a function to check if a commit is in a list Alban Gruin
@ 2019-03-01 17:50 ` Alban Gruin
  2019-03-01 18:05   ` Eric Sunshine
  2019-03-01 17:50 ` [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit Alban Gruin
  2019-03-01 18:42 ` [RFC PATCH 0/4] name-rev: improve memory usage Jeff King
  4 siblings, 1 reply; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 17:50 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Alban Gruin

Until now, name_rev() named every commit it found, even it the name
would ultimately be unused.

This makes name_rev() take a commit_list and check if the commit it wants
to name is part of the list.  If it is, the commit is named, and
name_rev() signals to its caller that the name should not be freed.  If
it is not, the commit is left unnamed.  In this case, the name can still
be used by the first descendant of this commit (or one of its descendants).

Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
---
 builtin/name-rev.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index 0719a9388d..2f89ed50a1 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -80,7 +80,7 @@ static int is_better_name(struct rev_name *name,
 static int name_rev(struct commit *commit,
 		const char *tip_name, timestamp_t taggerdate,
 		int generation, int distance, int from_tag,
-		int deref)
+		int deref, struct commit_list *commits)
 {
 	struct rev_name *name = get_commit_rev_name(commit);
 	struct commit_list *parents;
@@ -107,12 +107,18 @@ static int name_rev(struct commit *commit,
 	} else if (is_better_name(name, tip_name, taggerdate,
 				  generation, distance, from_tag)) {
 copy_data:
-		name->tip_name = tip_name;
+		if (commit_list_contains(commits, commit) ||
+		    commit_list_count(commits) == 0) {
+			name->tip_name = tip_name;
+			free_alloc = 0;
+		} else {
+			name->tip_name = NULL;
+		}
+
 		name->taggerdate = taggerdate;
 		name->generation = generation;
 		name->distance = distance;
 		name->from_tag = from_tag;
-		free_alloc = 0;
 	} else {
 		free(to_free);
 		return 1;
@@ -135,12 +141,12 @@ static int name_rev(struct commit *commit,
 
 			if (name_rev(parents->item, new_name, taggerdate, 0,
 				      distance + MERGE_TRAVERSAL_WEIGHT,
-				      from_tag, 0))
+				      from_tag, 0, commits))
 				free(new_name);
 		} else {
 			free_alloc &= name_rev(parents->item, tip_name, taggerdate,
 					       generation + 1, distance + 1,
-					       from_tag, 0);
+					       from_tag, 0, commits);
 		}
 	}
 
@@ -279,7 +285,7 @@ static int name_ref(const char *path, const struct object_id *oid, int flags, vo
 			taggerdate = ((struct commit *)o)->date;
 		path = name_ref_abbrev(path, can_abbreviate_output);
 		name_rev(commit, xstrdup(path), taggerdate, 0, 0,
-			 from_tag, deref);
+			 from_tag, deref, NULL);
 	}
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit
  2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
                   ` (2 preceding siblings ...)
  2019-03-01 17:50 ` [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it Alban Gruin
@ 2019-03-01 17:50 ` Alban Gruin
  2019-03-01 18:07   ` Eric Sunshine
  2019-03-03 19:33   ` Christian Couder
  2019-03-01 18:42 ` [RFC PATCH 0/4] name-rev: improve memory usage Jeff King
  4 siblings, 2 replies; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 17:50 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Alban Gruin

A ref may not be the descendant of all of the commits mentionned in
stdin.  In this case, we want to avoid naming its parents.

To do this, find_commits_in_strbuf() is created.  It returns a raw list
of all the commits that have been found in the input buffer.  ishex() is
converted to an inlined function.  Then, we add a raw list of commits in
the name_ref_data structure, and call find_commits_in_strbuf() before
for_each_ref() if the user wants name-ref to process stdin.  Then, for
each ref, we check if the reachable subset of this commit list is empty
or not.  If it is, we do not call name_rev(), so we don’t name its
parents.

The code dealing with stdin after calling for_each_ref() is no longer
needed as we already read it.  name_rev_line() is renamed name_rev_buf()
to reflect its new role better.

Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
---
 builtin/name-rev.c | 91 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 15 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index 2f89ed50a1..f5860f5625 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -8,6 +8,7 @@
 #include "parse-options.h"
 #include "sha1-lookup.h"
 #include "commit-slab.h"
+#include "commit-reach.h"
 
 #define CUTOFF_DATE_SLOP 86400 /* one day */
 
@@ -183,6 +184,8 @@ struct name_ref_data {
 	int name_only;
 	struct string_list ref_filters;
 	struct string_list exclude_filters;
+	struct commit **commits;
+	int commits_nr;
 };
 
 static struct tip_table {
@@ -279,13 +282,21 @@ static int name_ref(const char *path, const struct object_id *oid, int flags, vo
 	}
 	if (o && o->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)o;
+		struct commit_list *reachable_commits = NULL;
 		int from_tag = starts_with(path, "refs/tags/");
 
-		if (taggerdate == TIME_MAX)
-			taggerdate = ((struct commit *)o)->date;
-		path = name_ref_abbrev(path, can_abbreviate_output);
-		name_rev(commit, xstrdup(path), taggerdate, 0, 0,
-			 from_tag, deref, NULL);
+		reachable_commits = get_reachable_subset(&commit, 1,
+							 data->commits, data->commits_nr, 0);
+
+		if (commit_list_count(reachable_commits) > 0 || data->commits_nr == 0) {
+			if (taggerdate == TIME_MAX)
+				taggerdate = ((struct commit *)o)->date;
+			path = name_ref_abbrev(path, can_abbreviate_output);
+			name_rev(commit, xstrdup(path), taggerdate, 0, 0,
+				 from_tag, deref, reachable_commits);
+		}
+
+		free_commit_list(reachable_commits);
 	}
 	return 0;
 }
@@ -369,13 +380,53 @@ static char const * const name_rev_usage[] = {
 	NULL
 };
 
-static void name_rev_line(char *p, struct name_ref_data *data)
+static inline int ishex(char p)
+{
+	return isdigit(p) || (p >= 'a' && p <= 'f');
+}
+
+static struct commit **find_commits_in_strbuf(struct strbuf *buf, int *count)
+{
+	int forty = 0;
+	char *p;
+	struct commit **commits = NULL;
+
+	*count = 0;
+
+	for (p = buf->buf; *p; p++) {
+		if (!ishex(*p))
+			forty = 0;
+		else if (++forty == GIT_SHA1_HEXSZ &&
+			 !ishex(*(p+1))) {
+			struct object_id oid;
+			char c = *(p+1);
+
+			*(p+1) = 0;
+			if (!get_oid(p - (GIT_SHA1_HEXSZ -1), &oid)) {
+				struct object *o =
+					parse_object(the_repository, &oid);
+
+				if (o && o->type == OBJ_COMMIT) {
+					struct commit *c = (struct commit *) o;
+
+					REALLOC_ARRAY(commits, (*count) + 1);
+					commits[(*count)++] = c;
+					set_commit_rev_name(c, NULL);
+				}
+			}
+			*(p+1) = c;
+		}
+	}
+
+	return commits;
+}
+
+static void name_rev_buf(char *p, struct name_ref_data *data)
 {
 	struct strbuf buf = STRBUF_INIT;
 	int forty = 0;
 	char *p_start;
 	for (p_start = p; *p; p++) {
-#define ishex(x) (isdigit((x)) || ((x) >= 'a' && (x) <= 'f'))
 		if (!ishex(*p))
 			forty = 0;
 		else if (++forty == GIT_SHA1_HEXSZ &&
@@ -419,7 +470,7 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 {
 	struct object_array revs = OBJECT_ARRAY_INIT;
 	int all = 0, transform_stdin = 0, allow_undefined = 1, always = 0, peel_tag = 0;
-	struct name_ref_data data = { 0, 0, STRING_LIST_INIT_NODUP, STRING_LIST_INIT_NODUP };
+	struct name_ref_data data = { 0, 0, STRING_LIST_INIT_NODUP, STRING_LIST_INIT_NODUP, NULL, 0 };
 	struct option opts[] = {
 		OPT_BOOL(0, "name-only", &data.name_only, N_("print only names (no SHA-1)")),
 		OPT_BOOL(0, "tags", &data.tags_only, N_("only use tags to name the commits")),
@@ -496,20 +547,27 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 
 	if (cutoff)
 		cutoff = cutoff - CUTOFF_DATE_SLOP;
-	for_each_ref(name_ref, &data);
 
 	if (transform_stdin) {
-		char buffer[2048];
+		struct strbuf buf = STRBUF_INIT;
 
-		while (!feof(stdin)) {
-			char *p = fgets(buffer, sizeof(buffer), stdin);
-			if (!p)
-				break;
-			name_rev_line(p, &data);
+		strbuf_read(&buf, STDIN_FILENO, 0);
+		data.commits = find_commits_in_strbuf(&buf, &data.commits_nr);
+
+		if (data.commits_nr > 0) {
+			for_each_ref(name_ref, &data);
+			name_rev_buf(buf.buf, &data);
+		} else {
+			fwrite(buf.buf, buf.len, 1, stdout);
 		}
+
+		free(data.commits);
+		strbuf_release(&buf);
 	} else if (all) {
 		int i, max;
 
+		for_each_ref(name_ref, &data);
+
 		max = get_max_object_index();
 		for (i = 0; i < max; i++) {
 			struct object *obj = get_indexed_object(i);
@@ -520,6 +578,9 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		}
 	} else {
 		int i;
+
+		for_each_ref(name_ref, &data);
+
 		for (i = 0; i < revs.nr; i++)
 			show_name(revs.objects[i].item, revs.objects[i].name,
 				  always, allow_undefined, data.name_only);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/4] name-rev: improve name_rev() memory usage
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
@ 2019-03-01 18:00   ` Eric Sunshine
  2019-03-01 18:44   ` Jeff King
  2019-03-02 21:28   ` Johannes Schindelin
  2 siblings, 0 replies; 18+ messages in thread
From: Eric Sunshine @ 2019-03-01 18:00 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 1, 2019 at 12:50 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> There a condition that will always be false here when name_rev() calls
> itself for the first parent, but it will become useful when name_rev()
> will stop to name commits that are not mentionned in the stdin buffer.

s/mentionned/mentioned/

> If the current commit should not be named, its parents may have to be,
> but they may not.  In this case, name_rev() will tell to its caller that

s/tell/report/

> the current commit and its first parent has not used the name, and that
> it can be released.  However, if the current commit has been named but
> not its parent, or the reverse, the name will not be released.
>
> Signed-off-by: Alban Gruin <alban.gruin@gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it
  2019-03-01 17:50 ` [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it Alban Gruin
@ 2019-03-01 18:05   ` Eric Sunshine
  2019-03-01 18:22     ` Alban Gruin
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Sunshine @ 2019-03-01 18:05 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 1, 2019 at 12:50 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> Until now, name_rev() named every commit it found, even it the name

s/even it/even if/

> would ultimately be unused.
> [...]
> Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
> ---
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> @@ -107,12 +107,18 @@ static int name_rev(struct commit *commit,
>  copy_data:
> -               name->tip_name = tip_name;
> +               if (commit_list_contains(commits, commit) ||
> +                   commit_list_count(commits) == 0) {

Minor: It would probably be more efficient to check if the count is 0
first, and only search the list if not.

By the way, how big is 'commits' likely to get? Will the linear scan
done by commit_list_contains() become an issue? Should it be using a
hash table instead?

> +                       name->tip_name = tip_name;
> +                       free_alloc = 0;
> +               } else {
> +                       name->tip_name = NULL;
> +               }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit
  2019-03-01 17:50 ` [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit Alban Gruin
@ 2019-03-01 18:07   ` Eric Sunshine
  2019-03-03 19:33   ` Christian Couder
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Sunshine @ 2019-03-01 18:07 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 1, 2019 at 12:50 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> A ref may not be the descendant of all of the commits mentionned in
> stdin.  In this case, we want to avoid naming its parents.

s/mentionned/mentioned/

>
> To do this, find_commits_in_strbuf() is created.  It returns a raw list
> of all the commits that have been found in the input buffer.  ishex() is
> converted to an inlined function.  Then, we add a raw list of commits in
> the name_ref_data structure, and call find_commits_in_strbuf() before
> for_each_ref() if the user wants name-ref to process stdin.  Then, for
> each ref, we check if the reachable subset of this commit list is empty
> or not.  If it is, we do not call name_rev(), so we don’t name its
> parents.
>
> The code dealing with stdin after calling for_each_ref() is no longer
> needed as we already read it.  name_rev_line() is renamed name_rev_buf()
> to reflect its new role better.
>
> Signed-off-by: Alban Gruin <alban.gruin@gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it
  2019-03-01 18:05   ` Eric Sunshine
@ 2019-03-01 18:22     ` Alban Gruin
  2019-03-01 18:37       ` Jeff King
  0 siblings, 1 reply; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 18:22 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git Mailing List

Hi Eric,

Le 01/03/2019 à 19:05, Eric Sunshine a écrit :
> On Fri, Mar 1, 2019 at 12:50 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> -%<-
> Minor: It would probably be more efficient to check if the count is 0
> first, and only search the list if not.
> 
> By the way, how big is 'commits' likely to get? Will the linear scan
> done by commit_list_contains() become an issue? Should it be using a
> hash table instead?
> 

It depends on the amount of commits mentionned in stdin that are
reachable from the ref in name_ref().  If there is a lot of commit in
the input, it may effectively become a problem.

I thought of adding a field to the rev_name structure for this purpose.
 I think commit slabs are hash maps under the hood?

>> +                       name->tip_name = tip_name;
>> +                       free_alloc = 0;
>> +               } else {
>> +                       name->tip_name = NULL;
>> +               }

Cheers,
Alban


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it
  2019-03-01 18:22     ` Alban Gruin
@ 2019-03-01 18:37       ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2019-03-01 18:37 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Eric Sunshine, Git Mailing List

On Fri, Mar 01, 2019 at 07:22:47PM +0100, Alban Gruin wrote:

> Le 01/03/2019 à 19:05, Eric Sunshine a écrit :
> > On Fri, Mar 1, 2019 at 12:50 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> > -%<-
> > Minor: It would probably be more efficient to check if the count is 0
> > first, and only search the list if not.
> > 
> > By the way, how big is 'commits' likely to get? Will the linear scan
> > done by commit_list_contains() become an issue? Should it be using a
> > hash table instead?
> > 
> 
> It depends on the amount of commits mentionned in stdin that are
> reachable from the ref in name_ref().  If there is a lot of commit in
> the input, it may effectively become a problem.

Yeah, I think this is quadratic in the worst case.

> I thought of adding a field to the rev_name structure for this purpose.
>  I think commit slabs are hash maps under the hood?

They're not hash maps, but they're still O(1). Each commit struct is
allocated a unique integer id, and the slab just keeps a dynamic array
of one entry per commit.

So they're actually faster than a hashmap (it's a pure index, no oideq()
required).  The downside is that they can use more memory than a hashmap
if they're sparsely populated. However in your case, I think you'd load
a couple of "struct commit" objects at the start, mark them as "1" in
the slab, and then never mark any more as you traverse. So you'd only
allocate entries for those tightly-packed initial ones.

It seems like an ideal case for using a slab.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] name-rev: improve memory usage
  2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
                   ` (3 preceding siblings ...)
  2019-03-01 17:50 ` [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit Alban Gruin
@ 2019-03-01 18:42 ` Jeff King
  2019-03-01 19:14   ` Alban Gruin
  4 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2019-03-01 18:42 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:

> rafasc reported on IRC that on a repository with a lot of branches,
> tags, remotes, and commits, name-rev --stdin could use a massive amount
> of memory (more than 2GB of RAM) to complete.
> 
> This patch series tries to improve name-rev’s memory usage.

Have you tried this?

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index f1cb45c227..7aaa86f1c0 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		OPT_END(),
 	};
 
+	save_commit_buffer = 0;
+
 	init_commit_rev_name(&rev_names);
 	git_config(git_default_config, NULL);
 	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);

It seems to lower heap usage of:

  git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

in linux.git (that commit is the final one reported by "git log", so
it's traversing all of history) from ~1GB to ~300MB.

-Peff

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/4] name-rev: improve name_rev() memory usage
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
  2019-03-01 18:00   ` Eric Sunshine
@ 2019-03-01 18:44   ` Jeff King
  2019-03-02 21:28   ` Johannes Schindelin
  2 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2019-03-01 18:44 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 01, 2019 at 06:50:21PM +0100, Alban Gruin wrote:

> name_rev() is a recursive function.  For each commit, it allocates the
> name of its parents, and call itself.  A parent may not use a name for
> multiple reasons, but in any case, the name will not be released.  On a
> repository with a lot of branches, tags, remotes, and commits, it can
> use more than 2GB of RAM.

I don't know name-rev that well, so I didn't look super carefully at
your patch. But if it's recursive in history, that's almost certainly
going to be a problem for some histories on some platforms. We've had to
de-recursify several other algorithms (e.g., "tag --contains") because
of stack space issues.

Which isn't to say what you're doing here might not be separately
useful, but just know that in the long run that's probably where we'd
need to take this.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] name-rev: improve memory usage
  2019-03-01 18:42 ` [RFC PATCH 0/4] name-rev: improve memory usage Jeff King
@ 2019-03-01 19:14   ` Alban Gruin
  2019-03-01 19:39     ` Jeff King
  0 siblings, 1 reply; 18+ messages in thread
From: Alban Gruin @ 2019-03-01 19:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

Hi Jeff,

Le 01/03/2019 à 19:42, Jeff King a écrit :
> On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:
> 
>> rafasc reported on IRC that on a repository with a lot of branches,
>> tags, remotes, and commits, name-rev --stdin could use a massive amount
>> of memory (more than 2GB of RAM) to complete.
>>
>> This patch series tries to improve name-rev’s memory usage.
> 
> Have you tried this?
> 
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index f1cb45c227..7aaa86f1c0 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
>  		OPT_END(),
>  	};
>  
> +	save_commit_buffer = 0;
> +
>  	init_commit_rev_name(&rev_names);
>  	git_config(git_default_config, NULL);
>  	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);
> 
> It seems to lower heap usage of:
> 
>   git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
> 
> in linux.git (that commit is the final one reported by "git log", so
> it's traversing all of history) from ~1GB to ~300MB.
> 
> -Peff
> 

Unfortunately this does not work in all cases, apparently.  On my git
copy, I have 3 origins.  If I run this:

	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin

With or without your change, it uses 3GB of RAM.  With this series, it
uses 25MB of RAM.

With the first git commit:

	git name-rev e83c5163316f89bfbde7d9ab23ca2e25604af290

It also uses 3GB of RAM.  With this series, it uses around 350MB of RAM.
 Which makes me think that I should adapt 4/4 to arguments.

Sorry, I should have specified this in the cover letter.

Cheers,
Alban


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] name-rev: improve memory usage
  2019-03-01 19:14   ` Alban Gruin
@ 2019-03-01 19:39     ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2019-03-01 19:39 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 01, 2019 at 08:14:26PM +0100, Alban Gruin wrote:

> > diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> > index f1cb45c227..7aaa86f1c0 100644
> > --- a/builtin/name-rev.c
> > +++ b/builtin/name-rev.c
> > @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
> >  		OPT_END(),
> >  	};
> >  
> > +	save_commit_buffer = 0;
> > +
> [...]
> 
> Unfortunately this does not work in all cases, apparently.  On my git
> copy, I have 3 origins.  If I run this:
> 
> 	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin
> 
> With or without your change, it uses 3GB of RAM.  With this series, it
> uses 25MB of RAM.

Sorry if I was unclear. I meant to try that _in addition_ to your
changes. It helps by avoiding keeping the useless commit-object buffers
in RAM as we traverse. But the most it can save is the total
uncompressed bytes of all commit objects. I.e., in git.git:

  $ git cat-file --batch-check='%(objectsize) %(objecttype)' --batch-all-objects |
    grep commit |
    perl -alne '$total += $F[0]; END { print $total }'
  74678114

or around 70MB. In linux.git, it's more like 700MB.

But in your examples, the problem is the inefficiencies in name-rev's
algorithm, and you're not actually traversing that many commits. So I
think you'd want to turn off save_commit_buffer as an extra patch in
your series. It may or not be a big win for any given case, but it's
quite easy to do.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/4] name-rev: improve name_rev() memory usage
  2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
  2019-03-01 18:00   ` Eric Sunshine
  2019-03-01 18:44   ` Jeff King
@ 2019-03-02 21:28   ` Johannes Schindelin
  2 siblings, 0 replies; 18+ messages in thread
From: Johannes Schindelin @ 2019-03-02 21:28 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

Hi Alban,

On Fri, 1 Mar 2019, Alban Gruin wrote:

> name_rev() is a recursive function.  For each commit, it allocates the
> name of its parents, and call itself.  A parent may not use a name for
> multiple reasons, but in any case, the name will not be released.  On a
> repository with a lot of branches, tags, remotes, and commits, it can
> use more than 2GB of RAM.
> 
> To improve the situation, name_rev() now returns a boolean to its caller
> indicating if it can release the name.  The caller may free the name if
> the commit is too old, or if the new name is not better than the current
> name.
> 
> There a condition that will always be false here when name_rev() calls
> itself for the first parent, but it will become useful when name_rev()
> will stop to name commits that are not mentionned in the stdin buffer.
> If the current commit should not be named, its parents may have to be,
> but they may not.  In this case, name_rev() will tell to its caller that
> the current commit and its first parent has not used the name, and that
> it can be released.  However, if the current commit has been named but
> not its parent, or the reverse, the name will not be released.

Makes sense, and the patch looks mostly good to me, just one suggestion:

> Signed-off-by: Alban Gruin <alban.gruin@gmail.com>
> ---
>  builtin/name-rev.c | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index f1cb45c227..0719a9388d 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -77,7 +77,7 @@ static int is_better_name(struct rev_name *name,
>  	return 0;
>  }
>  
> -static void name_rev(struct commit *commit,
> +static int name_rev(struct commit *commit,
>  		const char *tip_name, timestamp_t taggerdate,
>  		int generation, int distance, int from_tag,
>  		int deref)
> @@ -86,11 +86,12 @@ static void name_rev(struct commit *commit,
>  	struct commit_list *parents;
>  	int parent_number = 1;
>  	char *to_free = NULL;
> +	int free_alloc = 1;
>  
>  	parse_commit(commit);
>  
>  	if (commit->date < cutoff)
> -		return;
> +		return 1;
>  
>  	if (deref) {
>  		tip_name = to_free = xstrfmt("%s^0", tip_name);
> @@ -111,9 +112,10 @@ static void name_rev(struct commit *commit,
>  		name->generation = generation;
>  		name->distance = distance;
>  		name->from_tag = from_tag;
> +		free_alloc = 0;
>  	} else {
>  		free(to_free);
> -		return;
> +		return 1;
>  	}
>  
>  	for (parents = commit->parents;
> @@ -131,15 +133,18 @@ static void name_rev(struct commit *commit,
>  				new_name = xstrfmt("%.*s^%d", (int)len, tip_name,
>  						   parent_number);
>  
> -			name_rev(parents->item, new_name, taggerdate, 0,
> -				 distance + MERGE_TRAVERSAL_WEIGHT,
> -				 from_tag, 0);
> +			if (name_rev(parents->item, new_name, taggerdate, 0,
> +				      distance + MERGE_TRAVERSAL_WEIGHT,
> +				      from_tag, 0))
> +				free(new_name);
>  		} else {
> -			name_rev(parents->item, tip_name, taggerdate,
> -				 generation + 1, distance + 1,
> -				 from_tag, 0);
> +			free_alloc &= name_rev(parents->item, tip_name, taggerdate,
> +					       generation + 1, distance + 1,
> +					       from_tag, 0);

This would be easier to read if it avoided the &=, e.g. by turning it
into:

		} else if (!name_rev(parents->item, tip_name, taggerdate,
				     generation + 1, distance + 1,
				     from_tag, 0))
			free_alloc = 0;

Ciao,
Dscho

>  		}
>  	}
> +
> +	return free_alloc;
>  }
>  
>  static int subpath_matches(const char *path, const char *filter)
> -- 
> 2.20.1
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit
  2019-03-01 17:50 ` [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit Alban Gruin
  2019-03-01 18:07   ` Eric Sunshine
@ 2019-03-03 19:33   ` Christian Couder
  2019-03-03 19:46     ` Christian Couder
  2019-03-03 20:27     ` Alban Gruin
  1 sibling, 2 replies; 18+ messages in thread
From: Christian Couder @ 2019-03-03 19:33 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Fri, Mar 1, 2019 at 7:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
>
> A ref may not be the descendant of all of the commits mentionned in
> stdin.  In this case, we want to avoid naming its parents.

Properly speaking a ref usually just points to a commit. It is not a
parent or a descendant of any commit.

Also if the commit, let's call it C, pointed to by a ref isn't a
descendant of a commit, let's call it C', mentioned on stdin, then C'
isn't a parent of C, though I think we want to avoid naming C' from
the ref.

I think it might be clearer with something like:

"The commit, let's call it C, pointed to by a ref may not be a
descendant of all the commits mentioned on stdin. In this case we want
to avoid naming the commits mentioned on stdin that are not parents of
C using the ref."

Or is there something I misunderstood?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit
  2019-03-03 19:33   ` Christian Couder
@ 2019-03-03 19:46     ` Christian Couder
  2019-03-03 20:27     ` Alban Gruin
  1 sibling, 0 replies; 18+ messages in thread
From: Christian Couder @ 2019-03-03 19:46 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Git Mailing List

On Sun, Mar 3, 2019 at 8:33 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Fri, Mar 1, 2019 at 7:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> >
> > A ref may not be the descendant of all of the commits mentionned in
> > stdin.  In this case, we want to avoid naming its parents.
>
> Properly speaking a ref usually just points to a commit. It is not a
> parent or a descendant of any commit.
>
> Also if the commit, let's call it C, pointed to by a ref isn't a
> descendant of a commit, let's call it C', mentioned on stdin, then C'
> isn't a parent of C, though I think we want to avoid naming C' from
> the ref.
>
> I think it might be clearer with something like:
>
> "The commit, let's call it C, pointed to by a ref may not be a
> descendant of all the commits mentioned on stdin. In this case we want
> to avoid naming the commits mentioned on stdin that are not parents of
> C using the ref."

By the way things might even be clearer by replacing "parent" with
"ancestor" in my message.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit
  2019-03-03 19:33   ` Christian Couder
  2019-03-03 19:46     ` Christian Couder
@ 2019-03-03 20:27     ` Alban Gruin
  1 sibling, 0 replies; 18+ messages in thread
From: Alban Gruin @ 2019-03-03 20:27 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git Mailing List

Hi Christian,

Le 03/03/2019 à 20:33, Christian Couder a écrit :
> On Fri, Mar 1, 2019 at 7:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
>>
>> A ref may not be the descendant of all of the commits mentionned in
>> stdin.  In this case, we want to avoid naming its parents.
> 
> Properly speaking a ref usually just points to a commit. It is not a
> parent or a descendant of any commit.
> 
> Also if the commit, let's call it C, pointed to by a ref isn't a
> descendant of a commit, let's call it C', mentioned on stdin, then C'
> isn't a parent of C, though I think we want to avoid naming C' from
> the ref.
> 
> I think it might be clearer with something like:
> 
> "The commit, let's call it C, pointed to by a ref may not be a
> descendant of all the commits mentioned on stdin. In this case we want
> to avoid naming the commits mentioned on stdin that are not parents of
> C using the ref."
> 
> Or is there something I misunderstood?
> 

It’s almost this.  We do want to name *all* of the commits mentionned in
stdin, as far as possible.  We also want to lower the amount of memory
allocation, and one of the possibilities is to avoid traversing the tree
from a commit pointed to by a ref if none of its parents are mentionned
in stdin.

Cheers,
Alban


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-03-03 20:28 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-01 17:50 [RFC PATCH 0/4] name-rev: improve memory usage Alban Gruin
2019-03-01 17:50 ` [RFC PATCH 1/4] name-rev: improve name_rev() " Alban Gruin
2019-03-01 18:00   ` Eric Sunshine
2019-03-01 18:44   ` Jeff King
2019-03-02 21:28   ` Johannes Schindelin
2019-03-01 17:50 ` [RFC PATCH 2/4] commit-list: add a function to check if a commit is in a list Alban Gruin
2019-03-01 17:50 ` [RFC PATCH 3/4] name-rev: check if a commit should be named before naming it Alban Gruin
2019-03-01 18:05   ` Eric Sunshine
2019-03-01 18:22     ` Alban Gruin
2019-03-01 18:37       ` Jeff King
2019-03-01 17:50 ` [RFC PATCH 4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit Alban Gruin
2019-03-01 18:07   ` Eric Sunshine
2019-03-03 19:33   ` Christian Couder
2019-03-03 19:46     ` Christian Couder
2019-03-03 20:27     ` Alban Gruin
2019-03-01 18:42 ` [RFC PATCH 0/4] name-rev: improve memory usage Jeff King
2019-03-01 19:14   ` Alban Gruin
2019-03-01 19:39     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).