git@vger.kernel.org mailing list mirror (one of many)
* [BUG] add_again() off-by-one error in custom format
@ 2017-06-12  3:13 Michael Giuffrida
  2017-06-12 22:49 ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Michael Giuffrida @ 2017-06-12  3:13 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, René Scharfe

In a custom pretty format, using the '+' or ' ' combinators to prefix
a non-empty expansion with whitespace will erroneously truncate
subsequent expansions of the same type.

Normally '%+X' inserts a newline before <X>, IFF the expansion of X is
non-empty:

    $ git log -n 1 --pretty='format:newline:%+an'
    newline:
    MyName

For the abbreviated commit hash placeholder ('h'), pretty.c uses
add_again() to cache the result of the expansion, and then uses that
result in future expansions. This causes problems when the expansion
includes whitespace:

    $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
    newline:
    a0b1c2d
    no_newline:
    a0b1c2

The second expansion of the hash added an unwanted newline and removed
the final character. It seemingly used the cached expansion *starting
from the inserted newline*.

The expected output is:

    $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
    newline:
    a0b1c2d
    no_newline:a0b1c2d


/CC René from commit b9c62321 and Junio from 9fa708d


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-12  3:13 [BUG] add_again() off-by-one error in custom format Michael Giuffrida
@ 2017-06-12 22:49 ` Junio C Hamano
  2017-06-13 18:09   ` René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2017-06-12 22:49 UTC (permalink / raw)
  To: Michael Giuffrida; +Cc: git, René Scharfe

Michael Giuffrida <michaelpg@chromium.org> writes:

> For the abbreviated commit hash placeholder ('h'), pretty.c uses
> add_again() to cache the result of the expansion, and then uses that
> result in future expansions. This causes problems when the expansion
> includes whitespace:
>
>     $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
>     newline:
>     a0b1c2d
>     no_newline:
>     a0b1c2
>
> The second expansion of the hash added an unwanted newline and removed
> the final character. It seemingly used the cached expansion *starting
> from the inserted newline*.
>
> The expected output is:
>
>     $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
>     newline:
>     a0b1c2d
>     no_newline:a0b1c2d

Nicely explained.  The add_again() mechanism caches an earlier
result by remembering the offset and the length in the strbuf the
formatted string is being accumulated to, but because %+x (and
probably %-x) magic placeholders futz with the result of
format_commit_one() in place, the cache goes out of sync, of course.
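
To make the off-by-one concrete, here is a minimal standalone sketch of
that mechanism, using a plain char buffer instead of a strbuf (the code
below is illustrative only, not actual git code):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char out[64] = "newline:";	/* output accumulated so far */
	const char *hash = "a0b1c2d";
	size_t len = strlen(out);

	/* add_again()-style cache: remember where %h is about to start */
	size_t cache_off = len;
	size_t cache_len;

	/* first %h: append the abbreviated hash ... */
	memcpy(out + len, hash, strlen(hash));
	len += strlen(hash);
	cache_len = len - cache_off;

	/* ... then the '+' magic inserts '\n' before it, in place */
	memmove(out + cache_off + 1, out + cache_off, len - cache_off);
	out[cache_off] = '\n';
	len++;

	/* second %h: replaying the stale cache copies "\na0b1c2" */
	printf("%.*s\n", (int)cache_len, out + cache_off);
	return 0;
}

The remembered offset now points at the inserted newline and the
remembered length stops one byte short of the hash, which is exactly
the corrupted second expansion shown in the report above.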

I think the call to format_commit_one() needs to be taught to pass
'magic' through, and then add_again() mechanism needs to be told not
to cache when magic is in effect, or something like that.

Perhaps something along this line...?

 pretty.c | 64 ++++++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 26 deletions(-)

diff --git a/pretty.c b/pretty.c
index 09701bd2ff..be22b12986 100644
--- a/pretty.c
+++ b/pretty.c
@@ -789,20 +789,22 @@ struct format_commit_context {
 	size_t wrap_start;
 };
 
-static int add_again(struct strbuf *sb, struct chunk *chunk)
+static int add_again(struct strbuf *sb, struct chunk *chunk, int no_cache)
 {
 	if (chunk->len) {
 		strbuf_adddup(sb, chunk->off, chunk->len);
 		return 1;
 	}
 
-	/*
-	 * We haven't seen this chunk before.  Our caller is surely
-	 * going to add it the hard way now.  Remember the most likely
-	 * start of the to-be-added chunk: the current end of the
-	 * struct strbuf.
-	 */
-	chunk->off = sb->len;
+	if (!no_cache) {
+		/*
+		 * We haven't seen this chunk before.  Our caller is surely
+		 * going to add it the hard way now.  Remember the most likely
+		 * start of the to-be-added chunk: the current end of the
+		 * struct strbuf.
+		 */
+		chunk->off = sb->len;
+	}
 	return 0;
 }
 
@@ -1066,15 +1068,24 @@ static size_t parse_padding_placeholder(struct strbuf *sb,
 	return 0;
 }
 
+enum format_commit_item_magic {
+	NO_MAGIC,
+	ADD_LF_BEFORE_NON_EMPTY,
+	DEL_LF_BEFORE_EMPTY,
+	ADD_SP_BEFORE_NON_EMPTY
+};
+
 static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 				const char *placeholder,
-				void *context)
+				void *context,
+				enum format_commit_item_magic magic)
 {
 	struct format_commit_context *c = context;
 	const struct commit *commit = c->commit;
 	const char *msg = c->message;
 	struct commit_list *p;
 	int ch;
+	int no_cache;
 
 	/* these are independent of the commit */
 	switch (placeholder[0]) {
@@ -1139,6 +1150,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 	if (!commit->object.parsed)
 		parse_object(&commit->object.oid);
 
+	no_cache = magic != NO_MAGIC;
+
 	switch (placeholder[0]) {
 	case 'H':		/* commit hash */
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
@@ -1147,24 +1160,26 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		return 1;
 	case 'h':		/* abbreviated commit hash */
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
-		if (add_again(sb, &c->abbrev_commit_hash)) {
+		if (add_again(sb, &c->abbrev_commit_hash, no_cache)) {
 			strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
 			return 1;
 		}
 		strbuf_add_unique_abbrev(sb, commit->object.oid.hash,
 					 c->pretty_ctx->abbrev);
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-		c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
+		if (!no_cache)
+			c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
 		return 1;
 	case 'T':		/* tree hash */
 		strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
 		return 1;
 	case 't':		/* abbreviated tree hash */
-		if (add_again(sb, &c->abbrev_tree_hash))
+		if (add_again(sb, &c->abbrev_tree_hash, no_cache))
 			return 1;
 		strbuf_add_unique_abbrev(sb, commit->tree->object.oid.hash,
 					 c->pretty_ctx->abbrev);
-		c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
+		if (!no_cache)
+			c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
 		return 1;
 	case 'P':		/* parent hashes */
 		for (p = commit->parents; p; p = p->next) {
@@ -1174,7 +1189,7 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		}
 		return 1;
 	case 'p':		/* abbreviated parent hashes */
-		if (add_again(sb, &c->abbrev_parent_hashes))
+		if (add_again(sb, &c->abbrev_parent_hashes, no_cache))
 			return 1;
 		for (p = commit->parents; p; p = p->next) {
 			if (p != commit->parents)
@@ -1182,8 +1197,9 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 			strbuf_add_unique_abbrev(sb, p->item->object.oid.hash,
 						 c->pretty_ctx->abbrev);
 		}
-		c->abbrev_parent_hashes.len = sb->len -
-		                              c->abbrev_parent_hashes.off;
+		if (!no_cache)
+			c->abbrev_parent_hashes.len = 
+				sb->len - c->abbrev_parent_hashes.off;
 		return 1;
 	case 'm':		/* left/right/bottom */
 		strbuf_addstr(sb, get_revision_mark(NULL, commit));
@@ -1314,7 +1330,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 
 static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
 				    const char *placeholder,
-				    struct format_commit_context *c)
+				    struct format_commit_context *c,
+				    enum format_commit_item_magic magic)
 {
 	struct strbuf local_sb = STRBUF_INIT;
 	int total_consumed = 0, len, padding = c->padding;
@@ -1329,7 +1346,7 @@ static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
 	}
 	while (1) {
 		int modifier = *placeholder == 'C';
-		int consumed = format_commit_one(&local_sb, placeholder, c);
+		int consumed = format_commit_one(&local_sb, placeholder, c, magic);
 		total_consumed += consumed;
 
 		if (!modifier)
@@ -1420,12 +1437,7 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 {
 	int consumed;
 	size_t orig_len;
-	enum {
-		NO_MAGIC,
-		ADD_LF_BEFORE_NON_EMPTY,
-		DEL_LF_BEFORE_EMPTY,
-		ADD_SP_BEFORE_NON_EMPTY
-	} magic = NO_MAGIC;
+	enum format_commit_item_magic magic = NO_MAGIC;
 
 	switch (placeholder[0]) {
 	case '-':
@@ -1445,9 +1457,9 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 
 	orig_len = sb->len;
 	if (((struct format_commit_context *)context)->flush_type != no_flush)
-		consumed = format_and_pad_commit(sb, placeholder, context);
+		consumed = format_and_pad_commit(sb, placeholder, context, magic);
 	else
-		consumed = format_commit_one(sb, placeholder, context);
+		consumed = format_commit_one(sb, placeholder, context, magic);
 	if (magic == NO_MAGIC)
 		return consumed;
 


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-12 22:49 ` Junio C Hamano
@ 2017-06-13 18:09   ` René Scharfe
  2017-06-13 18:29     ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-13 18:09 UTC (permalink / raw)
  To: Junio C Hamano, Michael Giuffrida; +Cc: git

On 13.06.2017 at 00:49, Junio C Hamano wrote:
> Michael Giuffrida <michaelpg@chromium.org> writes:
> 
>> For the abbreviated commit hash placeholder ('h'), pretty.c uses
>> add_again() to cache the result of the expansion, and then uses that
>> result in future expansions. This causes problems when the expansion
>> includes whitespace:
>>
>>      $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
>>      newline:
>>      a0b1c2d
>>      no_newline:
>>      a0b1c2
>>
>> The second expansion of the hash added an unwanted newline and removed
>> the final character. It seemingly used the cached expansion *starting
>> from the inserted newline*.
>>
>> The expected output is:
>>
>>      $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
>>      newline:
>>      a0b1c2d
>>      no_newline:a0b1c2d
> 
> Nicely explained.  The add_again() mechanism caches an earlier
> result by remembering the offset and the length in the strbuf the
> formatted string is being accumulated to, but because %+x (and
> probably %-x) magic placeholders futz with the result of
> format_commit_one() in place, the cache goes out of sync, of course.

Indeed, a very nice bug report!

> I think the call to format_commit_one() needs to be taught to pass
> 'magic' through, and then add_again() mechanism needs to be told not
> to cache when magic is in effect, or something like that.
> 
> Perhaps something along this line...?
> 
>   pretty.c | 64 ++++++++++++++++++++++++++++++++++++++--------------------------
>   1 file changed, 38 insertions(+), 26 deletions(-)

That looks quite big.  You even invent a way to classify magic. Does the
caching feature justify the added complexity?  Alternatives:

- Don't cache anymore, now that we have placeholders that change output
   on top of the original append-only ones.  Yields a net code reduction.
   Duplicated %h, %t and %p will have to be resolved at each occurrence.

- Provide some space for the cache instead of reusing the output, like
   a strbuf for each placeholder.  Adds a memory allocation to each
   first occurrence of %h, %t and %p.  Such a cache could be reused for
   multiple commits by truncating it instead of releasing; not sure how
   to pass it in nicely for it to survive a sequence of calls, though
   (a rough sketch of this alternative follows below).
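
Assuming a dedicated strbuf were added to struct format_commit_context
for this (the field name below is invented, the code untested, and the
field would need to be set up with STRBUF_INIT), the %h case could
become roughly:

	case 'h':		/* abbreviated commit hash */
		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
		if (!c->abbrev_commit_hash_buf.len)	/* hypothetical field */
			strbuf_add_unique_abbrev(&c->abbrev_commit_hash_buf,
						 commit->object.oid.hash,
						 c->pretty_ctx->abbrev);
		strbuf_addbuf(sb, &c->abbrev_commit_hash_buf);
		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
		return 1;

Such a cache would survive the '%+' and padding rewrites because it no
longer lives inside the output buffer; reuse across commits would be a
strbuf_reset() instead of a release.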

René


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 18:09   ` René Scharfe
@ 2017-06-13 18:29     ` Junio C Hamano
  2017-06-13 20:29       ` René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2017-06-13 18:29 UTC (permalink / raw)
  To: René Scharfe; +Cc: Michael Giuffrida, git

René Scharfe <l.s.r@web.de> writes:

> Indeed, a very nice bug report!
>
>> I think the call to format_commit_one() needs to be taught to pass
>> 'magic' through, and then add_again() mechanism needs to be told not
>> to cache when magic is in effect, or something like that.
>>
>> Perhaps something along this line...?
>>
>>   pretty.c | 64 ++++++++++++++++++++++++++++++++++++++--------------------------
>>   1 file changed, 38 insertions(+), 26 deletions(-)
>
> That looks quite big.  You even invent a way to classify magic. 

Hmph, by "a way to classify", do you mean the enum?  That thing was
there from before, just moved.

Also I think we do not have to change add_again() at all, because
chunk->off tolerates being given a garbage value as long as
chunk->len stays 0, and add_again() does not change chunk->len at
all.

Which cuts the diffstat down to

 pretty.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)
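
For illustration, the %h arm in that smaller version could look roughly
like this (a sketch derived from the patch quoted above, not the actual
reduced patch, which is not included here):

	case 'h':		/* abbreviated commit hash */
		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
		if (add_again(sb, &c->abbrev_commit_hash)) {
			strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
			return 1;
		}
		strbuf_add_unique_abbrev(sb, commit->object.oid.hash,
					 c->pretty_ctx->abbrev);
		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
		if (magic == NO_MAGIC)	/* don't activate the cache when magic rewrites the output */
			c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
		return 1;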

> Does the caching feature justify the added complexity?

That's a good question.  I thought about your second alternative
while adding the "don't cache" thing, but if we can measure and find
out that we are not gaining much from caching, certainly that sounds
much simpler.

>
> - Don't cache anymore, now that we have placeholders that change output
>   on top of the original append-only ones.  Yields a net code reduction.
>   Duplicated %h, %t and %p will have to be resolved at each occurrence.
>
> - Provide some space for the cache instead of reusing the output, like
>   a strbuf for each placeholder.  Adds a memory allocation to each
>   first occurrence of %h, %t and %p.  Such a cache could be reused for
>   multiple commits by truncating it instead of releasing; not sure how
>   to pass it in nicely for it to survive a sequence of calls, though.
>
> René


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 18:29     ` Junio C Hamano
@ 2017-06-13 20:29       ` René Scharfe
  2017-06-13 21:20         ` Junio C Hamano
  2017-06-13 22:24         ` SZEDER Gábor
  0 siblings, 2 replies; 30+ messages in thread
From: René Scharfe @ 2017-06-13 20:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Giuffrida, git

On 13.06.2017 at 20:29, Junio C Hamano wrote:
> René Scharfe <l.s.r@web.de> writes:
> 
>> Indeed, a very nice bug report!
>>
>>> I think the call to format_commit_one() needs to be taught to pass
>>> 'magic' through, and then add_again() mechanism needs to be told not
>>> to cache when magic is in effect, or something like that.
>>>
>>> Perhaps something along this line...?
>>>
>>>    pretty.c | 64 ++++++++++++++++++++++++++++++++++++++--------------------------
>>>    1 file changed, 38 insertions(+), 26 deletions(-)
>>
>> That looks quite big.  You even invent a way to classify magic.
> 
> Hmph, by "a way to classify", do you mean the enum?  That thing was
> there from before, just moved.

Oh, yes, sorry.  I didn't even get that far into the patch.  (I'd
better go to bed after hitting send...)

> Also I think we do not have to change add_again() at all, because
> chunk->off tolerates being given a garbage value as long as
> chunk->len stays 0, and add_again() does not change chunk->len at
> all.
> 
> Which cuts the diffstat down to
> 
>   pretty.c | 40 +++++++++++++++++++++++++---------------
>   1 file changed, 25 insertions(+), 15 deletions(-)
> 
>> Does the caching feature justify the added complexity?
> 
> That's a good question.  I thought about your second alternative
> while adding the "don't cache" thing, but if we can measure and find
> out that we are not gaining much from caching, certainly that sounds
> much simpler.

The difference is about the same as the one between:

	$ time git log --format="" >/dev/null

	real    0m0.463s
	user    0m0.448s
	sys     0m0.012s

and:

	$ time git log --format="%h" >/dev/null

	real    0m1.062s
	user    0m0.636s
	sys     0m0.416s

With caching duplicates are basically free and without it short
hashes have to be looked up again.  Other placeholders may reduce
the relative slowdown, depending on how expensive they are.

Forgot a third option, probably because it's not a particularly good
idea: Replacing the caching in pretty.c with a short static cache in
find_unique_abbrev_r().

René


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 20:29       ` René Scharfe
@ 2017-06-13 21:20         ` Junio C Hamano
  2017-06-14 18:24           ` René Scharfe
  2017-06-13 22:24         ` SZEDER Gábor
  1 sibling, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2017-06-13 21:20 UTC (permalink / raw)
  To: René Scharfe; +Cc: Michael Giuffrida, git

René Scharfe <l.s.r@web.de> writes:

> The difference is about the same as the one between:
>
> 	$ time git log --format="" >/dev/null
>
> 	real    0m0.463s
> 	user    0m0.448s
> 	sys     0m0.012s
>
> and:
>
> 	$ time git log --format="%h" >/dev/null
>
> 	real    0m1.062s
> 	user    0m0.636s
> 	sys     0m0.416s
>
> With caching duplicates are basically free and without it short
> hashes have to be looked up again.  Other placeholders may reduce
> the relative slowdown, depending on how expensive they are.

I think the real question is how likely people use more than one
occurrence of the same thing in their custom format, and how deeply
they care that --format='%h %h' costs more than --format='%h'.  The
cost won't of course be double (because the main traversal costs
without any output), but it would be rather unreasonable to expect
--format='%h %h %h %h %h' to cost the same as --format='%h';
after all, Git is doing more for them ;-)

So in that sense, I am actually OK if we decide to remove the caching.

> Forgot a third option, probably because it's not a particularly good
> idea: Replacing the caching in pretty.c with a short static cache in
> find_unique_abbrev_r().

Indeed.


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 20:29       ` René Scharfe
  2017-06-13 21:20         ` Junio C Hamano
@ 2017-06-13 22:24         ` SZEDER Gábor
  2017-06-14 17:34           ` René Scharfe
  1 sibling, 1 reply; 30+ messages in thread
From: SZEDER Gábor @ 2017-06-13 22:24 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

[sorry for double post, forgot the mailing list...]

To throw in a fourth option, this one adjusts the expansions' cached
offsets when the magic makes it necessary.  It's not necessary for
'%-', because it only makes a difference when the expansion is empty,
and in that case

  - add_again() doesn't consider it cached,
  - and even if it did, the offset of a zero-length string wouldn't
    matter.

 pretty.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/pretty.c b/pretty.c
index d0f86f5d8..69c4f2a21 100644
--- a/pretty.c
+++ b/pretty.c
@@ -1066,9 +1066,17 @@ static size_t parse_padding_placeholder(struct strbuf *sb,
 	return 0;
 }
 
+enum format_commit_item_magic {
+	NO_MAGIC,
+	ADD_LF_BEFORE_NON_EMPTY,
+	DEL_LF_BEFORE_EMPTY,
+	ADD_SP_BEFORE_NON_EMPTY
+};
+
 static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 				const char *placeholder,
-				void *context)
+				void *context,
+				enum format_commit_item_magic magic)
 {
 	struct format_commit_context *c = context;
 	const struct commit *commit = c->commit;
@@ -1155,6 +1163,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 					 c->pretty_ctx->abbrev);
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
 		c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
+		if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+			c->abbrev_commit_hash.off++;
 		return 1;
 	case 'T':		/* tree hash */
 		strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
@@ -1165,6 +1175,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		strbuf_add_unique_abbrev(sb, commit->tree->object.oid.hash,
 					 c->pretty_ctx->abbrev);
 		c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
+		if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+			c->abbrev_tree_hash.off++;
 		return 1;
 	case 'P':		/* parent hashes */
 		for (p = commit->parents; p; p = p->next) {
@@ -1184,6 +1196,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		}
 		c->abbrev_parent_hashes.len = sb->len -
 		                              c->abbrev_parent_hashes.off;
+		if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+			c->abbrev_parent_hashes.off++;
 		return 1;
 	case 'm':		/* left/right/bottom */
 		strbuf_addstr(sb, get_revision_mark(NULL, commit));
@@ -1314,7 +1328,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 
 static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
 				    const char *placeholder,
-				    struct format_commit_context *c)
+				    struct format_commit_context *c,
+				    enum format_commit_item_magic magic)
 {
 	struct strbuf local_sb = STRBUF_INIT;
 	int total_consumed = 0, len, padding = c->padding;
@@ -1329,7 +1344,7 @@ static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
 	}
 	while (1) {
 		int modifier = *placeholder == 'C';
-		int consumed = format_commit_one(&local_sb, placeholder, c);
+		int consumed = format_commit_one(&local_sb, placeholder, c, magic);
 		total_consumed += consumed;
 
 		if (!modifier)
@@ -1420,12 +1435,7 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 {
 	int consumed;
 	size_t orig_len;
-	enum {
-		NO_MAGIC,
-		ADD_LF_BEFORE_NON_EMPTY,
-		DEL_LF_BEFORE_EMPTY,
-		ADD_SP_BEFORE_NON_EMPTY
-	} magic = NO_MAGIC;
+	enum format_commit_item_magic magic = NO_MAGIC;
 
 	switch (placeholder[0]) {
 	case '-':
@@ -1445,9 +1455,9 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 
 	orig_len = sb->len;
 	if (((struct format_commit_context *)context)->flush_type != no_flush)
-		consumed = format_and_pad_commit(sb, placeholder, context);
+		consumed = format_and_pad_commit(sb, placeholder, context, magic);
 	else
-		consumed = format_commit_one(sb, placeholder, context);
+		consumed = format_commit_one(sb, placeholder, context, magic);
 	if (magic == NO_MAGIC)
 		return consumed;
 
-- 
2.13.1.438.gc638325b1



* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 22:24         ` SZEDER Gábor
@ 2017-06-14 17:34           ` René Scharfe
  0 siblings, 0 replies; 30+ messages in thread
From: René Scharfe @ 2017-06-14 17:34 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Junio C Hamano, Michael Giuffrida, git

On 14.06.2017 at 00:24, SZEDER Gábor wrote:
> [sorry for double post, forgot the mailing list...]
> 
> To throw in a fourth option, this one adjusts the expansions' cached
> offsets when the magic makes it necessary.  It's not necessary for
> '%-', because it only makes a difference when the expansion is empty,
> and in that case
> 
>    - add_again() doesn't consider it cached,
>    - and even if it did, the offset of a zero-length string wouldn't
>      matter.
> 
>   pretty.c | 32 +++++++++++++++++++++-----------
>   1 file changed, 21 insertions(+), 11 deletions(-)

There are other modifiers, e.g. try the format '%h%>(20)%h' -- its
output is obviously wrong (contains control characters), even with your
patch.  We'd have to update the offsets for each placeholder that
changes the output of others.  I don't think that approach scales, alas.

René


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-13 21:20         ` Junio C Hamano
@ 2017-06-14 18:24           ` René Scharfe
  2017-06-15  5:56             ` Jeff King
  2017-06-15 18:37             ` [BUG] add_again() off-by-one error in custom format Junio C Hamano
  0 siblings, 2 replies; 30+ messages in thread
From: René Scharfe @ 2017-06-14 18:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Giuffrida, git, Jeff King, SZEDER Gábor

On 13.06.2017 at 23:20, Junio C Hamano wrote:
> René Scharfe <l.s.r@web.de> writes:
> 
>> The difference is about the same as the one between:
>>
>> 	$ time git log --format="" >/dev/null
>>
>> 	real    0m0.463s
>> 	user    0m0.448s
>> 	sys     0m0.012s
>>
>> and:
>>
>> 	$ time git log --format="%h" >/dev/null
>>
>> 	real    0m1.062s
>> 	user    0m0.636s
>> 	sys     0m0.416s
>>
>> With caching duplicates are basically free and without it short
>> hashes have to be looked up again.  Other placeholders may reduce
>> the relative slowdown, depending on how expensive they are.
> 
> I think the real question is how likely people use more than one
> occurrence of the same thing in their custom format, and how deeply
> they care that --format='%h %h' costs more than --format='%h'.  The
> cost won't of course be double (because the main traversal costs
> without any output), but it would be rather unreasonable to expect
> --format='%h %h %h %h %h' to cost the same as --format='%h';
> after all, Git is doing more for them ;-)

The answer to the first half is obviously "very likely" -- otherwise
this bug wouldn't have been found, right? :)

Regarding the question of how bad a 50% slowdown for a second %h
would be: No idea.  If run interactively it may not even be noticeable
because the user can read the first few lines in less while the rest
is prepared in the background.  We don't have a perf test for formats
with duplicate short hashes, so we don't promise anything, right? :)

René

-- >8 --
Subject: [PATCH] pretty: recalculate duplicate short hashes

b9c6232138 (--format=pretty: avoid calculating expensive expansions
twice) optimized adding short hashes multiple times by using the
fact that the output strbuf was only ever simply appended to and
copying the added string from the previous run.  That prerequisite
no longer holds; we now have modifiers like %< and %+ that can
cause the cache to lose track of the correct offsets.  Remove it.

Reported-by: Michael Giuffrida <michaelpg@chromium.org>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
I'm sending this out in the hope that there might be a simple way
to fix it after all, like Gábor's patch does for %+.  %< and %>
seem to be the only other problematic modifiers for now -- I'm
actually surprised that %w seems to be OK.

 pretty.c | 32 --------------------------------
 strbuf.c |  7 -------
 strbuf.h |  6 ------
 3 files changed, 45 deletions(-)

diff --git a/pretty.c b/pretty.c
index 09701bd2ff..cc099dfdd1 100644
--- a/pretty.c
+++ b/pretty.c
@@ -783,29 +783,9 @@ struct format_commit_context {
 	size_t body_off;
 
 	/* The following ones are relative to the result struct strbuf. */
-	struct chunk abbrev_commit_hash;
-	struct chunk abbrev_tree_hash;
-	struct chunk abbrev_parent_hashes;
 	size_t wrap_start;
 };
 
-static int add_again(struct strbuf *sb, struct chunk *chunk)
-{
-	if (chunk->len) {
-		strbuf_adddup(sb, chunk->off, chunk->len);
-		return 1;
-	}
-
-	/*
-	 * We haven't seen this chunk before.  Our caller is surely
-	 * going to add it the hard way now.  Remember the most likely
-	 * start of the to-be-added chunk: the current end of the
-	 * struct strbuf.
-	 */
-	chunk->off = sb->len;
-	return 0;
-}
-
 static void parse_commit_header(struct format_commit_context *context)
 {
 	const char *msg = context->message;
@@ -1147,24 +1127,16 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		return 1;
 	case 'h':		/* abbreviated commit hash */
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
-		if (add_again(sb, &c->abbrev_commit_hash)) {
-			strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-			return 1;
-		}
 		strbuf_add_unique_abbrev(sb, commit->object.oid.hash,
 					 c->pretty_ctx->abbrev);
 		strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-		c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
 		return 1;
 	case 'T':		/* tree hash */
 		strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
 		return 1;
 	case 't':		/* abbreviated tree hash */
-		if (add_again(sb, &c->abbrev_tree_hash))
-			return 1;
 		strbuf_add_unique_abbrev(sb, commit->tree->object.oid.hash,
 					 c->pretty_ctx->abbrev);
-		c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
 		return 1;
 	case 'P':		/* parent hashes */
 		for (p = commit->parents; p; p = p->next) {
@@ -1174,16 +1146,12 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 		}
 		return 1;
 	case 'p':		/* abbreviated parent hashes */
-		if (add_again(sb, &c->abbrev_parent_hashes))
-			return 1;
 		for (p = commit->parents; p; p = p->next) {
 			if (p != commit->parents)
 				strbuf_addch(sb, ' ');
 			strbuf_add_unique_abbrev(sb, p->item->object.oid.hash,
 						 c->pretty_ctx->abbrev);
 		}
-		c->abbrev_parent_hashes.len = sb->len -
-		                              c->abbrev_parent_hashes.off;
 		return 1;
 	case 'm':		/* left/right/bottom */
 		strbuf_addstr(sb, get_revision_mark(NULL, commit));
diff --git a/strbuf.c b/strbuf.c
index 00457940cf..9103bc75e4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -204,13 +204,6 @@ void strbuf_addbuf(struct strbuf *sb, const struct strbuf *sb2)
 	strbuf_setlen(sb, sb->len + sb2->len);
 }
 
-void strbuf_adddup(struct strbuf *sb, size_t pos, size_t len)
-{
-	strbuf_grow(sb, len);
-	memcpy(sb->buf + sb->len, sb->buf + pos, len);
-	strbuf_setlen(sb, sb->len + len);
-}
-
 void strbuf_addchars(struct strbuf *sb, int c, size_t n)
 {
 	strbuf_grow(sb, n);
diff --git a/strbuf.h b/strbuf.h
index 80047b1bb7..d785258649 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -264,12 +264,6 @@ static inline void strbuf_addstr(struct strbuf *sb, const char *s)
 extern void strbuf_addbuf(struct strbuf *sb, const struct strbuf *sb2);
 
 /**
- * Copy part of the buffer from a given position till a given length to the
- * end of the buffer.
- */
-extern void strbuf_adddup(struct strbuf *sb, size_t pos, size_t len);
-
-/**
  * This function can be used to expand a format string containing
  * placeholders. To that end, it parses the string and calls the specified
  * function for every percent sign found.
-- 
2.13.0



* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-14 18:24           ` René Scharfe
@ 2017-06-15  5:56             ` Jeff King
  2017-06-15 11:33               ` René Scharfe
  2017-06-15 18:37             ` [BUG] add_again() off-by-one error in custom format Junio C Hamano
  1 sibling, 1 reply; 30+ messages in thread
From: Jeff King @ 2017-06-15  5:56 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Wed, Jun 14, 2017 at 08:24:25PM +0200, René Scharfe wrote:

> > I think the real question is how likely people use more than one
> > occurrence of the same thing in their custom format, and how deeply
> > they care that --format='%h %h' costs more than --format='%h'.  The
> > cost won't of course be double (because the main traversal costs
> > without any output), but it would be rather unreasonable to expect
> > --format='%h %h %h %h %h' to cost the same as --format='%h';
> > after all, Git is doing more for them ;-)
> 
> The answer to the first half is obviously "very likely" -- otherwise
> this bug wouldn't have been found, right? :)
> 
> Regarding the question of how bad a 50% slowdown for a second %h
> would be: No idea.  If run interactively it may not even be noticeable
> because the user can read the first few lines in less while the rest
> is prepared in the background.  We don't have a perf test for formats
> with duplicate short hashes, so we don't promise anything, right? :)

One interesting thing is that the cost of finding short hashes very much
depends on your loose object setup. I timed:

  git log --format=%H >/dev/null

versus

  git log --format=%h >/dev/null

on git.git. It went from about 400ms to about 800ms. But then I noticed
I had a lot of loose object directories, and ran "git gc --prune=now".
Afterwards, my timings were more like 380ms and 460ms.

The difference is that in the "before" case, we actually opened each
directory and ran getdents(). But after gc, the directories are gone
totally and open() fails. We also have to do a linear walk through the
objects in each directory, since the contents are sorted.

So I wonder if it is worth trying to optimize the short-sha1 computation
in the first place. Double-%h aside, that would make _everything_
faster, including --oneline.

I'm not really sure how, though, short of caching the directory
contents. That opens up questions of whether and when to invalidate the
cache. If the cache were _just_ about short hashes, it might be OK to
just assume that it remains valid through the length of the program (so
worst case, a simultaneous write might mean that we generate a sha1
which just became ambiguous, but that's generally going to be racy
anyway).

The other downside of course is that we'd spend RAM on it. We could
bound the size of the cache, I suppose.

-Peff


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-15  5:56             ` Jeff King
@ 2017-06-15 11:33               ` René Scharfe
  2017-06-15 13:25                 ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-15 11:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 15.06.2017 at 07:56, Jeff King wrote:
> One interesting thing is that the cost of finding short hashes very much
> depends on your loose object setup. I timed:
> 
>    git log --format=%H >/dev/null
> 
> versus
> 
>    git log --format=%h >/dev/null
> 
> on git.git. It went from about 400ms to about 800ms. But then I noticed
> I had a lot of loose object directories, and ran "git gc --prune=now".
> Afterwards, my timings were more like 380ms and 460ms.
> 
> The difference is that in the "before" case, we actually opened each
> directory and ran getdents(). But after gc, the directories are gone
> totally and open() fails. We also have to do a linear walk through the
> objects in each directory, since the contents are sorted.

Do you mean "unsorted"?

> So I wonder if it is worth trying to optimize the short-sha1 computation
> in the first place. Double-%h aside, that would make _everything_
> faster, including --oneline.

Right.

> I'm not really sure how, though, short of caching the directory
> contents. That opens up questions of whether and when to invalidate the
> cache. If the cache were _just_ about short hashes, it might be OK to
> just assume that it remains valid through the length of the program (so
> worst case, a simultaneous write might mean that we generate a sha1
> which just became ambiguous, but that's generally going to be racy
> anyway).
> 
> The other downside of course is that we'd spend RAM on it. We could
> bound the size of the cache, I suppose.

You mean like an in-memory pack index for loose objects?  In order to
avoid the readdir call in sha1_name.c::find_short_object_filename()?
We'd only need to keep the hashes of found objects.  An oid_array
would be quite compact.

Non-racy writes inside a process should not be ignored (write, read
later) -- e.g. checkout does something like that.

Can we trust object directory time stamps for cache invalidation?

René


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-15 11:33               ` René Scharfe
@ 2017-06-15 13:25                 ` Jeff King
  2017-06-18 10:58                   ` René Scharfe
  2017-06-18 10:58                   ` René Scharfe
  0 siblings, 2 replies; 30+ messages in thread
From: Jeff King @ 2017-06-15 13:25 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:

> > The difference is that in the "before" case, we actually opened each
> > directory and ran getdents(). But after gc, the directories are gone
> > totally and open() fails. We also have to do a linear walk through the
> > objects in each directory, since the contents are sorted.
> 
> Do you mean "unsorted"?

Er yeah, sorry, I meant to say "aren't".

> > I'm not really sure how, though, short of caching the directory
> > contents. That opens up questions of whether and when to invalidate the
> > cache. If the cache were _just_ about short hashes, it might be OK to
> > just assume that it remains valid through the length of the program (so
> > worst case, a simultaneous write might mean that we generate a sha1
> > which just became ambiguous, but that's generally going to be racy
> > anyway).
> > 
> > The other downside of course is that we'd spend RAM on it. We could
> > bound the size of the cache, I suppose.
> 
> You mean like an in-memory pack index for loose objects?  In order to
> avoid the readdir call in sha1_name.c::find_short_object_filename()?
> We'd only need to keep the hashes of found objects.  An oid_array
> would be quite compact.

Yes, that's what I was thinking of.

> Non-racy writes inside a process should not be ignored (write, read
> later) -- e.g. checkout does something like that.

Ideally, yes. Though thinking on this more, it's racy even today,
because we rely on the in-memory packed_git list. We usually re-check the
directory only when we look for an object and it's missing. So any new
packs which have been added while the process runs will be missed when
doing short-sha1 lookups (and actually, find_short_packed_object does
not even seem to do the usual retry-on-miss).

A normal process like "checkout" isn't writing new packs, but a
simultaneous repack could be migrating its new objects to a pack behind
its back. (It also _can_ write packs, but only for large blobs).

Given that we are talking about finding "unique" abbreviations here, and
that those abbreviations can become invalidated immediately anyway as
more objects are added (even by the same process), I'm not sure we need
to strive for absolute accuracy.

> Can we trust object directory time stamps for cache invalidation?

I think it would work on POSIX-ish systems, since loose object changes
always involve new files, not modifications of existing ones. I don't
know if there are systems that don't update directory mtimes even for
newly added files.

That would still be a stat() per request. I'd be curious how the timing
compares to the opendir/readdir that happens now.

-Peff


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-14 18:24           ` René Scharfe
  2017-06-15  5:56             ` Jeff King
@ 2017-06-15 18:37             ` Junio C Hamano
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2017-06-15 18:37 UTC (permalink / raw)
  To: René Scharfe; +Cc: Michael Giuffrida, git, Jeff King, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

> On 13.06.2017 at 23:20, Junio C Hamano wrote:
>
>> I think the real question is how likely people use more than one
>> occurrence of the same thing in their custom format, and how deeply
>> they care that --format='%h %h' costs more than --format='%h'.  The
>> cost won't of course be double (because the main traversal costs
>> without any output), but it would be rather unreasonable to expect
>> --format='%h %h %h %h %h' to cost the same as --format='%h';
>> after all, Git is doing more for them ;-)
>
> The answer to the first half is obviously "very likely" -- otherwise
> this bug wouldn't have been found, right? :)

Not really.  There was only one (this one) after all these years.
The question we are asking is not "is this used so rarely that we
can afford to leave it broken?"  It is "is this used so rarely that
we can afford not to optimize for that rare use case?".

> Regarding the question of how bad a 50% slowdown for a second %h
> would be: No idea.  If run interactively it may not even be noticeable
> because the user can read the first few lines in less while the rest
> is prepared in the background.  We don't have a perf test for formats
> with duplicate short hashes, so we don't promise anything, right? :)

OK.

> -- >8 --
> Subject: [PATCH] pretty: recalculate duplicate short hashes
>
> b9c6232138 (--format=pretty: avoid calculating expensive expansions
> twice) optimized adding short hashes multiple times by using the
> fact that the output strbuf was only ever simply appended to and
> copying the added string from the previous run.  That prerequisite
> no longer holds; we now have modifiers like %< and %+ that can
> cause the cache to lose track of the correct offsets.  Remove it.
>
> Reported-by: Michael Giuffrida <michaelpg@chromium.org>
> Signed-off-by: Rene Scharfe <l.s.r@web.de>
> ---
> I'm sending this out in the hope that there might be a simple way
> to fix it after all, like Gábor's patch does for %+.  %< and %>
> seem to be the only other problematic modifiers for now -- I'm
> actually surprised that %w seems to be OK.

Thanks, this looks like a sensible first step.  Will queue.


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-15 13:25                 ` Jeff King
@ 2017-06-18 10:58                   ` René Scharfe
  2017-06-18 11:49                     ` Jeff King
  2017-06-18 10:58                   ` René Scharfe
  1 sibling, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-18 10:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 15.06.2017 at 15:25, Jeff King wrote:
> On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:
>>> I'm not really sure how, though, short of caching the directory
>>> contents. That opens up questions of whether and when to invalidate the
>>> cache. If the cache were _just_ about short hashes, it might be OK to
>>> just assume that it remains valid through the length of the program (so
>>> worst case, a simultaneous write might mean that we generate a sha1
>>> which just became ambiguous, but that's generally going to be racy
>>> anyway).
>>>
>>> The other downside of course is that we'd spend RAM on it. We could
>>> bound the size of the cache, I suppose.
>>
>> You mean like an in-memory pack index for loose objects?  In order to
>> avoid the readdir call in sha1_name.c::find_short_object_filename()?
>> We'd only need to keep the hashes of found objects.  An oid_array
>> would be quite compact.
> 
> Yes, that's what I was thinking of.

An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
a factor of 1.5 from alloc_nr).  How many loose objects do we have to
expect?  30 MB for a million of them sounds not too bad, but can there
realistically be orders of magnitude more?

>> Non-racy writes inside a process should not be ignored (write, read
>> later) -- e.g. checkout does something like that.
> 
> Ideally, yes. Though thinking on this more, it's racy even today,
> because we rely on the in-memory packed_git list. We usually re-check the
> directory only when we look for an object and it's missing. So any new
> packs which have been added while the process runs will be missed when
> doing short-sha1 lookups (and actually, find_short_packed_object does
> not even seem to do the usual retry-on-miss).
> 
> A normal process like "checkout" isn't writing new packs, but a
> simultaneous repack could be migrating its new objects to a pack behind
> its back. (It also _can_ write packs, but only for large blobs).
> 
> Given that we are talking about finding "unique" abbreviations here, and
> that those abbreviations can become invalidated immediately anyway as
> more objects are added (even by the same process), I'm not sure we need
> to strive for absolute accuracy.

Agreed.  And processes that add objects and use them later probably use
full hashes (at least checkout does).

So here's a patch for caching loose object hashes in an oid_array
without a way to invalidate or release the cache.  It uses a single
oid_array for simplicity, but reads each subdirectory individually and
remembers that fact using a bitmap.
---
 cache.h     |  4 ++++
 sha1_name.c | 69 +++++++++++++++++++++++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e8..8c6748907b 100644
--- a/cache.h
+++ b/cache.h
@@ -11,6 +11,7 @@
 #include "string-list.h"
 #include "pack-revindex.h"
 #include "hash.h"
+#include "sha1-array.h"
 
 #ifndef platform_SHA_CTX
 /*
@@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
 	struct strbuf scratch;
 	size_t base_len;
 
+	uint32_t loose_objects_subdir_bitmap[8];
+	struct oid_array loose_objects_cache;
+
 	char path[FLEX_ARRAY];
 } *alt_odb_list;
 extern void prepare_alt_odb(void);
diff --git a/sha1_name.c b/sha1_name.c
index 5126853bb5..ae6a5c3abe 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,10 +77,46 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
 	/* otherwise, current can be discarded and candidate is still good */
 }
 
+static void read_loose_object_subdir(struct alternate_object_database *alt,
+				     int subdir_nr)
+{
+	struct strbuf *buf;
+	char hex[GIT_MAX_HEXSZ];
+	DIR *dir;
+	struct dirent *de;
+	size_t bitmap_index = subdir_nr / 32;
+	uint32_t bitmap_mask = 1 << (subdir_nr % 32);
+
+	if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask)
+		return;
+	alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+
+	buf = alt_scratch_buf(alt);
+	strbuf_addf(buf, "%02x/", subdir_nr);
+	xsnprintf(hex, sizeof(hex), "%02x", subdir_nr);
+
+	dir = opendir(buf->buf);
+	if (!dir)
+		return;
+
+	while ((de = readdir(dir)) != NULL) {
+		struct object_id oid;
+
+		if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
+			continue;
+		memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
+		if (!get_oid_hex(hex, &oid))
+			oid_array_append(&alt->loose_objects_cache, &oid);
+	}
+	closedir(dir);
+}
+
+static int match_sha(unsigned, const unsigned char *, const unsigned char *);
+
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
+	int subdir_nr = ds->bin_pfx.hash[0];
 	struct alternate_object_database *alt;
-	char hex[GIT_MAX_HEXSZ];
 	static struct alternate_object_database *fakeent;
 
 	if (!fakeent) {
@@ -95,29 +131,22 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 	}
 	fakeent->next = alt_odb_list;
 
-	xsnprintf(hex, sizeof(hex), "%.2s", ds->hex_pfx);
 	for (alt = fakeent; alt && !ds->ambiguous; alt = alt->next) {
-		struct strbuf *buf = alt_scratch_buf(alt);
-		struct dirent *de;
-		DIR *dir;
-
-		strbuf_addf(buf, "%.2s/", ds->hex_pfx);
-		dir = opendir(buf->buf);
-		if (!dir)
-			continue;
+		int pos;
 
-		while (!ds->ambiguous && (de = readdir(dir)) != NULL) {
-			struct object_id oid;
+		read_loose_object_subdir(alt, subdir_nr);
 
-			if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
-				continue;
-			if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
-				continue;
-			memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
-			if (!get_oid_hex(hex, &oid))
-				update_candidates(ds, &oid);
+		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
+		if (pos < 0)
+			pos = -1 - pos;
+		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
+			const struct object_id *oid;
+			oid = alt->loose_objects_cache.oid + pos;
+			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
+				break;
+			update_candidates(ds, oid);
+			pos++;
 		}
-		closedir(dir);
 	}
 }
 
-- 
2.13.0


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-15 13:25                 ` Jeff King
  2017-06-18 10:58                   ` René Scharfe
@ 2017-06-18 10:58                   ` René Scharfe
  2017-06-18 11:50                     ` Jeff King
  1 sibling, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-18 10:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 15.06.2017 at 15:25, Jeff King wrote:
> On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:
>> Can we trust object directory time stamps for cache invalidation?
> 
> I think it would work on POSIX-ish systems, since loose object changes
> always involve new files, not modifications of existing ones. I don't
> know if there are systems that don't update directory mtimes even for
> newly added files.
> 
> That would still be a stat() per request. I'd be curious how the timing
> compares to the opendir/readdir that happens now.

Modification times wouldn't be exact, as you mentioned above: An object
could be added just after checking the time stamp.  So perhaps we don't
need that, or a time-based invalidation suffices, e.g. we discard the
cache after five minutes or so?

Anyway, here's a patch for stat-based invalidation, on top of the other
one.  Array removal is really slow (hope I didn't sneak a bug in there,
but my confidence in this code isn't very high).  No locking is done;
parallel threads removing and adding entries could make a mess, but
that's not an issue for log.

Timings for "time git log --pretty=%h >/dev/null" in my git repository
with 5094 loose objects on Debian:

        master       first patch  this patch
real    0m1.065s     0m0.581s     0m0.633s
user    0m0.648s     0m0.564s     0m0.580s
sys     0m0.412s     0m0.016s     0m0.052s


And on mingw with 227 loose objects:

        master       first patch  this patch
real    0m1.756s     0m0.546s     0m1.659s
user    0m0.000s     0m0.000s     0m0.000s
sys     0m0.000s     0m0.000s     0m0.000s

So at least for Windows it would be really nice if we could avoid
calling stat..
---
 cache.h     |  1 +
 sha1_name.c | 32 ++++++++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 8c6748907b..386b09710d 100644
--- a/cache.h
+++ b/cache.h
@@ -1589,6 +1589,7 @@ extern struct alternate_object_database {
 
 	uint32_t loose_objects_subdir_bitmap[8];
 	struct oid_array loose_objects_cache;
+	time_t loose_objects_subdir_mtime[256];
 
 	char path[FLEX_ARRAY];
 } *alt_odb_list;
diff --git a/sha1_name.c b/sha1_name.c
index ae6a5c3abe..baecb10454 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,6 +77,24 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
 	/* otherwise, current can be discarded and candidate is still good */
 }
 
+static void remove_subdir_entries(struct oid_array *array, int subdir_nr)
+{
+	struct object_id oid;
+	int start, end;
+
+	memset(&oid, 0, sizeof(oid));
+	oid.hash[0] = subdir_nr;
+	start = oid_array_lookup(array, &oid);
+	if (start < 0)
+		start = -1 - start;
+	end = start;
+	while (end < array->nr && array->oid[end].hash[0] == subdir_nr)
+		end++;
+	memmove(array->oid + start, array->oid + end,
+		(array->nr - end) * sizeof(array->oid[0]));
+	array->nr -= end - start;
+}
+
 static void read_loose_object_subdir(struct alternate_object_database *alt,
 				     int subdir_nr)
 {
@@ -86,15 +104,21 @@ static void read_loose_object_subdir(struct alternate_object_database *alt,
 	struct dirent *de;
 	size_t bitmap_index = subdir_nr / 32;
 	uint32_t bitmap_mask = 1 << (subdir_nr % 32);
-
-	if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask)
-		return;
-	alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+	struct stat st;
 
 	buf = alt_scratch_buf(alt);
 	strbuf_addf(buf, "%02x/", subdir_nr);
 	xsnprintf(hex, sizeof(hex), "%02x", subdir_nr);
 
+	stat(buf->buf, &st);
+	if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask) {
+		if (alt->loose_objects_subdir_mtime[subdir_nr] == st.st_mtime)
+			return;
+		remove_subdir_entries(&alt->loose_objects_cache, subdir_nr);
+	}
+	alt->loose_objects_subdir_mtime[subdir_nr] = st.st_mtime;
+	alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+
 	dir = opendir(buf->buf);
 	if (!dir)
 		return;
-- 
2.13.0


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 10:58                   ` René Scharfe
@ 2017-06-18 11:49                     ` Jeff King
  2017-06-18 12:59                       ` René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2017-06-18 11:49 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sun, Jun 18, 2017 at 12:58:37PM +0200, René Scharfe wrote:

> An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
> a factor of 1.5 from alloc_nr).  How many loose objects do we have to
> expect?  30 MB for a million of them sounds not too bad, but can there
> realistically be orders of magnitude more?

Good point. We gc by default at 6000 loose objects, and lots of things
will suffer poor performance by the time you hit a million. So a little
extra space is probably not worth worrying about.

> So here's a patch for caching loose object hashes in an oid_array
> without a way to invalidate or release the cache.  It uses a single
> oid_array for simplicity, but reads each subdirectory individually and
> remembers that fact using a bitmap.

I like the direction of this very much.

> @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
>  	struct strbuf scratch;
>  	size_t base_len;
>  
> +	uint32_t loose_objects_subdir_bitmap[8];

Is it worth the complexity of having an actual bitmap and not just an
array of char? I guess it's not _that_ complex to access the bits, but
there are a lot of magic numbers involved.

> +	struct oid_array loose_objects_cache;

We should probably insert a comment here warning that the cache may go
out of date during the process run, and should only be used when that's
an acceptable tradeoff.

> +static void read_loose_object_subdir(struct alternate_object_database *alt,
> +				     int subdir_nr)

I think it's nice to pull this out into a helper function. I do wonder
if it should just be reusing for_each_loose_file_in_objdir(). You'd just
need to expose the in_obj_subdir() variant publicly.

It does do slightly more than we need (for the callbacks it actually
builds the filename), but I doubt memcpy()ing a few bytes would be
measurable.

> +		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
> +		if (pos < 0)
> +			pos = -1 - pos;
> +		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
> +			const struct object_id *oid;
> +			oid = alt->loose_objects_cache.oid + pos;
> +			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
> +				break;
> +			update_candidates(ds, oid);
> +			pos++;
>  		}

This part looks quite nice and straightforward.

-Peff


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 10:58                   ` René Scharfe
@ 2017-06-18 11:50                     ` Jeff King
  2017-06-19  4:46                       ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2017-06-18 11:50 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sun, Jun 18, 2017 at 12:58:49PM +0200, René Scharfe wrote:

> Anyway, here's a patch for stat-based invalidation, on top of the other
> one.  Array removal is really slow (hope I didn't sneak a bug in there,
> but my confidence in this code isn't very high).  No locking is done;
> parallel threads removing and adding entries could make a mess, but
> that's not an issue for log.
> 
> Timings for "time git log --pretty=%h >/dev/null" in my git repository
> with 5094 loose objects on Debian:
> 
>         master       first patch  this patch
> real    0m1.065s     0m0.581s     0m0.633s
> user    0m0.648s     0m0.564s     0m0.580s
> sys     0m0.412s     0m0.016s     0m0.052s
> 
> 
> And on mingw with 227 loose objects:
> 
>         master       first patch  this patch
> real    0m1.756s     0m0.546s     0m1.659s
> user    0m0.000s     0m0.000s     0m0.000s
> sys     0m0.000s     0m0.000s     0m0.000s
> 
> So at least for Windows it would be really nice if we could avoid
> calling stat..

Thanks for doing the timings. Given those numbers and the earlier
discussion, I'd be inclined to skip the mtime check.

-Peff


* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 11:49                     ` Jeff King
@ 2017-06-18 12:59                       ` René Scharfe
  2017-06-18 13:56                         ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-18 12:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 18.06.2017 13:49, Jeff King wrote:
> On Sun, Jun 18, 2017 at 12:58:37PM +0200, René Scharfe wrote:
> 
>> An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
>> a factor of 1.5 from alloc_nr).  How many loose objects do we have to
>> expect?  30 MB for a million of them sounds not too bad, but can there
>> realistically be orders of magnitude more?
> 
> Good point. We gc by default at 6000 loose objects, and lots of things
> will suffer poor performance by the time you hit a million. So a little
> extra space is probably not worth worrying about.
> 
>> So here's a patch for caching loose object hashes in an oid_array
>> without a way to invalidate or release the cache.  It uses a single
>> oid_array for simplicity, but reads each subdirectory individually and
>> remembers that fact using a bitmap.
> 
> I like the direction of this very much.
> 
>> @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
>>   	struct strbuf scratch;
>>   	size_t base_len;
>>   
>> +	uint32_t loose_objects_subdir_bitmap[8];
> 
> Is it worth the complexity of having an actual bitmap and not just an
> array of char? I guess it's not _that_ complex to access the bits, but
> there are a lot of magic numbers involved.

That would be 224 bytes more per alternate_object_database, and we'd
gain simpler code.  Hmm.  We could add some bitmap helper macros, but
you're probably right that the first version should use the simplest
form for representing small bitmaps that we currently have.

>> +	struct oid_array loose_objects_cache;
> 
> We should probably insert a comment here warning that the cache may go
> out of date during the process run, and should only be used when that's
> an acceptable tradeoff.

Then we need to offer a way for opting out.  Through a global variable?
(I'd rather reduce their number, but don't see how to allow programs to
nicely toggle this cache otherwise.)

>> +static void read_loose_object_subdir(struct alternate_object_database *alt,
>> +				     int subdir_nr)
> 
> I think it's nice to pull this out into a helper function. I do wonder
> if it should just be reusing for_each_loose_file_in_objdir(). You'd just
> need to expose the in_obj_subdir() variant publicly.
> 
> It does do slightly more than we need (for the callbacks it actually
> builds the filename), but I doubt memcpy()ing a few bytes would be
> measurable.

Good point.  The function also copies the common first two hex digits
for each entry.  But all that extra work is certainly dwarfed by the
readdir calls.

René

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 12:59                       ` René Scharfe
@ 2017-06-18 13:56                         ` Jeff King
  2017-06-22 18:19                           ` René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2017-06-18 13:56 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sun, Jun 18, 2017 at 02:59:04PM +0200, René Scharfe wrote:

> > > @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
> > >   	struct strbuf scratch;
> > >   	size_t base_len;
> > > +	uint32_t loose_objects_subdir_bitmap[8];
> > 
> > Is it worth the complexity of having an actual bitmap and not just an
> > array of char? I guess it's not _that_ complex to access the bits, but
> > there are a lot of magic numbers involved.
> 
> That would be 224 bytes more per alternate_object_database, and we'd
> gain simpler code.  Hmm.  We could add some bitmap helper macros, but
> you're probably right that the first version should use the simplest
> form for representing small bitmaps that we currently have.

I'd be OK with keeping it if we could reduce the number of magic
numbers. E.g., rather than 32 elsewhere use:

  (sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)

and similarly rather than 8 here use

  256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT

There's also already a bitmap data structure in ewah/bitmap.c. It's a
little bit overkill, though, because it mallocs and will grow the bitmap
as needed.

> > We should probably insert a comment here warning that the cache may go
> > out of date during the process run, and should only be used when that's
> > an acceptable tradeoff.
> 
> Then we need to offer a way for opting out.  Through a global variable?
> (I'd rather reduce their number, but don't see how to allow programs to
> nicely toggle this cache otherwise.)

Sorry, I didn't mean disabling it for a particular run of a program. I
just meant that we know this is an OK tradeoff for short-sha1 lookups,
but has_sha1_file(), for example, would never want to use it. So we are
warning programmers against using it in more code, not giving users a way
to turn it off run-to-run.

> > > +static void read_loose_object_subdir(struct alternate_object_database *alt,
> > > +				     int subdir_nr)
> > 
> > I think it's nice to pull this out into a helper function. I do wonder
> > if it should just be reusing for_each_loose_file_in_objdir(). You'd just
> > need to expose the in_obj_subdir() variant publicly.
> > 
> > It does do slightly more than we need (for the callbacks it actually
> > builds the filename), but I doubt memcpy()ing a few bytes would be
> > measurable.
> 
> Good point.  The function also copies the common first two hex digits
> for each entry.  But all that extra work is certainly dwarfed by the
> readdir calls.

Yes. You're welcome to micro-optimize that implementation if you want. ;)

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 11:50                     ` Jeff King
@ 2017-06-19  4:46                       ` Junio C Hamano
  2017-06-22 18:19                         ` [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename() René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2017-06-19  4:46 UTC (permalink / raw)
  To: Jeff King; +Cc: René Scharfe, Michael Giuffrida, git, SZEDER Gábor

Jeff King <peff@peff.net> writes:

> On Sun, Jun 18, 2017 at 12:58:49PM +0200, René Scharfe wrote:
>
>> Anyway, here's a patch for stat-based invalidation, on top of the other
>> one.  Array removal is really slow (hope I didn't sneak a bug in there,
>> but my confidence in this code isn't very high).  No locking is done;
>> parallel threads removing and adding entries could make a mess, but
>> that's not an issue for log.
>> 
>> Timings for "time git log --pretty=%h >/dev/null" in my git repository
>> with 5094 loose objects on Debian:
>> 
>>         master       first patch  this patch
>> real    0m1.065s     0m0.581s     0m0.633s
>> user    0m0.648s     0m0.564s     0m0.580s
>> sys     0m0.412s     0m0.016s     0m0.052s
>> 
>> 
>> And on mingw with 227 loose objects:
>> 
>>         master       first patch  this patch
>> real    0m1.756s     0m0.546s     0m1.659s
>> user    0m0.000s     0m0.000s     0m0.000s
>> sys     0m0.000s     0m0.000s     0m0.000s
>> 
>> So at least for Windows it would be really nice if we could avoid
>> calling stat..
>
> Thanks for doing the timings. Given those numbers and the earlier
> discussion, I'd be inclined to skip the mtime check.

Yeah, thanks for these experiments.  With or without invalidation,
we already accept that racing with other processes will make the
result inaccurate, so I am also inclined to say that it would be
best to take the first one alone.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-18 13:56                         ` Jeff King
@ 2017-06-22 18:19                           ` René Scharfe
  2017-06-22 23:15                             ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-22 18:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 18.06.2017 15:56, Jeff King wrote:
> On Sun, Jun 18, 2017 at 02:59:04PM +0200, René Scharfe wrote:
> 
>>>> @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
>>>>    	struct strbuf scratch;
>>>>    	size_t base_len;
>>>> +	uint32_t loose_objects_subdir_bitmap[8];
>>>
>>> Is it worth the complexity of having an actual bitmap and not just an
>>> array of char? I guess it's not _that_ complex to access the bits, but
>>> there are a lot of magic numbers involved.
>>
>> That would be 224 bytes more per alternate_object_database, and we'd
>> gain simpler code.  Hmm.  We could add some bitmap helper macros, but
>> you're probably right that the first version should use the simplest
>> form for representing small bitmaps that we currently have.
> 
> I'd be OK with keeping it if we could reduce the number of magic
> numbers. E.g., rather than 32 elsewhere use:
> 
>    (sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)

We have a bitsizeof macro for that.

> and similarly rather than 8 here use
> 
>    256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT

If we're pretending not to know the number of bits in a byte then we
need to round up, and we have DIV_ROUND_UP for that. :)
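
Something like this, I suppose (just a sketch, untested):

  uint32_t loose_objects_subdir_bitmap[DIV_ROUND_UP(256, bitsizeof(uint32_t))];

  /* mark subdir_nr as seen, without any bare 32s */
  loose_objects_subdir_bitmap[subdir_nr / bitsizeof(uint32_t)] |=
  	1u << (subdir_nr % bitsizeof(uint32_t));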

> There's also already a bitmap data structure in ewah/bitmap.c. It's a
> little bit overkill, though, because it mallocs and will grow the bitmap
> as needed.

Yes, I feel that's too big a hammer for this purpose.

There is another example of a bitmap in shallow.c (just search for
"32").  It would benefit from the macros mentioned above.  That
might make it easier to switch to the native word size (unsigned int)
instead of using uint32_t everywhere.

But perhaps this one is actually a candidate for using EWAH, depending
on the number of refs the code is supposed to handle.

>>>> +static void read_loose_object_subdir(struct alternate_object_database *alt,
>>>> +				     int subdir_nr)
>>>
>>> I think it's nice to pull this out into a helper function. I do wonder
>>> if it should just be reusing for_each_loose_file_in_objdir(). You'd just
>>> need to expose the in_obj_subdir() variant publicly.
>>>
>>> It does do slightly more than we need (for the callbacks it actually
>>> builds the filename), but I doubt memcpy()ing a few bytes would be
>>> measurable.
>>
>> Good point.  The function also copies the common first two hex digits
>> for each entry.  But all that extra work is certainly dwarfed by the
>> readdir calls.
> 
> Yes. You're welcome to micro-optimize that implementation if you want. ;)

My knee-jerk reaction was that this would lead to ugliness, but on
second look it might actually be doable.  Will check, but I don't
expect much of a speedup.

René

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-19  4:46                       ` Junio C Hamano
@ 2017-06-22 18:19                         ` René Scharfe
  2017-06-22 23:10                           ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-22 18:19 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King; +Cc: Michael Giuffrida, git, SZEDER Gábor

Read each loose object subdirectory at most once when looking for unique
abbreviated hashes.  This speeds up commands like "git log --pretty=%h"
considerably, which previously caused one readdir(3) call for each
candidate, even for subdirectories that were visited before.

The new cache is kept until the program ends and never invalidated.  The
same is already true for pack indexes.  The inherent racy nature of
finding unique short hashes makes it still fit for this purpose -- a
conflicting new object may be added at any time.  Tasks with higher
consistency requirements should not use it, though.

The cached object names are stored in an oid_array, which is quite
compact.  The bitmap for remembering which subdir was already read is
stored as a char array, with one char per directory -- that's not quite
as compact, but really simple and incurs only an overhead equivalent to
11 hashes after all.

Suggested-by: Jeff King <peff@peff.net>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 cache.h     | 17 +++++++++++++++++
 sha1_file.c | 12 ++++++------
 sha1_name.c | 50 ++++++++++++++++++++++++++++++--------------------
 3 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/cache.h b/cache.h
index d6ba8a2f11..00a017dfcb 100644
--- a/cache.h
+++ b/cache.h
@@ -11,6 +11,7 @@
 #include "string-list.h"
 #include "pack-revindex.h"
 #include "hash.h"
+#include "sha1-array.h"
 
 #ifndef platform_SHA_CTX
 /*
@@ -1593,6 +1594,16 @@ extern struct alternate_object_database {
 	struct strbuf scratch;
 	size_t base_len;
 
+	/*
+	 * Used to store the results of readdir(3) calls when searching
+	 * for unique abbreviated hashes.  This cache is never
+	 * invalidated, thus it's racy and not necessarily accurate.
+	 * That's fine for its purpose; don't use it for tasks requiring
+	 * greater accuracy!
+	 */
+	char loose_objects_subdir_seen[256];
+	struct oid_array loose_objects_cache;
+
 	char path[FLEX_ARRAY];
 } *alt_odb_list;
 extern void prepare_alt_odb(void);
@@ -1811,6 +1822,12 @@ typedef int each_loose_cruft_fn(const char *basename,
 typedef int each_loose_subdir_fn(int nr,
 				 const char *path,
 				 void *data);
+int for_each_file_in_obj_subdir(int subdir_nr,
+				struct strbuf *path,
+				each_loose_object_fn obj_cb,
+				each_loose_cruft_fn cruft_cb,
+				each_loose_subdir_fn subdir_cb,
+				void *data);
 int for_each_loose_file_in_objdir(const char *path,
 				  each_loose_object_fn obj_cb,
 				  each_loose_cruft_fn cruft_cb,
diff --git a/sha1_file.c b/sha1_file.c
index 59a4ed2ed3..5e0ee2b68b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3735,12 +3735,12 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
 		    typename(expect));
 }
 
-static int for_each_file_in_obj_subdir(int subdir_nr,
-				       struct strbuf *path,
-				       each_loose_object_fn obj_cb,
-				       each_loose_cruft_fn cruft_cb,
-				       each_loose_subdir_fn subdir_cb,
-				       void *data)
+int for_each_file_in_obj_subdir(int subdir_nr,
+				struct strbuf *path,
+				each_loose_object_fn obj_cb,
+				each_loose_cruft_fn cruft_cb,
+				each_loose_subdir_fn subdir_cb,
+				void *data)
 {
 	size_t baselen = path->len;
 	DIR *dir = opendir(path->buf);
diff --git a/sha1_name.c b/sha1_name.c
index 5126853bb5..ccb5144d0d 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,10 +77,19 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
 	/* otherwise, current can be discarded and candidate is still good */
 }
 
+static int append_loose_object(const struct object_id *oid, const char *path,
+			       void *data)
+{
+	oid_array_append(data, oid);
+	return 0;
+}
+
+static int match_sha(unsigned, const unsigned char *, const unsigned char *);
+
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
+	int subdir_nr = ds->bin_pfx.hash[0];
 	struct alternate_object_database *alt;
-	char hex[GIT_MAX_HEXSZ];
 	static struct alternate_object_database *fakeent;
 
 	if (!fakeent) {
@@ -95,29 +104,30 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 	}
 	fakeent->next = alt_odb_list;
 
-	xsnprintf(hex, sizeof(hex), "%.2s", ds->hex_pfx);
 	for (alt = fakeent; alt && !ds->ambiguous; alt = alt->next) {
-		struct strbuf *buf = alt_scratch_buf(alt);
-		struct dirent *de;
-		DIR *dir;
-
-		strbuf_addf(buf, "%.2s/", ds->hex_pfx);
-		dir = opendir(buf->buf);
-		if (!dir)
-			continue;
+		int pos;
 
-		while (!ds->ambiguous && (de = readdir(dir)) != NULL) {
-			struct object_id oid;
+		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
+			struct strbuf *buf = alt_scratch_buf(alt);
+			strbuf_addf(buf, "%02x/", subdir_nr);
+			for_each_file_in_obj_subdir(subdir_nr, buf,
+						    append_loose_object,
+						    NULL, NULL,
+						    &alt->loose_objects_cache);
+			alt->loose_objects_subdir_seen[subdir_nr] = 1;
+		}
 
-			if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
-				continue;
-			if (memcmp(de->d_name, ds->hex_pfx + 2, ds->len - 2))
-				continue;
-			memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
-			if (!get_oid_hex(hex, &oid))
-				update_candidates(ds, &oid);
+		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
+		if (pos < 0)
+			pos = -1 - pos;
+		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
+			const struct object_id *oid;
+			oid = alt->loose_objects_cache.oid + pos;
+			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
+				break;
+			update_candidates(ds, oid);
+			pos++;
 		}
-		closedir(dir);
 	}
 }
 
-- 
2.13.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-22 18:19                         ` [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename() René Scharfe
@ 2017-06-22 23:10                           ` Jeff King
  2017-06-24 12:12                             ` René Scharfe
  2017-06-24 12:12                             ` René Scharfe
  0 siblings, 2 replies; 30+ messages in thread
From: Jeff King @ 2017-06-22 23:10 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Thu, Jun 22, 2017 at 08:19:48PM +0200, René Scharfe wrote:

> Read each loose object subdirectory at most once when looking for unique
> abbreviated hashes.  This speeds up commands like "git log --pretty=%h"
> considerably, which previously caused one readdir(3) call for each
> candidate, even for subdirectories that were visited before.

Is it worth adding a perf test that's heavy on abbreviations?

> The new cache is kept until the program ends and never invalidated.  The
> same is already true for pack indexes.  The inherent racy nature of
> finding unique short hashes makes it still fit for this purpose -- a
> conflicting new object may be added at any time.  Tasks with higher
> consistency requirements should not use it, though.
> 
> The cached object names are stored in an oid_array, which is quite
> compact.  The bitmap for remembering which subdir was already read is
> stored as a char array, with one char per directory -- that's not quite
> as compact, but really simple and incurs only an overhead equivalent to
> 11 hashes after all.

Sounds reasonable. The code itself looks good, with one minor nit below.

> @@ -1593,6 +1594,16 @@ extern struct alternate_object_database {
>  	struct strbuf scratch;
>  	size_t base_len;
>  
> +	/*
> +	 * Used to store the results of readdir(3) calls when searching
> +	 * for unique abbreviated hashes.  This cache is never
> +	 * invalidated, thus it's racy and not necessarily accurate.
> +	 * That's fine for its purpose; don't use it for tasks requiring
> +	 * greater accuracy!
> +	 */
> +	char loose_objects_subdir_seen[256];
> +	struct oid_array loose_objects_cache;
> +

Thanks for adding this comment.

> @@ -1811,6 +1822,12 @@ typedef int each_loose_cruft_fn(const char *basename,
>  typedef int each_loose_subdir_fn(int nr,
>  				 const char *path,
>  				 void *data);
> +int for_each_file_in_obj_subdir(int subdir_nr,
> +				struct strbuf *path,
> +				each_loose_object_fn obj_cb,
> +				each_loose_cruft_fn cruft_cb,
> +				each_loose_subdir_fn subdir_cb,
> +				void *data);

Now that this is becoming public, I think we need to document what
should be in "path" here. It is the complete path, including the 2-hex
subdir.

That's redundant with "subdir_nr". Should we push that logic down into
the function, and basically do:

  if (subdir_nr < 0 || subdir_nr > 256)
	BUG("object subdir number out of range");
  orig_len = path->len;
  strbuf_addf(path, "/%02x", subdir_nr);
  baselen = path->len;

  ...

  strbuf_setlen(path, orig_len);

That's one less thing for the caller to worry about, and it's easy to
explain the argument as "the path to the object directory root".

> +		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
> +			struct strbuf *buf = alt_scratch_buf(alt);
> +			strbuf_addf(buf, "%02x/", subdir_nr);
> +			for_each_file_in_obj_subdir(subdir_nr, buf,
> +						    append_loose_object,
> +						    NULL, NULL,
> +						    &alt->loose_objects_cache);

I think the trailing slash here is superfluous. If you take my
suggestion above, that line goes away. But then the front slash it adds
becomes superfluous. It should probably just do:

  strbuf_complete(path, '/');
  strbuf_addf(path, "%02x", subdir_nr);

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [BUG] add_again() off-by-one error in custom format
  2017-06-22 18:19                           ` René Scharfe
@ 2017-06-22 23:15                             ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2017-06-22 23:15 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Thu, Jun 22, 2017 at 08:19:40PM +0200, René Scharfe wrote:

> > I'd be OK with keeping it if we could reduce the number of magic
> > numbers. E.g., rather than 32 elsewhere use:
> > 
> >    (sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)
> 
> We have a bitsizeof macro for that.
> 
> > and similarly rather than 8 here use
> > 
> >    256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT
> 
> If we're pretending not to know the number of bits in a byte then we
> need to round up, and we have DIV_ROUND_UP for that. :)

Thanks, I meant to mention the rounding thing but forgot. I didn't know
about either of those macros, though.

> There is another example of a bitmap in shallow.c (just search for
> "32").  It would benefit from the macros mentioned above.  That
> might make it easier to switch to the native word size (unsigned int)
> instead of using uint32_t everywhere.
> 
> But perhaps this one is actually a candidate for using EWAH, depending
> on the number of refs the code is supposed to handle.

I thought at first you meant real EWAH bitmaps. But those aren't random
access (so you have to read them into an uncompressed bitmap in memory,
which is why ewah/bitmap.c exists). But if you just mean reusing those
bitmaps, then yeah, possibly.

It looks like it's using a commit slab, though. If you know the number
of bits you need up front, then the commit slab can do it without any
waste. I didn't dig enough to see if that's what it's doing or not.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-22 23:10                           ` Jeff King
@ 2017-06-24 12:12                             ` René Scharfe
  2017-06-24 12:14                               ` Jeff King
  2017-06-24 12:12                             ` René Scharfe
  1 sibling, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-24 12:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 23.06.2017 01:10, Jeff King wrote:
> On Thu, Jun 22, 2017 at 08:19:48PM +0200, René Scharfe wrote:
> 
>> Read each loose object subdirectory at most once when looking for unique
>> abbreviated hashes.  This speeds up commands like "git log --pretty=%h"
>> considerably, which previously caused one readdir(3) call for each
>> candidate, even for subdirectories that were visited before.
> 
> Is it worth adding a perf test that's heavy on abbreviations?

Sure.  Here's a simple one.  It currently reports for me:

Test                        origin/master     origin/next              origin/pu
---------------------------------------------------------------------------------------------
4205.1: log with %H         0.44(0.41+0.02)   0.43(0.42+0.01) -2.3%    0.43(0.43+0.00) -2.3%
4205.2: log with %h         1.03(0.60+0.42)   1.04(0.64+0.39) +1.0%    0.57(0.55+0.01) -44.7%
4205.3: log with %T         0.43(0.42+0.00)   0.43(0.42+0.01) +0.0%    0.43(0.40+0.02) +0.0%
4205.4: log with %t         1.05(0.64+0.38)   1.05(0.61+0.42) +0.0%    0.59(0.58+0.00) -43.8%
4205.5: log with %P         0.45(0.42+0.02)   0.43(0.42+0.00) -4.4%    0.43(0.42+0.01) -4.4%
4205.6: log with %p         1.21(0.63+0.57)   1.17(0.68+0.47) -3.3%    0.59(0.58+0.00) -51.2%
4205.7: log with %h-%h-%h   1.05(0.64+0.39)   2.00(0.83+1.15) +90.5%   0.69(0.66+0.02) -34.3%

origin/next has fe9e2aefd4 (pretty: recalculate duplicate short hashes),
while origin/pu has cc817ca3ef (sha1_name: cache readdir(3) results in
find_short_object_filename()).

-- >8 --
Subject: [PATCH] p4205: add perf test script for pretty log formats

Add simple performance tests for expanded log format placeholders.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 t/perf/p4205-log-pretty-formats.sh | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100755 t/perf/p4205-log-pretty-formats.sh

diff --git a/t/perf/p4205-log-pretty-formats.sh b/t/perf/p4205-log-pretty-formats.sh
new file mode 100755
index 0000000000..7c26f4f337
--- /dev/null
+++ b/t/perf/p4205-log-pretty-formats.sh
@@ -0,0 +1,16 @@
+#!/bin/sh
+
+test_description='Tests the performance of various pretty format placeholders'
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+for format in %H %h %T %t %P %p %h-%h-%h
+do
+	test_perf "log with $format" "
+		git log --format=\"$format\" >/dev/null
+	"
+done
+
+test_done
-- 
2.13.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-22 23:10                           ` Jeff King
  2017-06-24 12:12                             ` René Scharfe
@ 2017-06-24 12:12                             ` René Scharfe
  2017-06-24 12:20                               ` Jeff King
  1 sibling, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-24 12:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 23.06.2017 01:10, Jeff King wrote:
> On Thu, Jun 22, 2017 at 08:19:48PM +0200, René Scharfe wrote:
>> @@ -1811,6 +1822,12 @@ typedef int each_loose_cruft_fn(const char *basename,
>>   typedef int each_loose_subdir_fn(int nr,
>>   				 const char *path,
>>   				 void *data);
>> +int for_each_file_in_obj_subdir(int subdir_nr,
>> +				struct strbuf *path,
>> +				each_loose_object_fn obj_cb,
>> +				each_loose_cruft_fn cruft_cb,
>> +				each_loose_subdir_fn subdir_cb,
>> +				void *data);
> 
> Now that this is becoming public, I think we need to document what
> should be in "path" here. It is the complete path, including the 2-hex
> subdir.
> 
> That's redundant with "subdir_nr". Should we push that logic down into
> the function, and basically do:
> 
>    if (subdir_nr < 0 || subdir_nr > 256)
> 	BUG("object subdir number out of range");

Hmm.  I don't expect many more callers, so do we really need this safety
check?  It's cheap compared to the readdir(3) call, of course.

But wouldn't it make more sense to use an unsigned data type to avoid
the first half?  And an unsigned char specifically to only accept valid
values, leaving the overflow concern to the caller?  It warrants a
separate patch in any case IMHO.

>    orig_len = path->len;
>    strbuf_addf(path, "/%02x", subdir_nr);
>    baselen = path->len;
> 
>    ...
> 
>    strbuf_setlen(path, orig_len);
> 
> That's one less thing for the caller to worry about, and it's easy to
> explain the argument as "the path to the object directory root".
> 
>> +		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
>> +			struct strbuf *buf = alt_scratch_buf(alt);
>> +			strbuf_addf(buf, "%02x/", subdir_nr);
>> +			for_each_file_in_obj_subdir(subdir_nr, buf,
>> +						    append_loose_object,
>> +						    NULL, NULL,
>> +						    &alt->loose_objects_cache);
> 
> I think the trailing slash here is superfluous. If you take my
> suggestion above, that line goes away. But then the front slash it adds
> becomes superfluous. It should probably just do:
> 
>    strbuf_complete(path, '/');
>    strbuf_addf(path, "%02x", subdir_nr);

So how about this then as a follow-up patch?

-- >8 --
Subject: [PATCH] sha1_file: let for_each_file_in_obj_subdir() handle subdir names

The function for_each_file_in_obj_subdir() takes an object subdirectory
number and expects the name of the same subdirectory to be included in
the path strbuf.  Avoid this redundancy by letting the function append
the hexadecimal subdirectory name itself.  This makes it a bit easier
and safer to use the function -- it becomes impossible to specify
different subdirectories in subdir_nr and path.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 sha1_file.c | 22 ++++++++++++++--------
 sha1_name.c |  1 -
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 5e0ee2b68b..98ce85acf9 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3742,15 +3742,22 @@ int for_each_file_in_obj_subdir(int subdir_nr,
 				each_loose_subdir_fn subdir_cb,
 				void *data)
 {
-	size_t baselen = path->len;
-	DIR *dir = opendir(path->buf);
+	size_t origlen, baselen;
+	DIR *dir;
 	struct dirent *de;
 	int r = 0;
 
+	origlen = path->len;
+	strbuf_complete(path, '/');
+	strbuf_addf(path, "%02x", subdir_nr);
+	baselen = path->len;
+
+	dir = opendir(path->buf);
 	if (!dir) {
-		if (errno == ENOENT)
-			return 0;
-		return error_errno("unable to open %s", path->buf);
+		if (errno != ENOENT)
+			r = error_errno("unable to open %s", path->buf);
+		strbuf_setlen(path, origlen);
+		return r;
 	}
 
 	while ((de = readdir(dir))) {
@@ -3788,6 +3795,8 @@ int for_each_file_in_obj_subdir(int subdir_nr,
 	if (!r && subdir_cb)
 		r = subdir_cb(subdir_nr, path->buf, data);
 
+	strbuf_setlen(path, origlen);
+
 	return r;
 }
 
@@ -3797,15 +3806,12 @@ int for_each_loose_file_in_objdir_buf(struct strbuf *path,
 			    each_loose_subdir_fn subdir_cb,
 			    void *data)
 {
-	size_t baselen = path->len;
 	int r = 0;
 	int i;
 
 	for (i = 0; i < 256; i++) {
-		strbuf_addf(path, "/%02x", i);
 		r = for_each_file_in_obj_subdir(i, path, obj_cb, cruft_cb,
 						subdir_cb, data);
-		strbuf_setlen(path, baselen);
 		if (r)
 			break;
 	}
diff --git a/sha1_name.c b/sha1_name.c
index ccb5144d0d..5670e2540f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -109,7 +109,6 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 
 		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
 			struct strbuf *buf = alt_scratch_buf(alt);
-			strbuf_addf(buf, "%02x/", subdir_nr);
 			for_each_file_in_obj_subdir(subdir_nr, buf,
 						    append_loose_object,
 						    NULL, NULL,
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-24 12:12                             ` René Scharfe
@ 2017-06-24 12:14                               ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2017-06-24 12:14 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sat, Jun 24, 2017 at 02:12:07PM +0200, René Scharfe wrote:

> On 23.06.2017 01:10, Jeff King wrote:
> > On Thu, Jun 22, 2017 at 08:19:48PM +0200, René Scharfe wrote:
> > 
> >> Read each loose object subdirectory at most once when looking for unique
> >> abbreviated hashes.  This speeds up commands like "git log --pretty=%h"
> >> considerably, which previously caused one readdir(3) call for each
> >> candidate, even for subdirectories that were visited before.
> > 
> > Is it worth adding a perf test that's heavy on abbreviations?
> 
> Sure.  Here's a simple one.  It currently reports for me:
> 
> Test                        origin/master     origin/next              origin/pu
> ---------------------------------------------------------------------------------------------
> 4205.1: log with %H         0.44(0.41+0.02)   0.43(0.42+0.01) -2.3%    0.43(0.43+0.00) -2.3%
> 4205.2: log with %h         1.03(0.60+0.42)   1.04(0.64+0.39) +1.0%    0.57(0.55+0.01) -44.7%
> 4205.3: log with %T         0.43(0.42+0.00)   0.43(0.42+0.01) +0.0%    0.43(0.40+0.02) +0.0%
> 4205.4: log with %t         1.05(0.64+0.38)   1.05(0.61+0.42) +0.0%    0.59(0.58+0.00) -43.8%
> 4205.5: log with %P         0.45(0.42+0.02)   0.43(0.42+0.00) -4.4%    0.43(0.42+0.01) -4.4%
> 4205.6: log with %p         1.21(0.63+0.57)   1.17(0.68+0.47) -3.3%    0.59(0.58+0.00) -51.2%
> 4205.7: log with %h-%h-%h   1.05(0.64+0.39)   2.00(0.83+1.15) +90.5%   0.69(0.66+0.02) -34.3%
> 
> origin/next has fe9e2aefd4 (pretty: recalculate duplicate short hashes),
> while origin/pu has cc817ca3ef (sha1_name: cache readdir(3) results in
> find_short_object_filename()).

Perfect. That last one says everything we need to know about which
approach to take. :)

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-24 12:12                             ` René Scharfe
@ 2017-06-24 12:20                               ` Jeff King
  2017-06-24 14:09                                 ` René Scharfe
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2017-06-24 12:20 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sat, Jun 24, 2017 at 02:12:30PM +0200, René Scharfe wrote:

> > That's redundant with "subdir_nr". Should we push that logic down into
> > the function, and basically do:
> > 
> >    if (subdir_nr < 0 || subdir_nr > 256)
> > 	BUG("object subdir number out of range");
> 
> Hmm.  I don't expect many more callers, so do we really need this safety
> check?  It's cheap compared to the readdir(3) call, of course.

To me it's as much about documenting the assumptions as it is about
catching buggy callers. I'd be OK with a comment, too.

> But wouldn't it make more sense to use an unsigned data type to avoid
> the first half?  And an unsigned char specifically to only accept valid
> values, leaving the overflow concern to the caller?  It warrants a
> separate patch in any case IMHO.

Using unsigned makes sense. Using "char" because you know it only goes
to 256 is a bit too subtle for my taste. And yes, I'd do it in a
preparatory patch (or follow-on if you prefer).

> -- >8 --
> Subject: [PATCH] sha1_file: let for_each_file_in_obj_subdir() handle subdir names
> 
> The function for_each_file_in_obj_subdir() takes an object subdirectory
> number and expects the name of the same subdirectory to be included in
> the path strbuf.  Avoid this redundancy by letting the function append
> the hexadecimal subdirectory name itself.  This makes it a bit easier
> and safer to use the function -- it becomes impossible to specify
> different subdirectories in subdir_nr and path.

Yeah, this is what I had in mind.

>  	for (i = 0; i < 256; i++) {
> -		strbuf_addf(path, "/%02x", i);
>  		r = for_each_file_in_obj_subdir(i, path, obj_cb, cruft_cb,
>  						subdir_cb, data);
> -		strbuf_setlen(path, baselen);

One side effect of the original code is that this trailing setlen()
would catch any early-exits from for_each_file_in_obj_subdir() which
forgot to reset the strbuf. If any exist, that's a bug that should be
fixed in that function, though. I double-checked, and there aren't any
(your patch already handles the extra setlen required when opendir
fails).

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-24 12:20                               ` Jeff King
@ 2017-06-24 14:09                                 ` René Scharfe
  2017-06-24 14:12                                   ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: René Scharfe @ 2017-06-24 14:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On 24.06.2017 14:20, Jeff King wrote:
> On Sat, Jun 24, 2017 at 02:12:30PM +0200, René Scharfe wrote:
> 
>>> That's redundant with "subdir_nr". Should we push that logic down into
>>> the function, and basically do:
>>>
>>>     if (subdir_nr < 0 || subdir_nr > 256)
>>> 	BUG("object subdir number out of range");
>>
>> Hmm.  I don't expect many more callers, so do we really need this safety
>> check?  It's cheap compared to the readdir(3) call, of course.
> 
> To me it's as much about documenting the assumptions as it is about
> catching buggy callers. I'd be OK with a comment, too.

I didn't catch the off-by-one error above in the first reading.  Did you
add it intentionally? ;-)  In any case, I'm convinced now that we need
such a check, but with hexadecimal notation to avoid such mistakes.

>> But wouldn't it make more sense to use an unsigned data type to avoid
>> the first half?  And an unsigned char specifically to only accept valid
>> values, leaving the overflow concern to the caller?  It warrants a
>> separate patch in any case IMHO.
> 
> Using unsigned makes sense. Using "char" because you know it only goes
> to 256 is a bit too subtle for my taste. And yes, I'd do it in a
> preparatory patch (or follow-on if you prefer).

It's subtle on the caller's side, as big numbers would just wrap if the
programmer ignored the limits of the type.  On the callee's side it's
fundamental, though.
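
To spell out the wrapping with a made-up example (not from the patch):

  static void read_subdir(unsigned char subdir_nr); /* type admits only 0..255 */

  read_subdir(256); /* no error, the argument quietly wraps to 0 */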

Anyway, here's a patch on top of the others.

-- >8 --
Subject: [PATCH] sha1_file: guard against invalid loose subdirectory numbers

Loose object subdirectories have hexadecimal names based on the first
byte of the hash of contained objects, thus their numerical
representation can range from 0 (0x00) to 255 (0xff).  Change the type
of the corresponding variable in for_each_file_in_obj_subdir() and
associated callback functions to unsigned int and add a range check.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 builtin/fsck.c         | 2 +-
 builtin/prune-packed.c | 2 +-
 builtin/prune.c        | 2 +-
 cache.h                | 4 ++--
 sha1_file.c            | 5 ++++-
 5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 3a2c27f241..5d113f8bc0 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -536,7 +536,7 @@ static int fsck_cruft(const char *basename, const char *path, void *data)
 	return 0;
 }
 
-static int fsck_subdir(int nr, const char *path, void *progress)
+static int fsck_subdir(unsigned int nr, const char *path, void *progress)
 {
 	display_progress(progress, nr + 1);
 	return 0;
diff --git a/builtin/prune-packed.c b/builtin/prune-packed.c
index c026299e78..ac978ad401 100644
--- a/builtin/prune-packed.c
+++ b/builtin/prune-packed.c
@@ -10,7 +10,7 @@ static const char * const prune_packed_usage[] = {
 
 static struct progress *progress;
 
-static int prune_subdir(int nr, const char *path, void *data)
+static int prune_subdir(unsigned int nr, const char *path, void *data)
 {
 	int *opts = data;
 	display_progress(progress, nr + 1);
diff --git a/builtin/prune.c b/builtin/prune.c
index f0e2bff04c..c378690545 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -68,7 +68,7 @@ static int prune_cruft(const char *basename, const char *path, void *data)
 	return 0;
 }
 
-static int prune_subdir(int nr, const char *path, void *data)
+static int prune_subdir(unsigned int nr, const char *path, void *data)
 {
 	if (!show_only)
 		rmdir(path);
diff --git a/cache.h b/cache.h
index 00a017dfcb..04dd2961ad 100644
--- a/cache.h
+++ b/cache.h
@@ -1819,10 +1819,10 @@ typedef int each_loose_object_fn(const struct object_id *oid,
 typedef int each_loose_cruft_fn(const char *basename,
 				const char *path,
 				void *data);
-typedef int each_loose_subdir_fn(int nr,
+typedef int each_loose_subdir_fn(unsigned int nr,
 				 const char *path,
 				 void *data);
-int for_each_file_in_obj_subdir(int subdir_nr,
+int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 				struct strbuf *path,
 				each_loose_object_fn obj_cb,
 				each_loose_cruft_fn cruft_cb,
diff --git a/sha1_file.c b/sha1_file.c
index 98ce85acf9..90b9414170 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3735,7 +3735,7 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
 		    typename(expect));
 }
 
-int for_each_file_in_obj_subdir(int subdir_nr,
+int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 				struct strbuf *path,
 				each_loose_object_fn obj_cb,
 				each_loose_cruft_fn cruft_cb,
@@ -3747,6 +3747,9 @@ int for_each_file_in_obj_subdir(int subdir_nr,
 	struct dirent *de;
 	int r = 0;
 
+	if (subdir_nr > 0xff)
+		BUG("invalid loose object subdirectory: %x", subdir_nr);
+
 	origlen = path->len;
 	strbuf_complete(path, '/');
 	strbuf_addf(path, "%02x", subdir_nr);
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename()
  2017-06-24 14:09                                 ` René Scharfe
@ 2017-06-24 14:12                                   ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2017-06-24 14:12 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Michael Giuffrida, git, SZEDER Gábor

On Sat, Jun 24, 2017 at 04:09:39PM +0200, René Scharfe wrote:

> On 24.06.2017 14:20, Jeff King wrote:
> > On Sat, Jun 24, 2017 at 02:12:30PM +0200, René Scharfe wrote:
> > 
> >>> That's redundant with "subdir_nr". Should we push that logic down into
> >>> the function, and basically do:
> >>>
> >>>     if (subdir_nr < 0 || subdir_nr > 256)
> >>> 	BUG("object subdir number out of range");
> >>
> >> Hmm.  I don't expect many more callers, so do we really need this safety
> >> check?  It's cheap compared to the readdir(3) call, of course.
> > 
> > To me it's as much about documenting the assumptions as it is about
> > catching buggy callers. I'd be OK with a comment, too.
> 
> I didn't catch the off-by-one error above in the first reading.  Did you
> add it intentionally? ;-)  In any case, I'm convinced now that we need
> such a check, but with hexadecimal notation to avoid such mistakes.

Err...yeah, it was totally intentional. ;)

I agree writing it in hex is better.

> -- >8 --
> Subject: [PATCH] sha1_file: guard against invalid loose subdirectory numbers
> 
> Loose object subdirectories have hexadecimal names based on the first
> byte of the hash of contained objects, thus their numerical
> representation can range from 0 (0x00) to 255 (0xff).  Change the type
> of the corresponding variable in for_each_file_in_obj_subdir() and
> associated callback functions to unsigned int and add a range check.
> 
> Suggested-by: Jeff King <peff@peff.net>
> Signed-off-by: Rene Scharfe <l.s.r@web.de>

This looks great. Thanks.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-06-24 14:12 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-12  3:13 [BUG] add_again() off-by-one error in custom format Michael Giuffrida
2017-06-12 22:49 ` Junio C Hamano
2017-06-13 18:09   ` René Scharfe
2017-06-13 18:29     ` Junio C Hamano
2017-06-13 20:29       ` René Scharfe
2017-06-13 21:20         ` Junio C Hamano
2017-06-14 18:24           ` René Scharfe
2017-06-15  5:56             ` Jeff King
2017-06-15 11:33               ` René Scharfe
2017-06-15 13:25                 ` Jeff King
2017-06-18 10:58                   ` René Scharfe
2017-06-18 11:49                     ` Jeff King
2017-06-18 12:59                       ` René Scharfe
2017-06-18 13:56                         ` Jeff King
2017-06-22 18:19                           ` René Scharfe
2017-06-22 23:15                             ` Jeff King
2017-06-18 10:58                   ` René Scharfe
2017-06-18 11:50                     ` Jeff King
2017-06-19  4:46                       ` Junio C Hamano
2017-06-22 18:19                         ` [PATCH] sha1_name: cache readdir(3) results in find_short_object_filename() René Scharfe
2017-06-22 23:10                           ` Jeff King
2017-06-24 12:12                             ` René Scharfe
2017-06-24 12:14                               ` Jeff King
2017-06-24 12:12                             ` René Scharfe
2017-06-24 12:20                               ` Jeff King
2017-06-24 14:09                                 ` René Scharfe
2017-06-24 14:12                                   ` Jeff King
2017-06-15 18:37             ` [BUG] add_again() off-by-one error in custom format Junio C Hamano
2017-06-13 22:24         ` SZEDER Gábor
2017-06-14 17:34           ` René Scharfe
