[PATCH] sha1_file: use strbuf_add() instead of strbuf

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
@ 2017-12-01 17:49 Derrick Stolee
  2017-12-01 18:22 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Derrick Stolee @ 2017-12-01 17:49 UTC (permalink / raw)
  To: git; +Cc: peff, git, stolee, Derrick Stolee

Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --raw --parents -1000'
had the following performance:

 HEAD~1            HEAD
----------------------------------------
 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sha1_file.c              | 7 ++++---
 t/perf/p4211-line-log.sh | 4 ++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}

 	oid.hash[0] = subdir_nr;
+	strbuf_add(path, "/", 1);
+	baselen = path->len;

 	while ((de = readdir(dir))) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;

-		strbuf_setlen(path, baselen);
-		strbuf_addf(path, "/%s", de->d_name);
-
 		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
 		    !hex_to_bytes(oid.hash + 1, de->d_name,
 				  GIT_SHA1_RAWSZ - 1)) {
+			strbuf_setlen(path, baselen);
+			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
 			if (obj_cb) {
 				r = obj_cb(&oid, path->buf, data);
 				if (r)
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
 	git log --oneline --raw --parents >/dev/null
 '

+test_perf 'git log --oneline --raw --parents -1000' '
+	git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-01 17:49 [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf() Derrick Stolee
@ 2017-12-01 18:22 ` Jeff King
  2017-12-01 19:50   ` Derrick Stolee
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-12-01 18:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, git, stolee

On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote:

> Replace use of strbuf_addf() with strbuf_add() when enumerating
> loose objects in for_each_file_in_obj_subdir(). Since we already
> check the length and hex-values of the string before consuming
> the path, we can prevent extra computation by using the lower-
> level method.

Makes sense. It's a shame that "addf" is turning out to be so slow (I'm
still mildly curious if that differs between Windows and glibc, but
there are so many other variables between the two platforms it's hard to
test).

> One consumer of for_each_file_in_obj_subdir() is the abbreviation
> code. OID abbreviations use a cached list of loose objects (per
> object subdirectory) to make repeated queries fast, but there is
> significant cache load time when there are many loose objects.
> 
> Most repositories do not have many loose objects before repacking,
> but in the GVFS case the repos can grow to have millions of loose
> objects. Profiling 'git log' performance in GitForWindows on a
> GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
> the CPU time was spent in strbuf_addf().

Yeah, we haven't heavily micro-optimized the case for having lots of
loose objects. The general philosophy about having lots of loose objects
is: don't. You're generally going to pay a system call for each access,
which is much heavier-weight than packfiles.

I think abbreviation-finding is the exception there, though, because we
literally just readdir() the entries and never do anything else with
them.

> Add a new performance test to p4211-line-log.sh that is more
> sensitive to this cache-loading. By limiting to 1000 commits, we
> more closely resemble user wait time when reading history into a
> pager.
> 
> For a copy of the Linux repo with two ~512 MB packfiles and ~572K
> loose objects, running 'git log --oneline --raw --parents -1000'
> had the following performance:
> 
>  HEAD~1            HEAD
> ----------------------------------------
>  7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Thanks for including numbers. I think the setup here highlights a
weakness of the perf suite that we've discussed: if there's a
performance regression for your case, nobody is likely to notice because
they won't test on a repo with 500k loose objects.

Probably "repo with a bunch of loose objects" ought to be a stock
repository profile for doing regression runs.

> diff --git a/sha1_file.c b/sha1_file.c
> index 8ae6cb6285..2160323c4a 100644

This overall looks good, but I noticed one bug and a few cosmetic
improvements.

> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
>  	}
>  
>  	oid.hash[0] = subdir_nr;
> +	strbuf_add(path, "/", 1);

Maybe:

  strbuf_addch(path, '/');

would be a bit more readable (it's also a tiny bit faster, but this
isn't part of the tight loop).

> +	baselen = path->len;

We set this here so that the '/' is included as part of the base. Makes
sense, but can we now drop the earlier setting of baselen before the
opendir() call?

>  	while ((de = readdir(dir))) {
>  		if (is_dot_or_dotdot(de->d_name))
>  			continue;
>  
> -		strbuf_setlen(path, baselen);
> -		strbuf_addf(path, "/%s", de->d_name);
> -
>  		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
>  		    !hex_to_bytes(oid.hash + 1, de->d_name,
>  				  GIT_SHA1_RAWSZ - 1)) {
> +			strbuf_setlen(path, baselen);
> +			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);

Moving this code into the conditional makes sense here, since that's
where we know the actual size. But what about the case when we _don't_
hit this conditional. We still do:

  if (cruft_cb)
        cruft_cb(path->buf);

which is now broken (it will just get the directory name without the
entry appended).

I think the optimized versions probably needs to be something more like:

  while (de = readdir(...)) {
      strbuf_setlen(path, baselen);
      if (size is HEXSZ - 2) {
          strbuf_add(path, de->d_name, HEXSZ - 2);
	  obj_cb(path->buf);
      } else if (cruft_cb) {
          strbuf_addstr(path, de->d_name);
	  cruft_cb(path->buf);
      }
  }

Two other thoughts:

  - from past discussions, I suspect most of your performance
    improvement actually comes from avoiding addf(), and the difference
    between addstr() and add() may be negligible here. It might be worth
    timing strbuf_addstr(). If that's similarly fast, that means we
    could keep the logic out of the conditional entirely.

  - there's an extra micro-optimization there, which is that if there's
    no obj_cb, we have no need to assemble the full path at all. I doubt
    it makes much of a difference, as most callers would pass an object
    callback (I'd be surprised if we even have one that doesn't).

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-01 18:22 ` Jeff King
@ 2017-12-01 19:50   ` Derrick Stolee
  2017-12-01 22:39     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Derrick Stolee @ 2017-12-01 19:50 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee; +Cc: git, git

On 12/1/2017 1:22 PM, Jeff King wrote:
> On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote:
[snip]
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 8ae6cb6285..2160323c4a 100644
> This overall looks good, but I noticed one bug and a few cosmetic
> improvements.

Thanks for finding quality improvements to this patch. I'll let it sit 
over the weekend for more feedback before sending a v2.

>
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
>>   	}
>>   
>>   	oid.hash[0] = subdir_nr;
>> +	strbuf_add(path, "/", 1);
> Maybe:
>
>    strbuf_addch(path, '/');
>
> would be a bit more readable (it's also a tiny bit faster, but this
> isn't part of the tight loop).
>
>> +	baselen = path->len;
> We set this here so that the '/' is included as part of the base. Makes
> sense, but can we now drop the earlier setting of baselen before the
> opendir() call?

Yeah, probably. I had briefly considered just adding the '/' before the 
first assignment of "baselen", but didn't want to change the error 
output. I also don't know if there are side effects for other platforms 
by calling opendir() with a '/'-terminated path.

>>   	while ((de = readdir(dir))) {
>>   		if (is_dot_or_dotdot(de->d_name))
>>   			continue;
>>   
>> -		strbuf_setlen(path, baselen);
>> -		strbuf_addf(path, "/%s", de->d_name);
>> -
>>   		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
>>   		    !hex_to_bytes(oid.hash + 1, de->d_name,
>>   				  GIT_SHA1_RAWSZ - 1)) {
>> +			strbuf_setlen(path, baselen);
>> +			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
> Moving this code into the conditional makes sense here, since that's
> where we know the actual size. But what about the case when we _don't_
> hit this conditional. We still do:
>
>    if (cruft_cb)
>          cruft_cb(path->buf);
>
> which is now broken (it will just get the directory name without the
> entry appended).

Good catch! A big reason to pull it inside and use strbuf_add over 
strbuf_addstr is to avoid a duplicate strlen() calculation. However, I 
can store the length before the conditional.

> I think the optimized versions probably needs to be something more like:
>
>    while (de = readdir(...)) {
>        strbuf_setlen(path, baselen);
>        if (size is HEXSZ - 2) {
>            strbuf_add(path, de->d_name, HEXSZ - 2);
> 	  obj_cb(path->buf);
>        } else if (cruft_cb) {
>            strbuf_addstr(path, de->d_name);
> 	  cruft_cb(path->buf);
>        }
>    }

Small change by storing the length in advance of the conditional:

while (de = readdir(...)) {
     int namelen = strlen(de->d_name);
     strbuf_setlen(path, baselen);
     strbuf_add(path, de->d_name, namelen);

     if (namelen == HEXSZ - 2)
         obj_cb(path->buf)
     else
         cruft_cb(path->buf);
}

> Two other thoughts:
>
>    - from past discussions, I suspect most of your performance
>      improvement actually comes from avoiding addf(), and the difference
>      between addstr() and add() may be negligible here. It might be worth
>      timing strbuf_addstr(). If that's similarly fast, that means we
>      could keep the logic out of the conditional entirely.

addstr() duplicates the strlen(), which isn't much but we might as well 
save cycles where we can if it isn't too hard.

>    - there's an extra micro-optimization there, which is that if there's
>      no obj_cb, we have no need to assemble the full path at all. I doubt
>      it makes much of a difference, as most callers would pass an object
>      callback (I'd be surprised if we even have one that doesn't).

After doing a few 'git grep' commands, I found several that include a 
NULL cruft_cb but none that have a NULL obj_cb.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-01 19:50   ` Derrick Stolee
@ 2017-12-01 22:39     ` Jeff King
  2017-12-04 14:06       ` [PATCH v2] " Derrick Stolee
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-12-01 22:39 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee, git, git

On Fri, Dec 01, 2017 at 02:50:05PM -0500, Derrick Stolee wrote:

> > > +	baselen = path->len;
> > We set this here so that the '/' is included as part of the base. Makes
> > sense, but can we now drop the earlier setting of baselen before the
> > opendir() call?
> 
> Yeah, probably. I had briefly considered just adding the '/' before the
> first assignment of "baselen", but didn't want to change the error output. I
> also don't know if there are side effects for other platforms by calling
> opendir() with a '/'-terminated path.

I noticed that, too. Since it's so easy to keep doing the opendir
without the slash, I'd prefer to avoid finding out if there are such
platforms. :)

> Good catch! A big reason to pull it inside and use strbuf_add over
> strbuf_addstr is to avoid a duplicate strlen() calculation. However, I can
> store the length before the conditional.

I'd give 50/50 odds no whether a compiler could optimize out that
strlen. We inline addstr exactly so that callsites can see that strlen
(it's primarily for string literals, where it can become a compile-time
constant, but I think it could apply here). But sometimes C's pointer
aliasing rules can be surprising in blocking "obviously correct"
optimizations like that.

The generated asm is a little dense, but I _think_ "gcc -O2" does in
fact do this with a single strlen based on the following tweak on top of
your patch:

diff --git a/sha1_file.c b/sha1_file.c
index 2160323c4a..f234519744 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1921,11 +1921,12 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
+		strbuf_setlen(path, baselen);
+		strbuf_addstr(path, de->d_name);
+
 		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
 		    !hex_to_bytes(oid.hash + 1, de->d_name,
 				  GIT_SHA1_RAWSZ - 1)) {
-			strbuf_setlen(path, baselen);
-			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
 			if (obj_cb) {
 				r = obj_cb(&oid, path->buf, data);
 				if (r)

Not that I overly mind the manual assignment of the strlen result in
this particular case. But I'm a curious fellow by nature, and knowing
these kinds of answers helps us build up an accurate gut instinct for
future cases.

> Small change by storing the length in advance of the conditional:
> 
> while (de = readdir(...)) {
>     int namelen = strlen(de->d_name);
>     strbuf_setlen(path, baselen);
>     strbuf_add(path, de->d_name, namelen);
> 
>     if (namelen == HEXSZ - 2)
>         obj_cb(path->buf)
>     else
>         cruft_cb(path->buf);
> }

Yup, I don't mind that approach either, but do please use size_t to
store the result of strlen (I know it's nearly impossible to overflow in
this case, but I've been trying to push the codebase in that direction
slowly over time).

> >    - there's an extra micro-optimization there, which is that if there's
> >      no obj_cb, we have no need to assemble the full path at all. I doubt
> >      it makes much of a difference, as most callers would pass an object
> >      callback (I'd be surprised if we even have one that doesn't).
> 
> After doing a few 'git grep' commands, I found several that include a NULL
> cruft_cb but none that have a NULL obj_cb.

Yeah, that agrees with my cursory look.

Thanks!

-Peff

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-01 22:39     ` Jeff King
@ 2017-12-04 14:06       ` Derrick Stolee
  2017-12-04 16:56         ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Derrick Stolee @ 2017-12-04 14:06 UTC (permalink / raw)
  To: git; +Cc: stolee, peff, git, Derrick Stolee

Thanks for the feedback on v1. This version fixes the cruft_cb
bug and streamlines the strlen(). I would include an inter-diff
but it was the same size as the patch.

Thanks,
-Stolee

---

Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --parents --raw -1000'
had the following performance:

 HEAD~1            HEAD
----------------------------------------
 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sha1_file.c              | 12 +++++++-----
 t/perf/p4211-line-log.sh |  4 ++++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2fc8fa93b4 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1903,7 +1903,6 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	origlen = path->len;
 	strbuf_complete(path, '/');
 	strbuf_addf(path, "%02x", subdir_nr);
-	baselen = path->len;

 	dir = opendir(path->buf);
 	if (!dir) {
@@ -1914,15 +1913,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}

 	oid.hash[0] = subdir_nr;
+	strbuf_addch(path, '/');
+	baselen = path->len;

 	while ((de = readdir(dir))) {
+		size_t namelen;
 		if (is_dot_or_dotdot(de->d_name))
 			continue;

+		namelen = strlen(de->d_name);
 		strbuf_setlen(path, baselen);
-		strbuf_addf(path, "/%s", de->d_name);
-
-		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
+		strbuf_add(path, de->d_name, namelen);
+		if (namelen == GIT_SHA1_HEXSZ - 2 &&
 		    !hex_to_bytes(oid.hash + 1, de->d_name,
 				  GIT_SHA1_RAWSZ - 1)) {
 			if (obj_cb) {
@@ -1941,7 +1943,7 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}
 	closedir(dir);

-	strbuf_setlen(path, baselen);
+	strbuf_setlen(path, baselen - 1);
 	if (!r && subdir_cb)
 		r = subdir_cb(subdir_nr, path->buf, data);

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
 	git log --oneline --raw --parents >/dev/null
 '

+test_perf 'git log --oneline --raw --parents -1000' '
+	git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-04 14:06       ` [PATCH v2] " Derrick Stolee
@ 2017-12-04 16:56         ` Jeff King
  2017-12-04 17:12           ` Derrick Stolee
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-12-04 16:56 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, stolee, git

On Mon, Dec 04, 2017 at 09:06:03AM -0500, Derrick Stolee wrote:

> Thanks for the feedback on v1. This version fixes the cruft_cb
> bug and streamlines the strlen(). I would include an inter-diff
> but it was the same size as the patch.

Thanks, this version looks good to me. One procedural nit:

> Thanks,
> -Stolee
> 
> ---

When you put your cover letter first, you need to use "scissors" like:

-- >8 --

to separate it from the commit message. Using three-dashes means "git
am" will include your cover letter as the commit message, and omit your
real commit message entirely.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()
  2017-12-04 16:56         ` Jeff King
@ 2017-12-04 17:12           ` Derrick Stolee
  0 siblings, 0 replies; 7+ messages in thread
From: Derrick Stolee @ 2017-12-04 17:12 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee; +Cc: git, git

On 12/4/2017 11:56 AM, Jeff King wrote:
> When you put your cover letter first, you need to use "scissors" like:
> -- >8 --
>
> to separate it from the commit message. Using three-dashes means "git
> am" will include your cover letter as the commit message, and omit your
> real commit message entirely.

Thanks. I updated our internal best-practices accordingly.

-Stolee

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-12-04 17:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-01 17:49 [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf() Derrick Stolee
2017-12-01 18:22 ` Jeff King
2017-12-01 19:50   ` Derrick Stolee
2017-12-01 22:39     ` Jeff King
2017-12-04 14:06       ` [PATCH v2] " Derrick Stolee
2017-12-04 16:56         ` Jeff King
2017-12-04 17:12           ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).