git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 0/3] Improve abbreviation disambiguation
@ 2017-09-15 16:57 Derrick Stolee
  2017-09-15 16:57 ` [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev() Derrick Stolee
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Derrick Stolee @ 2017-09-15 16:57 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, git, kewillf, Derrick Stolee

Hello,

My name is Derrick Stolee and I just switched teams at Microsoft from
the VSTS Git Server to work on performance improvements in core Git.

This is my first patch submission, and I look forward to your feedback.

Thanks,
 Stolee


When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.
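
In other words, instead of one existence query per candidate length, a single
pass over the colliding object names suffices. A minimal standalone sketch of
that idea (the helper and its inputs are illustrative only, not Git API; the
candidates are the other object ids sharing the starting prefix):

```c
#include <assert.h>

/*
 * Hypothetical illustration (not Git code): given the other object ids
 * that share at least `min_len` leading hex characters with `target`,
 * walk past the longest common prefix of each candidate exactly once
 * and report the shortest unambiguous abbreviation length.
 */
static int min_unambiguous_len(const char *target,
			       const char *const *candidates, int n,
			       int min_len)
{
	int cur_len = min_len;
	int c;

	for (c = 0; c < n; c++) {
		int i = min_len;

		/* extend past the common prefix with this candidate */
		while (target[i] && target[i] == candidates[c][i])
			i++;

		/* need one character more than the longest shared prefix */
		if (i + 1 > cur_len)
			cur_len = i + 1;
	}
	return cur_len;
}
```

Each colliding object is inspected once, so the cost is a single scan rather
than one full disambiguation query per candidate length.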

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate this performance improvement. Here are some
performance results for the three included commits, using
GIT_PERF_REPEAT_COUNT=10 since the first test is frequently an outlier
due to the file cache being cold.

Running git on a Linux VM, we see the following gains.

| Repo    | Pack-Files | Loose Objs | Baseline | Patch 2 | Patch 3 |
|---------|------------|------------|----------|---------|---------|
| Git.git | 1          | 0          | 0.46 s   | -87.0%  | -87.0%  |
| Git.git | 5          | 0          | 1.04 s   | -84.6%  | -85.6%  |
| Git.git | 4          | 75852      | 0.88 s   | -86.4%  | -86.4%  |
| Linux   | 1          | 0          | 0.63 s   | -38.1%  | -69.8%  |
| Linux   | 24         | 0          | 5.41 s   | -69.3%  | -71.5%  |
| Linux   | 23         | 323441     | 5.41 s   | -70.6%  | -73.4%  |

Running a similar patch on Git for Windows, we see the following gains.

| Repo          | Pack-Files | Loose | Baseline | Patch 2 | Patch 3 |
|---------------|------------|-------|----------|---------|---------|
| GitForWindows | 6          | 319   | 7.19 s   | -91.1%  | -91.5%  |
| VSTS          | 3          | 38    | 7.83 s   | -88.9%  | -90.9%  |
| Linux         | 3          | 0     | 7.92 s   | -87.9%  | -90.2%  |
| Windows       | 50         | 219   | 17.8 s   | -98.6%  | -98.6%  |

Note that performance improves in all cases, but the gain is larger
when there are multiple, large pack-files. This is because pack index
files are not cached in memory once they have been inspected, so the
looping baseline pays that inspection cost again for every candidate
length.


Derrick Stolee (3):
  sha1_name: Create perf test for find_unique_abbrev()
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix

 Makefile               |  1 +
 sha1_name.c            | 66 ++++++++++++++++++++++++++++++++++++++------------
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 22 +++++++++++++++++
 t/perf/p0008-abbrev.sh | 12 +++++++++
 5 files changed, 87 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

-- 
2.14.1.538.g56ec8fc98.dirty


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
  2017-09-15 16:57 [PATCH 0/3] Improve abbreviation disambiguation Derrick Stolee
@ 2017-09-15 16:57 ` Derrick Stolee
  2017-09-18  0:51   ` Junio C Hamano
  2017-09-15 16:57 ` [PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r Derrick Stolee
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2017-09-15 16:57 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, git, kewillf, Derrick Stolee

Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for 100,000 object ids. The ids are created
by incrementing an unsigned int hash_base by a constant hash_delt and
copying hash_base five times across the sha1. Incrementing by hash_delt
does not produce a duplicate value for over 10,000,000 iterations.

test-abbrev demonstrates the performance improvements that will be
made by the later changes to find_unique_abbrev(). The value of
100,000 is large enough to show the significance of those
improvements while taking only a few seconds on large repos.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile               |  1 +
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 23 +++++++++++++++++++++++
 t/perf/p0008-abbrev.sh | 12 ++++++++++++
 4 files changed, 37 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index f2bb7f2f6..081ca05e8 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..80ce7d836 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 000000000..cb3551df9
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+	setup_git_directory();
+
+	unsigned int hash_delt = 0x13579BDF;
+	unsigned int hash_base = 0x01020304;
+	struct object_id oid;
+
+	int i, count = 0;
+	int n = sizeof(struct object_id) / sizeof(int);
+	while (count++ < 100000) {
+		for (i = 0; i < n; i++)
+			((unsigned int*)oid.hash)[i] = hash_base;
+
+		find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+		hash_base += hash_delt;
+	}
+
+	exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 000000000..7c3fad807
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'find_unique_abbrev()' '
+	test-abbrev
+'
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



* [PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r
  2017-09-15 16:57 [PATCH 0/3] Improve abbreviation disambiguation Derrick Stolee
  2017-09-15 16:57 ` [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev() Derrick Stolee
@ 2017-09-15 16:57 ` Derrick Stolee
  2017-09-15 16:57 ` [PATCH 3/3] sha1_name: Parse less while finding common prefix Derrick Stolee
  2017-09-15 17:08 ` [PATCH 0/3] Improve abbreviation disambiguation Jonathan Nieder
  3 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2017-09-15 16:57 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, git, kewillf, Derrick Stolee

Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sha1_name.c | 57 ++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
 	return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+	unsigned int init_len;
+	unsigned int cur_len;
+	char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-	int status, exists;
+	struct min_abbrev_data *mad = cb_data;
+
+	char *hex = oid_to_hex(oid);
+	unsigned int i = mad->init_len;
+	while (mad->hex[i] && mad->hex[i] == hex[i])
+		i++;
+
+	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+		mad->cur_len = i + 1;
+
+	return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+	struct disambiguate_state ds;
+	struct min_abbrev_data mad;
+	struct object_id oid_ret;
 	if (len < 0) {
 		unsigned long count = approximate_object_count();
 		/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	sha1_to_hex_r(hex, sha1);
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
-	exists = has_sha1_file(sha1);
-	while (len < GIT_SHA1_HEXSZ) {
-		struct object_id oid_ret;
-		status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-		if (exists
-		    ? !status
-		    : status == SHORT_NAME_NOT_FOUND) {
-			hex[len] = 0;
-			return len;
-		}
-		len++;
-	}
-	return len;
+
+	if (init_object_disambiguation(hex, len, &ds) < 0)
+		return -1;
+
+	mad.init_len = len;
+	mad.cur_len = len;
+	mad.hex = hex;
+
+	ds.fn = extend_abbrev_len;
+	ds.always_call_fn = 1;
+	ds.cb_data = (void*)&mad;
+
+	find_short_object_filename(&ds);
+	find_short_packed_object(&ds);
+	(void)finish_object_disambiguation(&ds, &oid_ret);
+
+	hex[mad.cur_len] = 0;
+	return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



* [PATCH 3/3] sha1_name: Parse less while finding common prefix
  2017-09-15 16:57 [PATCH 0/3] Improve abbreviation disambiguation Derrick Stolee
  2017-09-15 16:57 ` [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev() Derrick Stolee
  2017-09-15 16:57 ` [PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r Derrick Stolee
@ 2017-09-15 16:57 ` Derrick Stolee
  2017-09-15 17:08 ` [PATCH 0/3] Improve abbreviation disambiguation Jonathan Nieder
  3 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2017-09-15 16:57 UTC (permalink / raw)
  To: git; +Cc: johannes.schindelin, git, kewillf, Derrick Stolee

Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

This change decreases the time to run test-abbrev by up to 40% on
large repos.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sha1_name.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
 	char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+	static const char hex[] = "0123456789abcdef";
+
+	if ((i & 1) == 0)
+		return hex[oid->hash[i >> 1] >> 4];
+	else
+		return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
 	struct min_abbrev_data *mad = cb_data;
 
-	char *hex = oid_to_hex(oid);
 	unsigned int i = mad->init_len;
-	while (mad->hex[i] && mad->hex[i] == hex[i])
+	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;
 
 	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



* Re: [PATCH 0/3] Improve abbreviation disambiguation
  2017-09-15 16:57 [PATCH 0/3] Improve abbreviation disambiguation Derrick Stolee
                   ` (2 preceding siblings ...)
  2017-09-15 16:57 ` [PATCH 3/3] sha1_name: Parse less while finding common prefix Derrick Stolee
@ 2017-09-15 17:08 ` Jonathan Nieder
  3 siblings, 0 replies; 8+ messages in thread
From: Jonathan Nieder @ 2017-09-15 17:08 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, johannes.schindelin, git, kewillf

Hi,

Derrick Stolee wrote:

> This is my first patch submission, and I look forward to your feedback.

Thanks for writing this.  Looks exciting.

[...]
> When displaying object ids, we frequently want to see an abbreviation
[etc]
> Note that performance improves in all cases, but the performance gain
> is larger when there are multiple, large pack-files. This gain comes
> from the lack of in-memory caching of index files that have already been
> inspected.

Can this (especially the performance information) go in the commit
message for one of the patches?

Otherwise it is harder for people to track down when looking at these
changes later with "git log".  In the worst case, if the mailing list
archive is down, then people would not have access to this information
at all.  For that reason, I try (though I often fail!) to restrict
cover letters to giving hints at e.g. what changed since the patch
series last visited the list, and to make the commit messages in the
patches themselves as self-contained as possible.

That aside, looking forward to reading the patches themselves in more
detail.  Thanks for working on this.

Sincerely,
Jonathan


* Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
  2017-09-15 16:57 ` [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev() Derrick Stolee
@ 2017-09-18  0:51   ` Junio C Hamano
  2017-09-18 11:36     ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2017-09-18  0:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, johannes.schindelin, git, kewillf

Derrick Stolee <dstolee@microsoft.com> writes:

> +int cmd_main(int ac, const char **av)
> +{
> +	setup_git_directory();

As far as I recall, we do not (yet) allow declaration after
statement in our codebase.  Move this down to make it after all
decls.

> +
> +	unsigned int hash_delt = 0x13579BDF;
> +	unsigned int hash_base = 0x01020304;
> +	struct object_id oid;
> +
> +	int i, count = 0;
> +	int n = sizeof(struct object_id) / sizeof(int);

It probably is technically OK to assume sizeof(int) always equals to
sizeof(unsigned), but because you use 'n' _only_ to work with uint
and never with int, it would make more sense to match.  

But I do not think we want this "clever" optimization that involves
'n' in the first place.

> +	while (count++ < 100000) {
> +		for (i = 0; i < n; i++)
> +			((unsigned int*)oid.hash)[i] = hash_base;

Does it make sense to assume that uint is always 4-byte (so this
code won't work if it is 8-byte on your platform) and doing this is
faster than using platform-optimized memcpy()?

> +		find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
> +
> +		hash_base += hash_delt;
> +	}

I also wonder if this is measuring the right thing.  I am guessing
that by making many queries for a unique abbreviation of "random"
(well, it is deterministic, but my point is these queries are not
based on the names of objects that exist in the repository) hashes,
this test measures how much time it takes for us to decide that such
a random hash does not exist.  In the real life use, we make the
opposite query far more frequently: we have an object that we _know_
exists in the repository and we try to find a sufficient length to
disambiguate it from others, and I suspect that these two use
different logic.  Don't you need to be measuring the time it takes
to compute the shortest abbreviation of an object that exists
instead?

> +	exit(0);
> +}
> diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
> new file mode 100755
> index 000000000..7c3fad807
> --- /dev/null
> +++ b/t/perf/p0008-abbrev.sh
> @@ -0,0 +1,12 @@
> +#!/bin/sh
> +
> +test_description='Test object disambiguation through abbreviations'
> +. ./perf-lib.sh
> +
> +test_perf_large_repo
> +
> +test_perf 'find_unique_abbrev()' '
> +	test-abbrev
> +'
> +
> +test_done


* Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
  2017-09-18  0:51   ` Junio C Hamano
@ 2017-09-18 11:36     ` Derrick Stolee
  2017-09-19  0:51       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2017-09-18 11:36 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: git, johannes.schindelin, git, kewillf

On 9/17/2017 5:51 PM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
> 
>> +int cmd_main(int ac, const char **av)
>> +{
>> +	setup_git_directory();
> 
> As far as I recall, we do not (yet) allow declaration after
> statement in our codebase.  Move this down to make it after all
> decls.

Will fix.

>> +
>> +	unsigned int hash_delt = 0x13579BDF;
>> +	unsigned int hash_base = 0x01020304;
>> +	struct object_id oid;
>> +
>> +	int i, count = 0;
>> +	int n = sizeof(struct object_id) / sizeof(int);
> 
> It probably is technically OK to assume sizeof(int) always equals to
> sizeof(unsigned), but because you use 'n' _only_ to work with uint
> and never with int, it would make more sense to match.

Will fix. I also notice that "n" should be const.

> But I do not think we want this "clever" optimization that involves
> 'n' in the first place.
 >>> +	while (count++ < 100000) {
>> +		for (i = 0; i < n; i++)
>> +			((unsigned int*)oid.hash)[i] = hash_base;
> 
> Does it make sense to assume that uint is always 4-byte (so this
> code won't work if it is 8-byte on your platform) and doing this is
> faster than using platform-optimized memcpy()?

I'm not sure what you mean by using memcpy to improve this, because
it would require calling memcpy in the inner loop, such as

	for (i = 0; i < n; i++)
		memcpy(oid.hash + i * sizeof(unsigned), &hash_base,
		       sizeof(unsigned));

I'm probably misunderstanding your intended use of memcpy().

>> +		find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
>> +
>> +		hash_base += hash_delt;
>> +	}
> 
> I also wonder if this is measuring the right thing.  I am guessing
> that by making many queries for a unique abbreviation of "random"
> (well, it is deterministic, but my point is these queries are not
> based on the names of objects that exist in the repository) hashes,
> this test measures how much time it takes for us to decide that such
> a random hash does not exist.  In the real life use, we make the
> opposite query far more frequently: we have an object that we _know_
> exists in the repository and we try to find a sufficient length to
> disambiguate it from others, and I suspect that these two use
> different logic.  Don't you need to be measuring the time it takes
> to compute the shortest abbreviation of an object that exists
> instead?

First, this doesn't just measure the time it takes to determine non-
existence, because it finds the len required to disambiguate an
"incoming" hash from all known hashes. When testing, I put in a
simple printf to report the result abbreviation so I could see how
often it needed to be extended. In this sense, the test exposes the
while loop that is removed by PATCH 2/3.

Second, your criticism about extant hashes is valid, and one I
struggled to reconcile. I see two issues with testing known hashes:

1. By determining the hash exists, we have inspected the file that
    contains it (either a .idx or the loose-object directory). This
    has side-effects that warm up the file cache so the looped method
    is artificially faster to find matching objects. The effect is
    particularly significant on a repo with exactly one packfile.

2. If we iterate over the entire set of objects, this test takes
    O(N*t(N)) time, where t(N) is the average time to compute a
    minimum abbreviation. For large repos, this can take several
    minutes. Instead, even with the constant 100,000 test hashes, we
    have an O(t(N)) test. We could avoid the asymptotic growth by
    limiting the number of existing hashes we use, but how do we
    find a sufficiently uniform sample from them?

By looking at some other perf tests, I see that we can add a pre-
requisite action. I will investigate making another helper that
uniformly selects a set of hashes from the repo and writes them
to stdout in a random order. p0008-abbrev.sh will run the helper as
a prerequisite, saving the data to a file. test-abbrev will read
the hashes from stdin to test find_unique_abbrev. This should avoid
the side-effects I mentioned (test-abbrev will not warm up indexes)
while also testing abbreviation lengths for existing objects.
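
One way to select uniformly without knowing the object count up front is
classic reservoir sampling; a sketch under assumptions (the helper below is
hypothetical and operates on ints for brevity, not part of Git):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical sketch (not Git code): reservoir sampling keeps `k`
 * items from a stream of `n` items, each with equal probability,
 * without needing `n` in advance. A helper walking all object names
 * could use the same scheme to pick a uniform test sample.
 */
static void reservoir_sample(const int *stream, int n, int *out, int k)
{
	int i;

	/* fill the reservoir with the first k items */
	for (i = 0; i < k && i < n; i++)
		out[i] = stream[i];

	/* item i replaces a random reservoir slot with probability k/(i+1) */
	for (; i < n; i++) {
		int j = rand() % (i + 1);
		if (j < k)
			out[j] = stream[i];
	}
}
```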

I'll get started on these changes and send a new patch with new perf
data in a couple days. Please let me know if there is a better way
to measure performance for this change.

Thanks,
-Stolee



* Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
  2017-09-18 11:36     ` Derrick Stolee
@ 2017-09-19  0:51       ` Junio C Hamano
  0 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2017-09-19  0:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee, git, johannes.schindelin, git, kewillf

Derrick Stolee <stolee@gmail.com> writes:

>> But I do not think we want this "clever" optimization that involves
>> 'n' in the first place.
>>>> +	while (count++ < 100000) {
>>> +		for (i = 0; i < n; i++)
>>> +			((unsigned int*)oid.hash)[i] = hash_base;
>>
>> Does it make sense to assume that uint is always 4-byte (so this
>> code won't work if it is 8-byte on your platform) and doing this is
>> faster than using platform-optimized memcpy()?
>
> I'm not sure what you mean by using memcpy to improve this, because
> it would require calling memcpy in the inner loop, such as
>
> 	for (i = 0; i < n; i++)
> 		memcpy(oid.hash + i * sizeof(unsigned), &hash_base,
> 		       sizeof(unsigned));

Sorry, I left it without saying as I thought it was obvious, but
what I meant was to use a whole "struct oid", not just a single
unsigned (repeated), as the hash that is tested.  If you have an
array of object names you use in the test, then

	for (count = 0; count < limit; count++) {
		hashcpy(&oid.hash, &samples[count]);

		... do the probing ...
	}

> First, this doesn't just measure the time it takes to determine non-
> existence,

Sorry, my phrasing was indeed misleading.  I do not think the time we
spend to see if we have or do not have the object is the largest cycle
spender in these codepaths (and even if it were, I do not think that
is what you are trying to optimize in these patches anyway).

But if I recall correctly, the way we come up with the unique
abbreviation for an object that exists and an object that does not
exist are different?  And because most of the time (think: "git log
-p" output) we would be finding abbreviation for objects that we do
have, benchmarking and tweaking the code that comes up with an
object that does not exist is not optimizing for the right case.

Back when I wrote that initial response, I didn't check how
different the code was between the two cases, but now I did.  It
seems that in both cases we start from the shortest-allowed length
and then extend the same way, and the only difference between these
two cases is that we return immediately when our candidate name is
long enough not to match any existing object when the full name
refers to an object we do not have, while we return only when
disambiguity is resolved.  I _think_ these amount to the same
computation (i.e. when an object with the full name we have exists,
the amount of computation we need to come up with its unique
abbreviation is the same as the computation we need to find the
unique abbreviation for the same name in another repository that has
identical set of objects, minus that single object), so from that
point of view, throwing random hashes, most of which would not name
any existing object, and measuring how much time it takes to run
get_short_oid() to compute the minimum length of the unique prefix
should be sufficient.

Thanks.


