git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	dstolee@microsoft.com, git@jeffhostetler.com, peff@peff.net,
	johannes.schindelin@gmx.de, jrnieder@gmail.com,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH 20/20] abbrev: add a core.validateAbbrev setting
Date: Fri,  8 Jun 2018 22:41:36 +0000	[thread overview]
Message-ID: <20180608224136.20220-21-avarab@gmail.com> (raw)
In-Reply-To: <20180608224136.20220-1-avarab@gmail.com>

Operations that need to abbreviate a lot of SHA-1s such as 'git log
--oneline' experience degraded performance when traversing a lot of
packs. See [1] and [2] for some relevant performance numbers.

One way to address this is something like the MIDX to make looking up
the SHA-1s cheaper.

This change adds an alternate method of achieving some of the same
ends (but possibly not all, see [3] and replies to the original thread
at [1]).

Instead of trying really hard to find an unambiguous SHA-1 we can with
core.validateAbbrev=false, and preferably combined with the new
support for relative core.abbrev values (such as +4) unconditionally
print a short SHA-1 without doing any disambiguation check. I.e. it
allows for picking a trade-off between performance, and the odds that
future or remote (or current and local) short SHA-1 will be ambiguous.

With the included performance test my git.git repacked with with `git
repack -A -d --max-pack-size=X` gives the following results against
git.git itself with X=64M:

    Test                                        HEAD~             HEAD
    ------------------------------------------------------------------------------------
    0014.1: git log --oneline --raw --parents   2.53(2.48+0.05)   2.20(2.14+0.05) -13.0%

With one big pack -7.6%, and with 16M packs -23.8%.

1. https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
2. https://public-inbox.org/git/20180607140338.32440-1-dstolee@microsoft.com/
3. https://public-inbox.org/git/87lgbsz61p.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config.txt            | 43 ++++++++++++++++++++++++++
 cache.h                             |  1 +
 config.c                            |  7 +++++
 environment.c                       |  1 +
 sha1-name.c                         |  4 +++
 t/perf/p0014-abbrev.sh              | 13 ++++++++
 t/t1512-rev-parse-disambiguation.sh | 47 +++++++++++++++++++++++++++++
 7 files changed, 116 insertions(+)
 create mode 100755 t/perf/p0014-abbrev.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index abf07be7b6..df31d1351f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 that Git would
 otherwise print, this allows for producing more future-proof SHA-1s
 for use within a given project, while adjusting the value for the
 current approximate number of objects.
++
+This is especially useful in combination with the
+`core.validateAbbrev` setting, or to get more future-proof hashes to
+reference in the future in a repository whose number of objects is
+expected to grow.
+
+core.validateAbbrev::
+	If set to false (true by default) don't do any validation when
+	printing abbreviated object names to see if they're really
+	unique. This makes printing objects more performant at the
+	cost of potentially printing object names that aren't unique
+	within the current repository.
++
+When printing abbreviated object names Git needs to look through the
+local object store. This is an `O(log N)` operation assuming all the
+objects are in a single pack file, but `X * O(log N)` given `X` pack
+files, which can get expensive on some larger repositories.
++
+This setting changes that to `O(1)`, but with the trade-off that
+depending on the value of `core.abbrev` we may be printing abbreviated
+hashes that collide. Too see how likely this is, try running:
++
+-----------------------------------------------------------------------------------------------------------
+git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr
+-----------------------------------------------------------------------------------------------------------
++
+This shows how many commits were found at each abbreviation length. On
+linux.git in June 2018 this shows a bit more than 750,000 commits,
+with just 4 needing 11 characters to be fully abbreviated, and the
+default heuristic picks a length of 12.
++
+Even without `core.validateAbbrev=false` the results abbreviation
+already a bit of a probability game. They're guaranteed at the moment
+of generation, but as more objects are added, ambiguities may be
+introduced. Likewise, what's unambiguous for you may not be for
+somebody else you're communicating with, if they have their own clone.
++
+Therefore the default of `core.validateAbbrev=true` may not save you
+in practice if you're sharing the SHA-1 or noting it now to use after
+a `git fetch`. You may be better off setting `core.abbrev` to
+e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
+that with `core.validateAbbrev=false` to get a reasonable trade-off
+between safety and performance.
 
 add.ignoreErrors::
 add.ignore-errors (deprecated)::
diff --git a/cache.h b/cache.h
index 0fb4211804..6dc5af9482 100644
--- a/cache.h
+++ b/cache.h
@@ -773,6 +773,7 @@ extern int quote_path_fully;
 extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
 extern int default_abbrev_relative;
+extern int validate_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
diff --git a/config.c b/config.c
index cd95c6bdfb..fd11e1047d 100644
--- a/config.c
+++ b/config.c
@@ -1146,6 +1146,13 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.validateabbrev")) {
+		if (!value)
+			return config_error_nonbool(var);
+		validate_abbrev =  git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.abbrev")) {
 		if (!value)
 			return config_error_nonbool(var);
diff --git a/environment.c b/environment.c
index 0d48d52fba..4a24d8126b 100644
--- a/environment.c
+++ b/environment.c
@@ -23,6 +23,7 @@ int check_stat = 1;
 int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = -1;
 int default_abbrev_relative = 0;
+int validate_abbrev = 1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1-name.c b/sha1-name.c
index 75f1bef7d1..57dc782782 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -612,6 +612,10 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
 		}
 		len += dar;
 	}
+	if (!validate_abbrev) {
+		hex[len] = 0;
+		return len;
+	}
 
 	mad.init_len = len;
 	mad.cur_len = len;
diff --git a/t/perf/p0014-abbrev.sh b/t/perf/p0014-abbrev.sh
new file mode 100755
index 0000000000..15c3764265
--- /dev/null
+++ b/t/perf/p0014-abbrev.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+test_description="Tests core.validateAbbrev performance"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'git log --oneline --raw --parents' '
+	git -c core.abbrev=15 -c core.validateAbbrev=false log --oneline --raw --parents >/dev/null
+'
+
+test_done
diff --git a/t/t1512-rev-parse-disambiguation.sh b/t/t1512-rev-parse-disambiguation.sh
index 96fe3754c8..89565af508 100755
--- a/t/t1512-rev-parse-disambiguation.sh
+++ b/t/t1512-rev-parse-disambiguation.sh
@@ -232,6 +232,53 @@ test_expect_success 'more history' '
 
 '
 
+test_expect_success 'ambiguous objects and core.abbrev' '
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005
+	00000000008
+	000000000004
+	0000000000e
+	EOF
+	git -c core.abbrev=4 log --pretty=tformat:%h >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'ambiguous objects and core.validateAbbrev' '
+	git -c core.abbrev=4 -c core.validateabbrev=true log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	0000
+	0000
+	0000
+	0000
+	0000
+	EOF
+	git -c core.abbrev=4 -c core.validateabbrev=false log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005b
+	00000000008
+	000000000004
+	0000000000e
+	EOF
+	git -c core.abbrev=+4 log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005b
+	00000000008
+	00000000000
+	0000000000e
+	EOF
+	git -c core.abbrev=+4 -c core.validateabbrev=false log --pretty=tformat:%h >actual &&
+	test_cmp expected actual
+'
+
 test_expect_failure 'parse describe name taking advantage of generation' '
 	# ambiguous at the object name level, but there is only one
 	# such commit at generation 0
-- 
2.17.0.290.gded63e768a


  parent reply	other threads:[~2018-06-08 22:42 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-08 22:41 [PATCH 00/20] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 01/20] t/README: clarify the description of test_line_count Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 02/20] test library: add a test_byte_count Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 03/20] blame doc: explicitly note how --abbrev=40 gives 39 chars Ævar Arnfjörð Bjarmason
2018-06-12 18:10   ` Junio C Hamano
2018-06-08 22:41 ` [PATCH 04/20] abbrev tests: add tests for core.abbrev and --abbrev Ævar Arnfjörð Bjarmason
2018-06-12 18:31   ` Junio C Hamano
2018-06-08 22:41 ` [PATCH 05/20] abbrev tests: test "git-blame" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 06/20] blame: fix a bug, core.abbrev should work like --abbrev Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 07/20] abbrev tests: test "git branch" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 08/20] abbrev tests: test for "git-describe" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 09/20] abbrev tests: test for "git-log" behavior Ævar Arnfjörð Bjarmason
2018-06-09  8:43   ` Martin Ågren
2018-06-09  9:56     ` Ævar Arnfjörð Bjarmason
2018-06-09 13:56       ` Martin Ågren
2018-06-08 22:41 ` [PATCH 10/20] abbrev tests: test for "git-diff" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 11/20] abbrev tests: test for plumbing behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 12/20] abbrev tests: test for --abbrev and core.abbrev=[+-]N Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 13/20] parse-options-cb.c: convert uses of 40 to GIT_SHA1_HEXSZ Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 14/20] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 15/20] parse-options-cb.c: " Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 16/20] abbrev: unify the handling of non-numeric values Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 17/20] abbrev: unify the handling of empty values Ævar Arnfjörð Bjarmason
2018-06-09 14:24   ` Martin Ågren
2018-06-09 14:31     ` Martin Ågren
2018-06-08 22:41 ` [PATCH 18/20] abbrev parsing: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 19/20] abbrev: support relative abbrev values Ævar Arnfjörð Bjarmason
2018-06-09 15:38   ` Martin Ågren
2018-06-12 19:16   ` Junio C Hamano
2018-06-13 22:22     ` Ævar Arnfjörð Bjarmason
2018-06-13 22:34       ` Junio C Hamano
2018-06-14  7:36         ` Ævar Arnfjörð Bjarmason
2018-06-14 15:50           ` Junio C Hamano
2018-06-14 19:07             ` Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` Ævar Arnfjörð Bjarmason [this message]
2018-06-09 15:47   ` [PATCH 20/20] abbrev: add a core.validateAbbrev setting Martin Ågren
2018-06-12 18:58     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180608224136.20220-21-avarab@gmail.com \
    --to=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).