[PATCH v2 0/4] Hash Abstraction

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v2 0/4] Hash Abstraction
@ 2017-10-28 18:12 brian m. carlson
  2017-10-28 18:12 ` [PATCH v2 1/4] setup: expose enumerated repo info brian m. carlson
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-28 18:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Stefan Beller, Brandon Williams

This is a series proposing a basic abstraction for hash functions.

As we get closer to converting the remainder of the codebase to use
struct object_id, we should think about the design we want our hash
function abstraction to take.  This series is a proposal for one idea.
Input on any aspect of this proposal is welcome.

This series exposes a struct git_hash_algo that contains basic
information about a given hash algorithm that distinguishes it from
other algorithms: name, identifiers, lengths, implementing functions,
and empty tree and blob constants.  It also exposes an array of hash
algorithms, and a constant for indexing them.

The series also demonstrates a simple conversion using the abstraction
over empty blob and tree values.

In order to avoid conflicting with the struct repository work and with
the goal of avoiding global variables as much as possible, I've pushed
the hash algorithm into struct repository and exposed it via a #define.

I propose this series now as it will inform the way we go about
converting other parts of the codebase, especially some of the pack
algorithms.  Because we share some hash computation code between pack
checksums and object hashing, we need to decide whether to expose pack
checksums as struct object_id, even though they are technically not
object IDs.  Furthermore, if we end up needing to stuff an algorithm
value into struct object_id, we'll no longer be able to directly
reference object IDs in a pack without a copy.

I've updated this series in some significant ways to reflect and better
implement the transition plan as it's developed.  If there are ways
in which this series (or future series) can converge better on the
transition plan, that input would be valuable.

This series is available from the usual places as branch hash-struct,
based against master as of 2.15-rc2.

Changes from v1:
* Rebase onto 2.15-rc2.
* Fix the uninitialized value that Peff pointed out.  This fixes the
  error, but leaves the code in the same place, since I think it's where
  it should be.
* Improve commit message to explain the meaning of current_hash WRT the
  transition plan.
* Added an unknown hash algorithm constant and value to better implement
  the transition plan.
* Explain in the commit message why hex size and binary size are both
  provided.
* Add a format_id field to the struct, in coordination with the
  transition plan.
* Improve comments for struct fields and constants.

brian m. carlson (4):
  setup: expose enumerated repo info
  Add structure representing hash algorithm
  Integrate hash algorithm support with repo setup
  Switch empty tree and blob lookups to use hash abstraction

 builtin/am.c       |  2 +-
 builtin/checkout.c |  2 +-
 builtin/diff.c     |  2 +-
 builtin/pull.c     |  2 +-
 cache.h            | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 diff-lib.c         |  2 +-
 merge-recursive.c  |  2 +-
 notes-merge.c      |  2 +-
 repository.c       |  7 ++++++
 repository.h       |  5 ++++
 sequencer.c        |  6 ++---
 setup.c            | 49 ++++++++++++++++++++++-----------------
 sha1_file.c        | 43 +++++++++++++++++++++++++++++++++++
 submodule.c        |  2 +-
 14 files changed, 157 insertions(+), 36 deletions(-)

-- 
2.15.0.rc2

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/4] setup: expose enumerated repo info
  2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
@ 2017-10-28 18:12 ` brian m. carlson
  2017-10-30 16:08   ` Stefan Beller
  2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2017-10-28 18:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Stefan Beller, Brandon Williams

We enumerate several different items as part of struct
repository_format, but then actually set up those values using the
global variables we've initialized from them.  Instead, let's pass a
pointer to the structure down to the code where we enumerate these
values, so we can later on use those values directly to perform setup.

This technique makes it easier for us to determine additional items
about the repository format (such as the hash algorithm) and then use
them for setup later on, without needing to add additional global
variables.  We can't avoid using the existing global variables since
they're intricately intertwined with how things work at the moment, but
this improves things for the future.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 setup.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/setup.c b/setup.c
index 03f51e056c..f0f509fe85 100644
--- a/setup.c
+++ b/setup.c
@@ -432,16 +432,15 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 	return 0;
 }
 
-static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
+static int check_repository_format_gently(const char *gitdir, struct repository_format *candidate, int *nongit_ok)
 {
 	struct strbuf sb = STRBUF_INIT;
 	struct strbuf err = STRBUF_INIT;
-	struct repository_format candidate;
 	int has_common;
 
 	has_common = get_common_dir(&sb, gitdir);
 	strbuf_addstr(&sb, "/config");
-	read_repository_format(&candidate, sb.buf);
+	read_repository_format(candidate, sb.buf);
 	strbuf_release(&sb);
 
 	/*
@@ -449,10 +448,10 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	 * we treat a missing config as a silent "ok", even when nongit_ok
 	 * is unset.
 	 */
-	if (candidate.version < 0)
+	if (candidate->version < 0)
 		return 0;
 
-	if (verify_repository_format(&candidate, &err) < 0) {
+	if (verify_repository_format(candidate, &err) < 0) {
 		if (nongit_ok) {
 			warning("%s", err.buf);
 			strbuf_release(&err);
@@ -462,21 +461,21 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 		die("%s", err.buf);
 	}
 
-	repository_format_precious_objects = candidate.precious_objects;
-	string_list_clear(&candidate.unknown_extensions, 0);
+	repository_format_precious_objects = candidate->precious_objects;
+	string_list_clear(&candidate->unknown_extensions, 0);
 	if (!has_common) {
-		if (candidate.is_bare != -1) {
-			is_bare_repository_cfg = candidate.is_bare;
+		if (candidate->is_bare != -1) {
+			is_bare_repository_cfg = candidate->is_bare;
 			if (is_bare_repository_cfg == 1)
 				inside_work_tree = -1;
 		}
-		if (candidate.work_tree) {
+		if (candidate->work_tree) {
 			free(git_work_tree_cfg);
-			git_work_tree_cfg = candidate.work_tree;
+			git_work_tree_cfg = candidate->work_tree;
 			inside_work_tree = -1;
 		}
 	} else {
-		free(candidate.work_tree);
+		free(candidate->work_tree);
 	}
 
 	return 0;
@@ -623,6 +622,7 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 
 static const char *setup_explicit_git_dir(const char *gitdirenv,
 					  struct strbuf *cwd,
+					  struct repository_format *repo_fmt,
 					  int *nongit_ok)
 {
 	const char *work_tree_env = getenv(GIT_WORK_TREE_ENVIRONMENT);
@@ -648,7 +648,7 @@ static const char *setup_explicit_git_dir(const char *gitdirenv,
 		die("Not a git repository: '%s'", gitdirenv);
 	}
 
-	if (check_repository_format_gently(gitdirenv, nongit_ok)) {
+	if (check_repository_format_gently(gitdirenv, repo_fmt, nongit_ok)) {
 		free(gitfile);
 		return NULL;
 	}
@@ -721,9 +721,10 @@ static const char *setup_explicit_git_dir(const char *gitdirenv,
 
 static const char *setup_discovered_git_dir(const char *gitdir,
 					    struct strbuf *cwd, int offset,
+					    struct repository_format *repo_fmt,
 					    int *nongit_ok)
 {
-	if (check_repository_format_gently(gitdir, nongit_ok))
+	if (check_repository_format_gently(gitdir, repo_fmt, nongit_ok))
 		return NULL;
 
 	/* --work-tree is set without --git-dir; use discovered one */
@@ -735,7 +736,7 @@ static const char *setup_discovered_git_dir(const char *gitdir,
 			gitdir = to_free = real_pathdup(gitdir, 1);
 		if (chdir(cwd->buf))
 			die_errno("Could not come back to cwd");
-		ret = setup_explicit_git_dir(gitdir, cwd, nongit_ok);
+		ret = setup_explicit_git_dir(gitdir, cwd, repo_fmt, nongit_ok);
 		free(to_free);
 		return ret;
 	}
@@ -767,11 +768,12 @@ static const char *setup_discovered_git_dir(const char *gitdir,
 
 /* #16.1, #17.1, #20.1, #21.1, #22.1 (see t1510) */
 static const char *setup_bare_git_dir(struct strbuf *cwd, int offset,
+				      struct repository_format *repo_fmt,
 				      int *nongit_ok)
 {
 	int root_len;
 
-	if (check_repository_format_gently(".", nongit_ok))
+	if (check_repository_format_gently(".", repo_fmt, nongit_ok))
 		return NULL;
 
 	setenv(GIT_IMPLICIT_WORK_TREE_ENVIRONMENT, "0", 1);
@@ -783,7 +785,7 @@ static const char *setup_bare_git_dir(struct strbuf *cwd, int offset,
 		gitdir = offset == cwd->len ? "." : xmemdupz(cwd->buf, offset);
 		if (chdir(cwd->buf))
 			die_errno("Could not come back to cwd");
-		return setup_explicit_git_dir(gitdir, cwd, nongit_ok);
+		return setup_explicit_git_dir(gitdir, cwd, repo_fmt, nongit_ok);
 	}
 
 	inside_git_dir = 1;
@@ -1024,6 +1026,7 @@ const char *setup_git_directory_gently(int *nongit_ok)
 	static struct strbuf cwd = STRBUF_INIT;
 	struct strbuf dir = STRBUF_INIT, gitdir = STRBUF_INIT;
 	const char *prefix;
+	struct repository_format repo_fmt;
 
 	/*
 	 * We may have read an incomplete configuration before
@@ -1051,18 +1054,18 @@ const char *setup_git_directory_gently(int *nongit_ok)
 		prefix = NULL;
 		break;
 	case GIT_DIR_EXPLICIT:
-		prefix = setup_explicit_git_dir(gitdir.buf, &cwd, nongit_ok);
+		prefix = setup_explicit_git_dir(gitdir.buf, &cwd, &repo_fmt, nongit_ok);
 		break;
 	case GIT_DIR_DISCOVERED:
 		if (dir.len < cwd.len && chdir(dir.buf))
 			die(_("Cannot change to '%s'"), dir.buf);
 		prefix = setup_discovered_git_dir(gitdir.buf, &cwd, dir.len,
-						  nongit_ok);
+						  &repo_fmt, nongit_ok);
 		break;
 	case GIT_DIR_BARE:
 		if (dir.len < cwd.len && chdir(dir.buf))
 			die(_("Cannot change to '%s'"), dir.buf);
-		prefix = setup_bare_git_dir(&cwd, dir.len, nongit_ok);
+		prefix = setup_bare_git_dir(&cwd, dir.len, &repo_fmt, nongit_ok);
 		break;
 	case GIT_DIR_HIT_CEILING:
 		prefix = setup_nongit(cwd.buf, nongit_ok);
@@ -1169,7 +1172,8 @@ int git_config_perm(const char *var, const char *value)
 
 void check_repository_format(void)
 {
-	check_repository_format_gently(get_git_dir(), NULL);
+	struct repository_format repo_fmt;
+	check_repository_format_gently(get_git_dir(), &repo_fmt, NULL);
 	startup_info->have_repository = 1;
 }
 
-- 
2.15.0.rc2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
  2017-10-28 18:12 ` [PATCH v2 1/4] setup: expose enumerated repo info brian m. carlson
@ 2017-10-28 18:12 ` brian m. carlson
  2017-10-29  1:36   ` Eric Sunshine
                     ` (2 more replies)
  2017-10-28 18:12 ` [PATCH v2 3/4] Integrate hash algorithm support with repo setup brian m. carlson
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-28 18:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Stefan Beller, Brandon Williams

Since in the future we want to support an additional hash algorithm, add
a structure that represents a hash algorithm and all the data that must
go along with it.  Add a constant to allow easy enumeration of hash
algorithms.  Implement function typedefs to create an abstract API that
can be used by any hash algorithm, and wrappers for the existing SHA1
functions that conform to this API.

Expose a value for hex size as well as binary size.  While one will
always be twice the other, the two values are both used extremely
commonly throughout the codebase and providing both leads to improved
readability.

Don't include an entry in the hash algorithm structure for the null
object ID.  As this value is all zeros, any suitably sized all-zero
object ID can be used, and there's no need to store a given one on a
per-hash basis.

The current hash function transition plan envisions a time when we will
accept input from the user that might be in SHA-1 or in the NewHash
format.  Since we cannot know which the user has provided, add a
constant representing the unknown algorithm to allow us to indicate that
we must look the correct value up.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
I believe I did get the format_id constant for SHA-1 right, but
sanity-checking would be appreciated.  We don't want another Referer
mess.

 cache.h     | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 sha1_file.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/cache.h b/cache.h
index 6440e2bf21..9e9eb08f05 100644
--- a/cache.h
+++ b/cache.h
@@ -77,6 +77,61 @@ struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
 };
 
+/*
+ * Note that these constants are suitable for indexing the hash_algos array and
+ * comparing against each other, but are otherwise arbitrary, so they should not
+ * be exposed to the user or serialized to disk.  To know whether a
+ * git_hash_algo struct points to some usable hash function, test the format_id
+ * field for being non-zero.  Use the name field for user-visible situations and
+ * the format_id field for fixed-length fields on disk.
+ */
+/* An unknown hash function. */
+#define GIT_HASH_UNKNOWN 0
+/* SHA-1 */
+#define GIT_HASH_SHA1 1
+/* Number of algorithms supported (including unknown). */
+#define GIT_HASH_NALGOS (GIT_HASH_SHA1 + 1)
+
+typedef void (*git_hash_init_fn)(void *ctx);
+typedef void (*git_hash_update_fn)(void *ctx, const void *in, size_t len);
+typedef void (*git_hash_final_fn)(unsigned char *hash, void *ctx);
+
+struct git_hash_algo {
+	/*
+	 * The name of the algorithm, as appears in the config file and in
+	 * messages.
+	 */
+	const char *name;
+
+	/* A four-byte version identifier, used in pack indices. */
+	uint32_t format_id;
+
+	/* The size of a hash context (e.g. git_SHA_CTX). */
+	size_t ctxsz;
+
+	/* The length of the hash in binary. */
+	size_t rawsz;
+
+	/* The length of the hash in hex characters. */
+	size_t hexsz;
+
+	/* The hash initialization function. */
+	git_hash_init_fn init_fn;
+
+	/* The hash update function. */
+	git_hash_update_fn update_fn;
+
+	/* The hash finalization function. */
+	git_hash_final_fn final_fn;
+
+	/* The OID of the empty tree. */
+	const struct object_id *empty_tree;
+
+	/* The OID of the empty blob. */
+	const struct object_id *empty_blob;
+};
+extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
+
 #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
 #define DTYPE(de)	((de)->d_type)
 #else
diff --git a/sha1_file.c b/sha1_file.c
index 10c3a0083d..77b320126a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -39,6 +39,49 @@ const struct object_id empty_blob_oid = {
 	EMPTY_BLOB_SHA1_BIN_LITERAL
 };
 
+static inline void git_hash_sha1_init(void *ctx)
+{
+	git_SHA1_Init((git_SHA_CTX *)ctx);
+}
+
+static inline void git_hash_sha1_update(void *ctx, const void *data, size_t len)
+{
+	git_SHA1_Update((git_SHA_CTX *)ctx, data, len);
+}
+
+static inline void git_hash_sha1_final(unsigned char *hash, void *ctx)
+{
+	git_SHA1_Final(hash, (git_SHA_CTX *)ctx);
+}
+
+const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
+	{
+		NULL,
+		0x00000000,
+		0,
+		0,
+		0,
+		NULL,
+		NULL,
+		NULL,
+		NULL,
+		NULL,
+	},
+	{
+		"sha-1",
+		/* "sha1", big-endian */
+		0x73686131,
+		sizeof(git_SHA_CTX),
+		GIT_SHA1_RAWSZ,
+		GIT_SHA1_HEXSZ,
+		git_hash_sha1_init,
+		git_hash_sha1_update,
+		git_hash_sha1_final,
+		&empty_tree_oid,
+		&empty_blob_oid,
+	},
+};
+
 /*
  * This is meant to hold a *small* number of objects that you would
  * want read_sha1_file() to be able to return, but yet you do not want
-- 
2.15.0.rc2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
  2017-10-28 18:12 ` [PATCH v2 1/4] setup: expose enumerated repo info brian m. carlson
  2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
@ 2017-10-28 18:12 ` brian m. carlson
  2017-10-29  1:44   ` Eric Sunshine
  2017-10-30 16:27   ` Stefan Beller
  2017-10-28 18:12 ` [PATCH v2 4/4] Switch empty tree and blob lookups to use hash abstraction brian m. carlson
  2017-10-30 16:45 ` [PATCH v2 0/4] Hash Abstraction Stefan Beller
  4 siblings, 2 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-28 18:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Stefan Beller, Brandon Williams

In future versions of Git, we plan to support an additional hash
algorithm.  Integrate the enumeration of hash algorithms with repository
setup, and store a pointer to the enumerated data in struct repository.
Of course, we currently only support SHA-1, so hard-code this value in
read_repository_format.  In the future, we'll enumerate this value from
the configuration.

Add a constant, current_hash, which points to the hash_algo structure
pointer in the repository global.  Note that this is the hash which is
used to serialize data to disk, not the hash which is used to display
items to the user.  The transition plan anticipates that these may be
different.  We can add an additional element in the future (say,
ui_hash_algo) to provide for this case.

Include repository.h in cache.h since we now need to have access to
these struct and variable definitions.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 cache.h      | 4 ++++
 repository.c | 7 +++++++
 repository.h | 5 +++++
 setup.c      | 3 +++
 4 files changed, 19 insertions(+)

diff --git a/cache.h b/cache.h
index 9e9eb08f05..bce57c74c4 100644
--- a/cache.h
+++ b/cache.h
@@ -14,6 +14,7 @@
 #include "hash.h"
 #include "path.h"
 #include "sha1-array.h"
+#include "repository.h"
 
 #ifndef platform_SHA_CTX
 /*
@@ -132,6 +133,8 @@ struct git_hash_algo {
 };
 extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
 
+#define current_hash the_repository->hash_algo
+
 #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
 #define DTYPE(de)	((de)->d_type)
 #else
@@ -920,6 +923,7 @@ struct repository_format {
 	int version;
 	int precious_objects;
 	int is_bare;
+	int hash_algo;
 	char *work_tree;
 	struct string_list unknown_extensions;
 };
diff --git a/repository.c b/repository.c
index bb2fae5446..c63bff05a5 100644
--- a/repository.c
+++ b/repository.c
@@ -64,6 +64,11 @@ void repo_set_gitdir(struct repository *repo, const char *path)
 	free(old_gitdir);
 }
 
+void repo_set_hash_algo(struct repository *repo, int hash_algo)
+{
+	repo->hash_algo = &hash_algos[hash_algo];
+}
+
 /*
  * Attempt to resolve and set the provided 'gitdir' for repository 'repo'.
  * Return 0 upon success and a non-zero value upon failure.
@@ -136,6 +141,8 @@ int repo_init(struct repository *repo, const char *gitdir, const char *worktree)
 	if (read_and_verify_repository_format(&format, repo->commondir))
 		goto error;
 
+	repo->hash_algo = &hash_algos[format.hash_algo];
+
 	if (worktree)
 		repo_set_worktree(repo, worktree);
 
diff --git a/repository.h b/repository.h
index 7f5e24a0a2..0329e40c7f 100644
--- a/repository.h
+++ b/repository.h
@@ -4,6 +4,7 @@
 struct config_set;
 struct index_state;
 struct submodule_cache;
+struct git_hash_algo;
 
 struct repository {
 	/* Environment */
@@ -67,6 +68,9 @@ struct repository {
 	 */
 	struct index_state *index;
 
+	/* Repository's current hash algorithm, as serialized on disk. */
+	const struct git_hash_algo *hash_algo;
+
 	/* Configurations */
 	/*
 	 * Bit used during initialization to indicate if repository state (like
@@ -86,6 +90,7 @@ extern struct repository *the_repository;
 
 extern void repo_set_gitdir(struct repository *repo, const char *path);
 extern void repo_set_worktree(struct repository *repo, const char *path);
+extern void repo_set_hash_algo(struct repository *repo, int algo);
 extern int repo_init(struct repository *repo, const char *gitdir, const char *worktree);
 extern int repo_submodule_init(struct repository *submodule,
 			       struct repository *superproject,
diff --git a/setup.c b/setup.c
index f0f509fe85..d99f8a82ff 100644
--- a/setup.c
+++ b/setup.c
@@ -486,6 +486,7 @@ int read_repository_format(struct repository_format *format, const char *path)
 	memset(format, 0, sizeof(*format));
 	format->version = -1;
 	format->is_bare = -1;
+	format->hash_algo = GIT_HASH_SHA1;
 	string_list_init(&format->unknown_extensions, 1);
 	git_config_from_file(check_repo_format, path, format);
 	return format->version;
@@ -1111,6 +1112,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
 			repo_set_gitdir(the_repository, gitdir);
 			setup_git_env();
 		}
+		if (startup_info->have_repository)
+			repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
 	}
 
 	strbuf_release(&dir);
-- 
2.15.0.rc2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/4] Switch empty tree and blob lookups to use hash abstraction
  2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
                   ` (2 preceding siblings ...)
  2017-10-28 18:12 ` [PATCH v2 3/4] Integrate hash algorithm support with repo setup brian m. carlson
@ 2017-10-28 18:12 ` brian m. carlson
  2017-10-30 16:45 ` [PATCH v2 0/4] Hash Abstraction Stefan Beller
  4 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-28 18:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Stefan Beller, Brandon Williams

Switch the uses of empty_tree_oid and empty_blob_oid to use the
current_hash abstraction that represents the current hash algorithm in
use.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 builtin/am.c       | 2 +-
 builtin/checkout.c | 2 +-
 builtin/diff.c     | 2 +-
 builtin/pull.c     | 2 +-
 cache.h            | 8 ++++----
 diff-lib.c         | 2 +-
 merge-recursive.c  | 2 +-
 notes-merge.c      | 2 +-
 sequencer.c        | 6 +++---
 submodule.c        | 2 +-
 10 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/builtin/am.c b/builtin/am.c
index d7513f5375..7f71974b79 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -1433,7 +1433,7 @@ static void write_index_patch(const struct am_state *state)
 	if (!get_oid_tree("HEAD", &head))
 		tree = lookup_tree(&head);
 	else
-		tree = lookup_tree(&empty_tree_oid);
+		tree = lookup_tree(current_hash->empty_tree);
 
 	fp = xfopen(am_path(state, "patch"), "w");
 	init_revisions(&rev_info, NULL);
diff --git a/builtin/checkout.c b/builtin/checkout.c
index fc4f8fd2ea..e7878fa62e 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -516,7 +516,7 @@ static int merge_working_tree(const struct checkout_opts *opts,
 		}
 		tree = parse_tree_indirect(old->commit ?
 					   &old->commit->object.oid :
-					   &empty_tree_oid);
+					   current_hash->empty_tree);
 		init_tree_desc(&trees[0], tree->buffer, tree->size);
 		tree = parse_tree_indirect(&new->commit->object.oid);
 		init_tree_desc(&trees[1], tree->buffer, tree->size);
diff --git a/builtin/diff.c b/builtin/diff.c
index f5bbd4d757..2419de1770 100644
--- a/builtin/diff.c
+++ b/builtin/diff.c
@@ -380,7 +380,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 				add_head_to_pending(&rev);
 				if (!rev.pending.nr) {
 					struct tree *tree;
-					tree = lookup_tree(&empty_tree_oid);
+					tree = lookup_tree(current_hash->empty_tree);
 					add_pending_object(&rev, &tree->object, "HEAD");
 				}
 				break;
diff --git a/builtin/pull.c b/builtin/pull.c
index 6f772e8a22..7bb96cb5a0 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -541,7 +541,7 @@ static int pull_into_void(const struct object_id *merge_head,
 	 * index/worktree changes that the user already made on the unborn
 	 * branch.
 	 */
-	if (checkout_fast_forward(&empty_tree_oid, merge_head, 0))
+	if (checkout_fast_forward(current_hash->empty_tree, merge_head, 0))
 		return 1;
 
 	if (update_ref("initial pull", "HEAD", merge_head->hash, curr_head->hash, 0, UPDATE_REFS_DIE_ON_ERR))
diff --git a/cache.h b/cache.h
index bce57c74c4..28906468fa 100644
--- a/cache.h
+++ b/cache.h
@@ -1056,22 +1056,22 @@ extern const struct object_id empty_blob_oid;
 
 static inline int is_empty_blob_sha1(const unsigned char *sha1)
 {
-	return !hashcmp(sha1, EMPTY_BLOB_SHA1_BIN);
+	return !hashcmp(sha1, current_hash->empty_blob->hash);
 }
 
 static inline int is_empty_blob_oid(const struct object_id *oid)
 {
-	return !hashcmp(oid->hash, EMPTY_BLOB_SHA1_BIN);
+	return !oidcmp(oid, current_hash->empty_blob);
 }
 
 static inline int is_empty_tree_sha1(const unsigned char *sha1)
 {
-	return !hashcmp(sha1, EMPTY_TREE_SHA1_BIN);
+	return !hashcmp(sha1, current_hash->empty_tree->hash);
 }
 
 static inline int is_empty_tree_oid(const struct object_id *oid)
 {
-	return !hashcmp(oid->hash, EMPTY_TREE_SHA1_BIN);
+	return !oidcmp(oid, current_hash->empty_tree);
 }
 
 /* set default permissions by passing mode arguments to open(2) */
diff --git a/diff-lib.c b/diff-lib.c
index 4e0980caa8..b5d95ea5c3 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -216,7 +216,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			} else if (revs->diffopt.ita_invisible_in_index &&
 				   ce_intent_to_add(ce)) {
 				diff_addremove(&revs->diffopt, '+', ce->ce_mode,
-					       &empty_tree_oid, 0,
+					       current_hash->empty_tree, 0,
 					       ce->name, 0);
 				continue;
 			}
diff --git a/merge-recursive.c b/merge-recursive.c
index 1d3f8f0d22..4547e15b8c 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -2081,7 +2081,7 @@ int merge_recursive(struct merge_options *o,
 		/* if there is no common ancestor, use an empty tree */
 		struct tree *tree;
 
-		tree = lookup_tree(&empty_tree_oid);
+		tree = lookup_tree(current_hash->empty_tree);
 		merged_common_ancestors = make_virtual_commit(tree, "ancestor");
 	}
 
diff --git a/notes-merge.c b/notes-merge.c
index 4352c34a6e..aab27121cb 100644
--- a/notes-merge.c
+++ b/notes-merge.c
@@ -595,7 +595,7 @@ int notes_merge(struct notes_merge_options *o,
 	bases = get_merge_bases(local, remote);
 	if (!bases) {
 		base_oid = &null_oid;
-		base_tree_oid = &empty_tree_oid;
+		base_tree_oid = current_hash->empty_tree;
 		if (o->verbosity >= 4)
 			printf("No merge base found; doing history-less merge\n");
 	} else if (!bases->next) {
diff --git a/sequencer.c b/sequencer.c
index f2a10cc4f2..b910165a82 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -347,7 +347,7 @@ static int read_oneliner(struct strbuf *buf,
 
 static struct tree *empty_tree(void)
 {
-	return lookup_tree(&empty_tree_oid);
+	return lookup_tree(current_hash->empty_tree);
 }
 
 static int error_dirty_index(struct replay_opts *opts)
@@ -705,7 +705,7 @@ static int is_original_commit_empty(struct commit *commit)
 				oid_to_hex(&parent->object.oid));
 		ptree_oid = &parent->tree->object.oid;
 	} else {
-		ptree_oid = &empty_tree_oid; /* commit is root */
+		ptree_oid = current_hash->empty_tree; /* commit is root */
 	}
 
 	return !oidcmp(ptree_oid, &commit->tree->object.oid);
@@ -958,7 +958,7 @@ static int do_pick_commit(enum todo_command command, struct commit *commit,
 	} else {
 		unborn = get_oid("HEAD", &head);
 		if (unborn)
-			oidcpy(&head, &empty_tree_oid);
+			oidcpy(&head, current_hash->empty_tree);
 		if (index_differs_from(unborn ? EMPTY_TREE_SHA1_HEX : "HEAD", 0, 0))
 			return error_dirty_index(opts);
 	}
diff --git a/submodule.c b/submodule.c
index 63e7094e16..218cbf6227 100644
--- a/submodule.c
+++ b/submodule.c
@@ -587,7 +587,7 @@ void show_submodule_inline_diff(struct diff_options *o, const char *path,
 		struct object_id *one, struct object_id *two,
 		unsigned dirty_submodule)
 {
-	const struct object_id *old = &empty_tree_oid, *new = &empty_tree_oid;
+	const struct object_id *old = current_hash->empty_tree, *new = current_hash->empty_tree;
 	struct commit *left = NULL, *right = NULL;
 	struct commit_list *merge_bases = NULL;
 	struct child_process cp = CHILD_PROCESS_INIT;
-- 
2.15.0.rc2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
@ 2017-10-29  1:36   ` Eric Sunshine
  2017-10-29 17:00     ` brian m. carlson
  2017-10-30 16:14   ` Stefan Beller
  2017-10-30 23:36   ` Brandon Williams
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Sunshine @ 2017-10-29  1:36 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Git List, Jonathan Nieder, Stefan Beller, Brandon Williams

On Sat, Oct 28, 2017 at 2:12 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> Since in the future we want to support an additional hash algorithm, add
> a structure that represents a hash algorithm and all the data that must
> go along with it.  Add a constant to allow easy enumeration of hash
> algorithms.  Implement function typedefs to create an abstract API that
> can be used by any hash algorithm, and wrappers for the existing SHA1
> functions that conform to this API.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/cache.h b/cache.h
> @@ -77,6 +77,61 @@ struct object_id {
> +typedef void (*git_hash_init_fn)(void *ctx);
> +typedef void (*git_hash_update_fn)(void *ctx, const void *in, size_t len);
> +typedef void (*git_hash_final_fn)(unsigned char *hash, void *ctx);
> +
> +struct git_hash_algo {
> +       [...]
> +       /* The hash initialization function. */
> +       git_hash_init_fn init_fn;
> +
> +       /* The hash update function. */
> +       git_hash_update_fn update_fn;
> +
> +       /* The hash finalization function. */
> +       git_hash_final_fn final_fn;
> +       [...]
> +};
> diff --git a/sha1_file.c b/sha1_file.c
> @@ -39,6 +39,49 @@ const struct object_id empty_blob_oid = {
> +static inline void git_hash_sha1_init(void *ctx)
> +{
> +       git_SHA1_Init((git_SHA_CTX *)ctx);
> +}
> +
> +static inline void git_hash_sha1_update(void *ctx, const void *data, size_t len)
> +{
> +       git_SHA1_Update((git_SHA_CTX *)ctx, data, len);
> +}
> +
> +static inline void git_hash_sha1_final(unsigned char *hash, void *ctx)
> +{
> +       git_SHA1_Final(hash, (git_SHA_CTX *)ctx);
> +}

Why 'inline' if you only ever take the addresses of these functions?

> +const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
> +       [...]
> +       {
> +               "sha-1",
> +               /* "sha1", big-endian */
> +               0x73686131,
> +               sizeof(git_SHA_CTX),
> +               GIT_SHA1_RAWSZ,
> +               GIT_SHA1_HEXSZ,
> +               git_hash_sha1_init,
> +               git_hash_sha1_update,
> +               git_hash_sha1_final,
> +               &empty_tree_oid,
> +               &empty_blob_oid,
> +       },
> +};

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-28 18:12 ` [PATCH v2 3/4] Integrate hash algorithm support with repo setup brian m. carlson
@ 2017-10-29  1:44   ` Eric Sunshine
  2017-10-29 17:57     ` brian m. carlson
  2017-10-30 16:27   ` Stefan Beller
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Sunshine @ 2017-10-29  1:44 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Git List, Jonathan Nieder, Stefan Beller, Brandon Williams

On Sat, Oct 28, 2017 at 2:12 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> In future versions of Git, we plan to support an additional hash
> algorithm.  Integrate the enumeration of hash algorithms with repository
> setup, and store a pointer to the enumerated data in struct repository.
> Of course, we currently only support SHA-1, so hard-code this value in
> read_repository_format.  In the future, we'll enumerate this value from
> the configuration.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/cache.h b/cache.h
> @@ -132,6 +133,8 @@ struct git_hash_algo {
>  extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
>
> +#define current_hash the_repository->hash_algo

The all-lowercase name "current_hash" seems likely to conflict with a
variable name some day; the fact that it is also a #define makes such
a collision even more worrisome. Although it is retrieving the "hash
algorithm", when reading the terse name "current_hash", one may
instead intuitively think it is referring to a hash _value_ (not an
algorithm).

> diff --git a/repository.c b/repository.c
> @@ -64,6 +64,11 @@ void repo_set_gitdir(struct repository *repo, const char *path)
> +void repo_set_hash_algo(struct repository *repo, int hash_algo)
> +{
> +       repo->hash_algo = &hash_algos[hash_algo];
> +}
> +
>  /*
>   * Attempt to resolve and set the provided 'gitdir' for repository 'repo'.
>   * Return 0 upon success and a non-zero value upon failure.
> @@ -136,6 +141,8 @@ int repo_init(struct repository *repo, const char *gitdir, const char *worktree)
>         if (read_and_verify_repository_format(&format, repo->commondir))
>                 goto error;
>
> +       repo->hash_algo = &hash_algos[format.hash_algo];

Should this be instead invoking repo_set_hash_algo()?

>         if (worktree)
>                 repo_set_worktree(repo, worktree);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-29  1:36   ` Eric Sunshine
@ 2017-10-29 17:00     ` brian m. carlson
  0 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-29 17:00 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Jonathan Nieder, Stefan Beller, Brandon Williams

[-- Attachment #1: Type: text/plain, Size: 1223 bytes --]

On Sat, Oct 28, 2017 at 09:36:05PM -0400, Eric Sunshine wrote:
> On Sat, Oct 28, 2017 at 2:12 PM, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > diff --git a/sha1_file.c b/sha1_file.c
> > @@ -39,6 +39,49 @@ const struct object_id empty_blob_oid = {
> > +static inline void git_hash_sha1_init(void *ctx)
> > +{
> > +       git_SHA1_Init((git_SHA_CTX *)ctx);
> > +}
> > +
> > +static inline void git_hash_sha1_update(void *ctx, const void *data, size_t len)
> > +{
> > +       git_SHA1_Update((git_SHA_CTX *)ctx, data, len);
> > +}
> > +
> > +static inline void git_hash_sha1_final(unsigned char *hash, void *ctx)
> > +{
> > +       git_SHA1_Final(hash, (git_SHA_CTX *)ctx);
> > +}
> 
> Why 'inline' if you only ever take the addresses of these functions?

Good point.  I wanted to avoid the overhead of a needless intermediate
function call when using this code path and have the compiler
essentially emit a call to the underlying functions directly, but inline
probably won't affect whether that happens.

I'll remove them.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-29  1:44   ` Eric Sunshine
@ 2017-10-29 17:57     ` brian m. carlson
  2017-10-29 19:02       ` Eric Sunshine
  0 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2017-10-29 17:57 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Jonathan Nieder, Stefan Beller, Brandon Williams

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Sat, Oct 28, 2017 at 09:44:07PM -0400, Eric Sunshine wrote:
> > diff --git a/cache.h b/cache.h
> > @@ -132,6 +133,8 @@ struct git_hash_algo {
> >  extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
> >
> > +#define current_hash the_repository->hash_algo
> 
> The all-lowercase name "current_hash" seems likely to conflict with a
> variable name some day; the fact that it is also a #define makes such
> a collision even more worrisome. Although it is retrieving the "hash
> algorithm", when reading the terse name "current_hash", one may
> instead intuitively think it is referring to a hash _value_ (not an
> algorithm).

I can do CURRENT_HASH_ALGO or CURRENT_HASH instead if you think that's
an improvement.  I originally omitted the "algo" portion to keep it
short.

Alternatively, we could have a current_hash() (or current_hash_algo())
inline function if people like that better.

> >   * Attempt to resolve and set the provided 'gitdir' for repository 'repo'.
> >   * Return 0 upon success and a non-zero value upon failure.
> > @@ -136,6 +141,8 @@ int repo_init(struct repository *repo, const char *gitdir, const char *worktree)
> >         if (read_and_verify_repository_format(&format, repo->commondir))
> >                 goto error;
> >
> > +       repo->hash_algo = &hash_algos[format.hash_algo];
> 
> Should this be instead invoking repo_set_hash_algo()?

Yes, it should.  Will fix.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-29 17:57     ` brian m. carlson
@ 2017-10-29 19:02       ` Eric Sunshine
  2017-10-29 19:33         ` brian m. carlson
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sunshine @ 2017-10-29 19:02 UTC (permalink / raw)
  To: brian m. carlson, Eric Sunshine, Git List, Jonathan Nieder,
	Stefan Beller, Brandon Williams

On Sun, Oct 29, 2017 at 1:57 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Sat, Oct 28, 2017 at 09:44:07PM -0400, Eric Sunshine wrote:
>> > +#define current_hash the_repository->hash_algo
>>
>> The all-lowercase name "current_hash" seems likely to conflict with a
>> variable name some day; the fact that it is also a #define makes such
>> a collision even more worrisome. Although it is retrieving the "hash
>> algorithm", when reading the terse name "current_hash", one may
>> instead intuitively think it is referring to a hash _value_ (not an
>> algorithm).
>
> I can do CURRENT_HASH_ALGO or CURRENT_HASH instead if you think that's
> an improvement.  I originally omitted the "algo" portion to keep it
> short.

I don't have strong feelings about it aside from worrying about a
"current_hash" name clash or a reader misunderstanding what it
represents.

Does "current" need to be in the name? What about HASH_ALGO or REPO_HASH_ALGO?

> Alternatively, we could have a current_hash() (or current_hash_algo())
> inline function if people like that better.

hash_algo() or repo_hash_algo()?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-29 19:02       ` Eric Sunshine
@ 2017-10-29 19:33         ` brian m. carlson
  2017-10-30  2:13           ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2017-10-29 19:33 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Jonathan Nieder, Stefan Beller, Brandon Williams

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

On Sun, Oct 29, 2017 at 03:02:20PM -0400, Eric Sunshine wrote:
> On Sun, Oct 29, 2017 at 1:57 PM, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > I can do CURRENT_HASH_ALGO or CURRENT_HASH instead if you think that's
> > an improvement.  I originally omitted the "algo" portion to keep it
> > short.
> 
> I don't have strong feelings about it aside from worrying about a
> "current_hash" name clash or a reader misunderstanding what it
> represents.
> 
> Does "current" need to be in the name? What about HASH_ALGO or REPO_HASH_ALGO?
> 
> > Alternatively, we could have a current_hash() (or current_hash_algo())
> > inline function if people like that better.
> 
> hash_algo() or repo_hash_algo()?

Those are also fine, and shorter to boot.  I'll wait to see if anyone
has strong opinions on the direction we should go before making a
change.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-29 19:33         ` brian m. carlson
@ 2017-10-30  2:13           ` Junio C Hamano
  2017-10-30  2:54             ` brian m. carlson
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2017-10-30  2:13 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Eric Sunshine, Git List, Jonathan Nieder, Stefan Beller,
	Brandon Williams

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On Sun, Oct 29, 2017 at 03:02:20PM -0400, Eric Sunshine wrote:
>> On Sun, Oct 29, 2017 at 1:57 PM, brian m. carlson
>> <sandals@crustytoothpaste.net> wrote:
>> > I can do CURRENT_HASH_ALGO or CURRENT_HASH instead if you think that's
>> > an improvement.  I originally omitted the "algo" portion to keep it
>> > short.
>> 
>> I don't have strong feelings about it aside from worrying about a
>> "current_hash" name clash or a reader misunderstanding what it
>> represents.
>> 
>> Does "current" need to be in the name? What about HASH_ALGO or REPO_HASH_ALGO?
>> 
>> > Alternatively, we could have a current_hash() (or current_hash_algo())
>> > inline function if people like that better.
>> 
>> hash_algo() or repo_hash_algo()?
>
> Those are also fine, and shorter to boot.  I'll wait to see if anyone
> has strong opinions on the direction we should go before making a
> change.

Is the plan to allow running with multiple hash algorithms in
parallel?  I thought what we want to see in the future codebase was
to have the default hash algorithm used for everything except for a
select few codepaths, and assumed that the way we achieve it is to

 - allow very low level helper functions (e.g. read_sha1_file(),
   write_sha1_file_prepare(), etc.) to take a pointer to the hash
   algorithm structure;

 - have higher level helper functions to call these low level
   helpers with a fixed singleton instance of hash algorithm
   structure that represents that default one (SHA2-something?); and

 - a few selected codepaths (e.g. index-pack that reads SHA-1 stream
   and converts it into NewHash pack/index while building the object
   name mapping) use an additional singleton instance of hash
   algorithm structure that represents the SHA-1 hash, and the way
   they use it is *NOT* by replacing "the current" one with SHA-1,
   but by explicitly passing the instance to the low level helpers
   as parameter.

So, "current" does indeed sound quite wrong, as it makes it sound as
if you can swap it anytime and ask "which one is in effect now?".
If we do not want to call the default instance "SHA-2" because we
want to prepare for migrating again by not having the name of a
concrete hash algorithm sprinkled in the codebase all over like we
currently do, why not call the instance "the_hash_algo"?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-30  2:13           ` Junio C Hamano
@ 2017-10-30  2:54             ` brian m. carlson
  0 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2017-10-30  2:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Eric Sunshine, Git List, Jonathan Nieder, Stefan Beller,
	Brandon Williams

[-- Attachment #1: Type: text/plain, Size: 2132 bytes --]

On Mon, Oct 30, 2017 at 11:13:41AM +0900, Junio C Hamano wrote:
> Is the plan to allow running with multiple hash algorithms in
> parallel?  I thought what we want to see in the future codebase was
> to have the default hash algorithm used for everything except for a
> select few codepaths, and assumed that the way we achieve it is to
> 
>  - allow very low level helper functions (e.g. read_sha1_file(),
>    write_sha1_file_prepare(), etc.) to take a pointer to the hash
>    algorithm structure;
> 
>  - have higher level helper functions to call these low level
>    helpers with a fixed singleton instance of hash algorithm
>    structure that represents that default one (SHA2-something?); and
> 
>  - a few selected codepaths (e.g. index-pack that reads SHA-1 stream
>    and converts it into NewHash pack/index while building the object
>    name mapping) use an additional singleton instance of hash
>    algorithm structure that represents the SHA-1 hash, and the way
>    they use it is *NOT* by replacing "the current" one with SHA-1,
>    but by explicitly passing the instance to the low level helpers
>    as parameter.

This is consistent with my understanding as well.

> So, "current" does indeed sound quite wrong, as it makes it sound as
> if you can swap it anytime and ask "which one is in effect now?".
> If we do not want to call the default instance "SHA-2" because we
> want to prepare for migrating again by not having the name of a
> concrete hash algorithm sprinkled in the codebase all over like we
> currently do, why not call the instance "the_hash_algo"?

My original plan was to handle different algorithms in different
submodules, but I think with the current transition plan, that becomes
unnecessary, since we're going to paper over any differences that might
show up.  the_hash_algo should be fine, and it's consistent with
the_repository and the_index as well.

Thanks for a sensible suggestion.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] setup: expose enumerated repo info
  2017-10-28 18:12 ` [PATCH v2 1/4] setup: expose enumerated repo info brian m. carlson
@ 2017-10-30 16:08   ` Stefan Beller
  0 siblings, 0 replies; 19+ messages in thread
From: Stefan Beller @ 2017-10-30 16:08 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jonathan Nieder, Brandon Williams

On Sat, Oct 28, 2017 at 11:12 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> We enumerate several different items as part of struct
> repository_format, but then actually set up those values using the
> global variables we've initialized from them.  Instead, let's pass a
> pointer to the structure down to the code where we enumerate these
> values, so we can later on use those values directly to perform setup.
>
> This technique makes it easier for us to determine additional items
> about the repository format (such as the hash algorithm) and then use
> them for setup later on, without needing to add additional global
> variables.  We can't avoid using the existing global variables since
> they're intricately intertwined with how things work at the moment, but
> this improves things for the future.

yup. looks good to me.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
  2017-10-29  1:36   ` Eric Sunshine
@ 2017-10-30 16:14   ` Stefan Beller
  2017-10-30 23:36   ` Brandon Williams
  2 siblings, 0 replies; 19+ messages in thread
From: Stefan Beller @ 2017-10-30 16:14 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jonathan Nieder, Brandon Williams

On Sat, Oct 28, 2017 at 11:12 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> Since in the future we want to support an additional hash algorithm, add
> a structure that represents a hash algorithm and all the data that must
> go along with it.  Add a constant to allow easy enumeration of hash
> algorithms.  Implement function typedefs to create an abstract API that
> can be used by any hash algorithm, and wrappers for the existing SHA1
> functions that conform to this API.
>
> Expose a value for hex size as well as binary size.  While one will
> always be twice the other, the two values are both used extremely
> commonly throughout the codebase and providing both leads to improved
> readability.
>
> Don't include an entry in the hash algorithm structure for the null
> object ID.  As this value is all zeros, any suitably sized all-zero
> object ID can be used, and there's no need to store a given one on a
> per-hash basis.
>
> The current hash function transition plan envisions a time when we will
> accept input from the user that might be in SHA-1 or in the NewHash
> format.  Since we cannot know which the user has provided, add a
> constant representing the unknown algorithm to allow us to indicate that
> we must look the correct value up.

Cool.


> +
> +const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
> +       {
> +               NULL,
> +               0x00000000,
> +               0,
> +               0,
> +               0,
> +               NULL,
> +               NULL,
> +               NULL,
> +               NULL,
> +               NULL,

If we are fancy we could provide an appropriate die() call
as the function pointers. That way if you call these functions
by accident, you get a well worded warning instead of a segfault.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] Integrate hash algorithm support with repo setup
  2017-10-28 18:12 ` [PATCH v2 3/4] Integrate hash algorithm support with repo setup brian m. carlson
  2017-10-29  1:44   ` Eric Sunshine
@ 2017-10-30 16:27   ` Stefan Beller
  1 sibling, 0 replies; 19+ messages in thread
From: Stefan Beller @ 2017-10-30 16:27 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jonathan Nieder, Brandon Williams

On Sat, Oct 28, 2017 at 11:12 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> Include repository.h in cache.h since we now need to have access to
> these struct and variable definitions.

Let's see how that works out. I remember having include issues
in the repository struct series.

>  };
>  extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
>
> +#define current_hash the_repository->hash_algo
> +
>  #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
>  #define DTYPE(de)      ((de)->d_type)
>  #else
> @@ -920,6 +923,7 @@ struct repository_format {
>         int version;
>         int precious_objects;
>         int is_bare;
> +       int hash_algo;

I wonder if this (as well as the #defines in patch 2),
ought to be an enum { unknown-hash=0, sha1=1}.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/4] Hash Abstraction
  2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
                   ` (3 preceding siblings ...)
  2017-10-28 18:12 ` [PATCH v2 4/4] Switch empty tree and blob lookups to use hash abstraction brian m. carlson
@ 2017-10-30 16:45 ` Stefan Beller
  4 siblings, 0 replies; 19+ messages in thread
From: Stefan Beller @ 2017-10-30 16:45 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jonathan Nieder, Brandon Williams

On Sat, Oct 28, 2017 at 11:12 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> This is a series proposing a basic abstraction for hash functions.
>
> As we get closer to converting the remainder of the codebase to use
> struct object_id, we should think about the design we want our hash
> function abstraction to take.  This series is a proposal for one idea.
> Input on any aspect of this proposal is welcome.

I like the series/idea.

> This series exposes a struct git_hash_algo that contains basic
> information about a given hash algorithm that distinguishes it from
> other algorithms: name, identifiers, lengths, implementing functions,
> and empty tree and blob constants.  It also exposes an array of hash
> algorithms, and a constant for indexing them.
>
> The series also demonstrates a simple conversion using the abstraction
> over empty blob and tree values.

That looks good, though I wonder if we'll have to give a way a little
performance during the transition period (or even indefinitely, as the
hash abstraction won't go away; we'd rather want to keep it in place
long term). I guess we can measure after all is said and done,
no big deal.

>
> In order to avoid conflicting with the struct repository work and with
> the goal of avoiding global variables as much as possible, I've pushed
> the hash algorithm into struct repository and exposed it via a #define.

If you are referring to that long series[1] that Jonathan and I were working
on, then no worries.  Unfortunately that series got some delays, but I rebased
its prepared successors (of over 100 patches) on Friday and it was surprisingly
easy to do so; just tedious.

[1] https://public-inbox.org/git/20170830064634.GA153983@aiede.mtv.corp.google.com/

>
> I propose this series now as it will inform the way we go about
> converting other parts of the codebase, especially some of the pack
> algorithms.

Thanks for doing this now.

> Because we share some hash computation code between pack
> checksums and object hashing, we need to decide whether to expose pack
> checksums as struct object_id, even though they are technically not
> object IDs.

I think we used sha1 as checksums, because it was a hash function easily
accessible, not because its other properties were the best to do its job.
So I am undecided if we want to just "keep the same hash function for
checksum as well as object hashing" or pickup another checksum, that
actually *only* does checksumming (we don't need the cryptographic
properties, such that an easy hash [such as crc, djb, fnv or murmur]
would do; raw speed, barely protecting the pack file against bit flips).

For the sake of symmetry I tend to go with the former, i.e use the object
hash function for pack files, too.

>  Furthermore, if we end up needing to stuff an algorithm
> value into struct object_id, we'll no longer be able to directly
> reference object IDs in a pack without a copy.
>
> I've updated this series in some significant ways to reflect and better
> implement the transition plan as it's developed.  If there are ways
> in which this series (or future series) can converge better on the
> transition plan, that input would be valuable.
>
> This series is available from the usual places as branch hash-struct,
> based against master as of 2.15-rc2.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
  2017-10-29  1:36   ` Eric Sunshine
  2017-10-30 16:14   ` Stefan Beller
@ 2017-10-30 23:36   ` Brandon Williams
  2017-11-01  1:35     ` brian m. carlson
  2 siblings, 1 reply; 19+ messages in thread
From: Brandon Williams @ 2017-10-30 23:36 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jonathan Nieder, Stefan Beller

On 10/28, brian m. carlson wrote:
> Since in the future we want to support an additional hash algorithm, add
> a structure that represents a hash algorithm and all the data that must
> go along with it.  Add a constant to allow easy enumeration of hash
> algorithms.  Implement function typedefs to create an abstract API that
> can be used by any hash algorithm, and wrappers for the existing SHA1
> functions that conform to this API.
> 
> Expose a value for hex size as well as binary size.  While one will
> always be twice the other, the two values are both used extremely
> commonly throughout the codebase and providing both leads to improved
> readability.
> 
> Don't include an entry in the hash algorithm structure for the null
> object ID.  As this value is all zeros, any suitably sized all-zero
> object ID can be used, and there's no need to store a given one on a
> per-hash basis.
> 
> The current hash function transition plan envisions a time when we will
> accept input from the user that might be in SHA-1 or in the NewHash
> format.  Since we cannot know which the user has provided, add a
> constant representing the unknown algorithm to allow us to indicate that
> we must look the correct value up.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> I believe I did get the format_id constant for SHA-1 right, but
> sanity-checking would be appreciated.  We don't want another Referer
> mess.
> 
>  cache.h     | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  sha1_file.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/cache.h b/cache.h
> index 6440e2bf21..9e9eb08f05 100644
> --- a/cache.h
> +++ b/cache.h

Maybe it would be good to place this interface in its own header file
that way we can avoid cluttering cache.h with more stuff?

> @@ -77,6 +77,61 @@ struct object_id {
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  };
>  
> +/*
> + * Note that these constants are suitable for indexing the hash_algos array and
> + * comparing against each other, but are otherwise arbitrary, so they should not
> + * be exposed to the user or serialized to disk.  To know whether a
> + * git_hash_algo struct points to some usable hash function, test the format_id
> + * field for being non-zero.  Use the name field for user-visible situations and
> + * the format_id field for fixed-length fields on disk.
> + */
> +/* An unknown hash function. */
> +#define GIT_HASH_UNKNOWN 0
> +/* SHA-1 */
> +#define GIT_HASH_SHA1 1
> +/* Number of algorithms supported (including unknown). */
> +#define GIT_HASH_NALGOS (GIT_HASH_SHA1 + 1)
> +
> +typedef void (*git_hash_init_fn)(void *ctx);
> +typedef void (*git_hash_update_fn)(void *ctx, const void *in, size_t len);
> +typedef void (*git_hash_final_fn)(unsigned char *hash, void *ctx);
> +
> +struct git_hash_algo {
> +	/*
> +	 * The name of the algorithm, as appears in the config file and in
> +	 * messages.
> +	 */
> +	const char *name;
> +
> +	/* A four-byte version identifier, used in pack indices. */
> +	uint32_t format_id;
> +
> +	/* The size of a hash context (e.g. git_SHA_CTX). */
> +	size_t ctxsz;
> +
> +	/* The length of the hash in binary. */
> +	size_t rawsz;
> +
> +	/* The length of the hash in hex characters. */
> +	size_t hexsz;
> +
> +	/* The hash initialization function. */
> +	git_hash_init_fn init_fn;
> +
> +	/* The hash update function. */
> +	git_hash_update_fn update_fn;
> +
> +	/* The hash finalization function. */
> +	git_hash_final_fn final_fn;
> +
> +	/* The OID of the empty tree. */
> +	const struct object_id *empty_tree;
> +
> +	/* The OID of the empty blob. */
> +	const struct object_id *empty_blob;
> +};
> +extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
> +
>  #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
>  #define DTYPE(de)	((de)->d_type)
>  #else
> diff --git a/sha1_file.c b/sha1_file.c
> index 10c3a0083d..77b320126a 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -39,6 +39,49 @@ const struct object_id empty_blob_oid = {
>  	EMPTY_BLOB_SHA1_BIN_LITERAL
>  };
>  
> +static inline void git_hash_sha1_init(void *ctx)
> +{
> +	git_SHA1_Init((git_SHA_CTX *)ctx);
> +}
> +
> +static inline void git_hash_sha1_update(void *ctx, const void *data, size_t len)
> +{
> +	git_SHA1_Update((git_SHA_CTX *)ctx, data, len);
> +}
> +
> +static inline void git_hash_sha1_final(unsigned char *hash, void *ctx)
> +{
> +	git_SHA1_Final(hash, (git_SHA_CTX *)ctx);
> +}
> +
> +const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
> +	{
> +		NULL,
> +		0x00000000,
> +		0,
> +		0,
> +		0,
> +		NULL,
> +		NULL,
> +		NULL,
> +		NULL,
> +		NULL,
> +	},
> +	{
> +		"sha-1",
> +		/* "sha1", big-endian */
> +		0x73686131,
> +		sizeof(git_SHA_CTX),
> +		GIT_SHA1_RAWSZ,
> +		GIT_SHA1_HEXSZ,
> +		git_hash_sha1_init,
> +		git_hash_sha1_update,
> +		git_hash_sha1_final,
> +		&empty_tree_oid,
> +		&empty_blob_oid,
> +	},
> +};
> +
>  /*
>   * This is meant to hold a *small* number of objects that you would
>   * want read_sha1_file() to be able to return, but yet you do not want
> -- 
> 2.15.0.rc2
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] Add structure representing hash algorithm
  2017-10-30 23:36   ` Brandon Williams
@ 2017-11-01  1:35     ` brian m. carlson
  0 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2017-11-01  1:35 UTC (permalink / raw)
  To: Brandon Williams; +Cc: git, Jonathan Nieder, Stefan Beller

[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]

On Mon, Oct 30, 2017 at 04:36:15PM -0700, Brandon Williams wrote:
> On 10/28, brian m. carlson wrote:
> > Since in the future we want to support an additional hash algorithm, add
> > a structure that represents a hash algorithm and all the data that must
> > go along with it.  Add a constant to allow easy enumeration of hash
> > algorithms.  Implement function typedefs to create an abstract API that
> > can be used by any hash algorithm, and wrappers for the existing SHA1
> > functions that conform to this API.
> > 
> > Expose a value for hex size as well as binary size.  While one will
> > always be twice the other, the two values are both used extremely
> > commonly throughout the codebase and providing both leads to improved
> > readability.
> > 
> > Don't include an entry in the hash algorithm structure for the null
> > object ID.  As this value is all zeros, any suitably sized all-zero
> > object ID can be used, and there's no need to store a given one on a
> > per-hash basis.
> > 
> > The current hash function transition plan envisions a time when we will
> > accept input from the user that might be in SHA-1 or in the NewHash
> > format.  Since we cannot know which the user has provided, add a
> > constant representing the unknown algorithm to allow us to indicate that
> > we must look the correct value up.
> > 
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> > I believe I did get the format_id constant for SHA-1 right, but
> > sanity-checking would be appreciated.  We don't want another Referer
> > mess.
> > 
> >  cache.h     | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  sha1_file.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 98 insertions(+)
> > 
> > diff --git a/cache.h b/cache.h
> > index 6440e2bf21..9e9eb08f05 100644
> > --- a/cache.h
> > +++ b/cache.h
> 
> Maybe it would be good to place this interface in its own header file
> that way we can avoid cluttering cache.h with more stuff?

Sure, if you like.  It will end up needing to be included in cache.h
because of the blob and tree code, but I'm happy to split it out.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-11-01  1:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-28 18:12 [PATCH v2 0/4] Hash Abstraction brian m. carlson
2017-10-28 18:12 ` [PATCH v2 1/4] setup: expose enumerated repo info brian m. carlson
2017-10-30 16:08   ` Stefan Beller
2017-10-28 18:12 ` [PATCH v2 2/4] Add structure representing hash algorithm brian m. carlson
2017-10-29  1:36   ` Eric Sunshine
2017-10-29 17:00     ` brian m. carlson
2017-10-30 16:14   ` Stefan Beller
2017-10-30 23:36   ` Brandon Williams
2017-11-01  1:35     ` brian m. carlson
2017-10-28 18:12 ` [PATCH v2 3/4] Integrate hash algorithm support with repo setup brian m. carlson
2017-10-29  1:44   ` Eric Sunshine
2017-10-29 17:57     ` brian m. carlson
2017-10-29 19:02       ` Eric Sunshine
2017-10-29 19:33         ` brian m. carlson
2017-10-30  2:13           ` Junio C Hamano
2017-10-30  2:54             ` brian m. carlson
2017-10-30 16:27   ` Stefan Beller
2017-10-28 18:12 ` [PATCH v2 4/4] Switch empty tree and blob lookups to use hash abstraction brian m. carlson
2017-10-30 16:45 ` [PATCH v2 0/4] Hash Abstraction Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).