git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
@ 2020-11-26 19:39 Han-Wen Nienhuys via GitGitGadget
  2020-12-01  4:59 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Han-Wen Nienhuys via GitGitGadget @ 2020-11-26 19:39 UTC (permalink / raw)
  To: git; +Cc: Han-Wen Nienhuys, Han-Wen Nienhuys

From: Han-Wen Nienhuys <hanwen@google.com>

The reftable backend needs to know the hash algorithm for writing the
initialization hash table.

The initial reftable contains a symref HEAD => "main" (or "master"), which is
agnostic to the size of hash value, but this is an exceptional circumstance, and
the reftable library does not cater for this exception. It insists that all
tables in the stack have a consistent format ID for the hash algorithm.

Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
---
    init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
    
    The reftable backend needs to know the hash algorithm for writing the
    initialization hash table.
    
    The initial reftable contains a symref HEAD => "main" (or "master"),
    which is agnostic to the size of hash value, but this is an exceptional
    circumstance, and the reftable library does not cater for this
    exception. It insists that all tables in the stack have a consistent
    format ID for the hash algorithm.
    
    Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH.
    
    Signed-off-by: Han-Wen Nienhuys hanwen@google.com [hanwen@google.com]

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-924%2Fhanwen%2Fset-hash-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-924/hanwen/set-hash-v1
Pull-Request: https://github.com/git/git/pull/924

 builtin/init-db.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/builtin/init-db.c b/builtin/init-db.c
index 01bc648d41..5c8c67fec6 100644
--- a/builtin/init-db.c
+++ b/builtin/init-db.c
@@ -391,6 +391,7 @@ static void validate_hash_algorithm(struct repository_format *repo_fmt, int hash
 			die(_("unknown hash algorithm '%s'"), env);
 		repo_fmt->hash_algo = env_algo;
 	}
+	repo_set_hash_algo(the_repository, repo_fmt->hash_algo);
 }
 
 int init_db(const char *git_dir, const char *real_git_dir,

base-commit: faefdd61ec7c7f6f3c8c9907891465ac9a2a1475
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
  2020-11-26 19:39 [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH Han-Wen Nienhuys via GitGitGadget
@ 2020-12-01  4:59 ` Junio C Hamano
  2020-12-03 20:23   ` Han-Wen Nienhuys
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-12-01  4:59 UTC (permalink / raw)
  To: Han-Wen Nienhuys via GitGitGadget; +Cc: git, Han-Wen Nienhuys, Han-Wen Nienhuys

"Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Han-Wen Nienhuys <hanwen@google.com>
>
> The reftable backend needs to know the hash algorithm for writing the
> initialization hash table.
>
> The initial reftable contains a symref HEAD => "main" (or "master"), which is
> agnostic to the size of hash value, but this is an exceptional circumstance, and
> the reftable library does not cater for this exception. It insists that all
> tables in the stack have a consistent format ID for the hash algorithm.
>
> Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH.

Seeing that there is no mention of the_repository in the entire
init-db.c file, that the same information is in repo_fmt which is
passed to the init-db.c::create_default_files() function, and that
create_default_files() is where the initial set of refs is prepared
in the current system, it is not clear why this patch is needed
(i.e. why we consider the current code that has no mention of
the_repository is wrong).  Isn't it the matter of passing the hash
taken from repo_fmt to the refs API to initialize that part of the
repository, instead of relying on half-initialized state in
the_repository?  It's not like only the hash_algo member is yet to
be prepared in the_repository instance at that point in the code.
Most of the members are, except for a very few fields initialized by
initialize_the_repository(), not filled in the codepath, no?

So, this might have been the most convenient way to pass hash_algo
down but the patch does not convince me that it is the best way.  It
may be, but it does not answer questions like "what makes us sure
that hash-algo will stay to be the only thing we need to fill early
in this codepath?"

If the patch _were_ to fill many other members of the_repository to
make us pretend as if we did the setup_git_dir() in the repository
we just created, and the hash_algo is filled merely as part of it,
then it would have been much more convincing that we are moving in
the right direction.  It would for example mean gitdir/commondir
would be filled among other members, and the existing "mkdir()" to
manually prepare "refs/" for files backend in init-db.c can probably
be removed and instead would be done by calling a new entry-point
that initializes the initial set of refs in the refs API.  And that
entry-point would either rely on the_repository or take a repository
instance, and use hash_algo member of that structure to do its thing.

So, I dunno.

> diff --git a/builtin/init-db.c b/builtin/init-db.c
> index 01bc648d41..5c8c67fec6 100644
> --- a/builtin/init-db.c
> +++ b/builtin/init-db.c
> @@ -391,6 +391,7 @@ static void validate_hash_algorithm(struct repository_format *repo_fmt, int hash
>  			die(_("unknown hash algorithm '%s'"), env);
>  		repo_fmt->hash_algo = env_algo;
>  	}
> +	repo_set_hash_algo(the_repository, repo_fmt->hash_algo);
>  }
>  
>  int init_db(const char *git_dir, const char *real_git_dir,
>
> base-commit: faefdd61ec7c7f6f3c8c9907891465ac9a2a1475

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
  2020-12-01  4:59 ` Junio C Hamano
@ 2020-12-03 20:23   ` Han-Wen Nienhuys
  2020-12-03 23:25     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Han-Wen Nienhuys @ 2020-12-03 20:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys

On Tue, Dec 1, 2020 at 6:00 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Han-Wen Nienhuys <hanwen@google.com>
> >
> > The reftable backend needs to know the hash algorithm for writing the
> > initialization hash table.
> >
> > The initial reftable contains a symref HEAD => "main" (or "master"), which is
> > agnostic to the size of hash value, but this is an exceptional circumstance, and
> > the reftable library does not cater for this exception. It insists that all
> > tables in the stack have a consistent format ID for the hash algorithm.
> >
> > Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH.
>
> Seeing that there is no mention of the_repository in the entire
> init-db.c file,

it's not because it's hidden in the call to refs_init_db().
refs_init_db accesses the_repository global variable, because
get_main_ref_store() takes a repository (and not repository_format)
argument.

> Isn't it the matter of passing the hash
> taken from repo_fmt to the refs API to initialize that part of the
> repository, instead of relying on half-initialized state in
> the_repository?

The refstore is created in get_main_ref_store(). I can't add a
repository_format argument there, because it has a lot of call sites
where the repository_format isn't available.

If the_repository is only half-initialized at this point in init_db(),
then why are we passing it in refs_init_db() just a couple of lines
further? At what point the_repository considered initialized?

> It's not like only the hash_algo member is yet to
> be prepared in the_repository instance at that point in the code.
> Most of the members are, except for a very few fields initialized by
> initialize_the_repository(), not filled in the codepath, no?
>
> So, this might have been the most convenient way to pass hash_algo
> down but the patch does not convince me that it is the best way.

I'm a bit at a loss here; I never learned how to cleanly work with so
many global variables, so I'm happy to take your suggestion.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
  2020-12-03 20:23   ` Han-Wen Nienhuys
@ 2020-12-03 23:25     ` Junio C Hamano
  2020-12-07 11:52       ` Han-Wen Nienhuys
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-12-03 23:25 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys

Han-Wen Nienhuys <hanwen@google.com> writes:

> If the_repository is only half-initialized at this point in init_db(),
> then why are we passing it in refs_init_db() just a couple of lines
> further? At what point the_repository considered initialized?

I would have to say it probably depends on what callees expect.  The
current implementation of refs_init_db() for files backend may not
need anything other than the hash algorithm enum, but many other
fields are missing, and they should ideally be populated, no?  

For example, I see files_ref_store_create() cheats by calling
get_common_dir_noenv() to find out where the commondir is, instead
of ever asking the repository the ref store belongs to. At least,
get_main_ref_store() is told to get the ref store that belongs to
the_repository, and it would be the right place to learn relevant
pieces of information (for that matter, I am not sure why struct
ref_store does not have a pointer to a repository structure; perhaps
we are seeing the result of piecemeal evolution, not a designed
structure?).

> I'm a bit at a loss here; I never learned how to cleanly work with so
> many global variables, so I'm happy to take your suggestion.

I am only interested in giving a clear direction to future
developers where to populate the_repository's members (and nothing
else) if their enhancement needs members other than the hash
algorithm to be populated, as if it were the_repository initialized
in an already working repository (I am not talking about many global
variables, whichever you are referring to).

One way to do so would probably be to do something like the
attached.

The patch that started this thread (or the equivalent one in the
updated reftable series) may want to initialize a bit more members
while at it (looking at how commondir from the_repository is not
used by the refs/files-backend.c::files_init_db() to decide where
the function creates the refs/heads and refs/tags directories, we
probably would need to populate the_repository->commondir before
that codepath can be fixed to look at the member, for example).


 builtin/init-db.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git c/builtin/init-db.c w/builtin/init-db.c
index dcc45bef51..2b5c94596f 100644
--- c/builtin/init-db.c
+++ w/builtin/init-db.c
@@ -438,6 +438,27 @@ int init_db(const char *git_dir, const char *real_git_dir,
 
 	validate_hash_algorithm(&repo_fmt, hash);
 
+	/*
+	 * At this point, the_repository we have in-core does not look
+	 * anything like one that we would see initialized in an already
+	 * working repository after calling setup_git_directory().
+	 *
+	 * Calling repository.c::initialize_the_repository() may have
+	 * prepared the .index .objects and .parsed_objects members, but
+	 * other members like .gitdir, .commondir, etc. have not been
+	 * initialized.
+	 *
+	 * Many API functions assume they are working with the_repository
+	 * that has sensibly been initialized, but because we haven't
+	 * really read from an existing repository, we need to hand-craft
+	 * the necessary members of the structure to get out of this
+	 * chicken-and-egg situation.
+	 *
+	 * For now, we update the hash algorithm member to what the
+	 * validate_hash_algorithm() call decided for us.
+	 */
+	repo_set_hash_algo(the_repository, repo_fmt->hash_algo);
+
 	reinit = create_default_files(template_dir, original_git_dir,
 				      initial_branch, &repo_fmt,
 				      flags & INIT_DB_QUIET);



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
  2020-12-03 23:25     ` Junio C Hamano
@ 2020-12-07 11:52       ` Han-Wen Nienhuys
  2020-12-07 19:31         ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Han-Wen Nienhuys @ 2020-12-07 11:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys

On Fri, Dec 4, 2020 at 12:25 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Han-Wen Nienhuys <hanwen@google.com> writes:
>
> > If the_repository is only half-initialized at this point in init_db(),
> > then why are we passing it in refs_init_db() just a couple of lines
> > further? At what point the_repository considered initialized?
>
> I would have to say it probably depends on what callees expect.  The
> current implementation of refs_init_db() for files backend may not
> need anything other than the hash algorithm enum, but many other
> fields are missing, and they should ideally be populated, no?

I think so too, but it looks like a major refactoring by itself, which
is only very tangentially related to the reftable effort, so I'd like
to punt on it.

> For example, I see files_ref_store_create() cheats by calling
> get_common_dir_noenv() to find out where the commondir is, instead
> of ever asking the repository the ref store belongs to. At least,
> get_main_ref_store() is told to get the ref store that belongs to
> the_repository, and it would be the right place to learn relevant
> pieces of information (for that matter, I am not sure why struct
> ref_store does not have a pointer to a repository structure; perhaps
> we are seeing the result of piecemeal evolution, not a designed
> structure?).
>
> > I'm a bit at a loss here; I never learned how to cleanly work with so
> > many global variables, so I'm happy to take your suggestion.
>
> I am only interested in giving a clear direction to future
> developers where to populate the_repository's members (and nothing
> else) if their enhancement needs members other than the hash
> algorithm to be populated, as if it were the_repository initialized
> in an already working repository (I am not talking about many global
> variables, whichever you are referring to).

You said earlier that maybe ref_store should hold a link to 'struct
repository'. However, "struct repository" doesn't have a documented
purpose (there is literally no comment documenting its purpose), so
it's hard to tell upfront how to correctly configure the relation
between "struct ref_store" and "struct repository". Currently, "struct
repository" has a pointer to 'struct ref_store", which makes it
suspect for there to be a pointer in the other direction as well.

The reason I bring up global variables is that without them, the code
would enforce a much more clear relation between different entities.

Looking at the definition of repository, there seems to be some
overlap in functionality between struct repository, struct
repository_format, struct repo_settings, and config_set, and none of
these structs have a clearly documented role.   Before we change code,
it's probably better to spell out what these structs are thought to
be.

> One way to do so would probably be to do something like the
> attached.

Thanks, I've amended my patch.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH
  2020-12-07 11:52       ` Han-Wen Nienhuys
@ 2020-12-07 19:31         ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2020-12-07 19:31 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys

Han-Wen Nienhuys <hanwen@google.com> writes:

> You said earlier that maybe ref_store should hold a link to 'struct
> repository'. However, "struct repository" doesn't have a documented
> purpose (there is literally no comment documenting its purpose), so
> it's hard to tell upfront how to correctly configure the relation
> between "struct ref_store" and "struct repository".

Folks, who have been pushing to recurse into submodules in-process,
does any of you care to help out?

In order to deal with separate repositories inside a process,
instead of allowing many API functions to assume that they only have
to work in the repository "git" started in, you need to be able to
make them to work on a different repository, which would have
different configuration files, with working tree locations that are
different from that of the initial repository, and certainly would
have different set of refs, etc., and the way to do so is by passing
an instance of the repository structure.  To the API functions that
have been converted to take such an instance, working in the
repository "git" started in is done by working with the_repository
instance.



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-12-07 19:37 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 19:39 [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH Han-Wen Nienhuys via GitGitGadget
2020-12-01  4:59 ` Junio C Hamano
2020-12-03 20:23   ` Han-Wen Nienhuys
2020-12-03 23:25     ` Junio C Hamano
2020-12-07 11:52       ` Han-Wen Nienhuys
2020-12-07 19:31         ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).