* [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH @ 2020-11-26 19:39 Han-Wen Nienhuys via GitGitGadget 2020-12-01 4:59 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Han-Wen Nienhuys via GitGitGadget @ 2020-11-26 19:39 UTC (permalink / raw) To: git; +Cc: Han-Wen Nienhuys, Han-Wen Nienhuys From: Han-Wen Nienhuys <hanwen@google.com> The reftable backend needs to know the hash algorithm for writing the initialization hash table. The initial reftable contains a symref HEAD => "main" (or "master"), which is agnostic to the size of hash value, but this is an exceptional circumstance, and the reftable library does not cater for this exception. It insists that all tables in the stack have a consistent format ID for the hash algorithm. Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH. Signed-off-by: Han-Wen Nienhuys <hanwen@google.com> --- init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH The reftable backend needs to know the hash algorithm for writing the initialization hash table. The initial reftable contains a symref HEAD => "main" (or "master"), which is agnostic to the size of hash value, but this is an exceptional circumstance, and the reftable library does not cater for this exception. It insists that all tables in the stack have a consistent format ID for the hash algorithm. Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH. Signed-off-by: Han-Wen Nienhuys hanwen@google.com [hanwen@google.com] Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-924%2Fhanwen%2Fset-hash-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-924/hanwen/set-hash-v1 Pull-Request: https://github.com/git/git/pull/924 builtin/init-db.c | 1 + 1 file changed, 1 insertion(+) diff --git a/builtin/init-db.c b/builtin/init-db.c index 01bc648d41..5c8c67fec6 100644 --- a/builtin/init-db.c +++ b/builtin/init-db.c @@ -391,6 +391,7 @@ static void validate_hash_algorithm(struct repository_format *repo_fmt, int hash die(_("unknown hash algorithm '%s'"), env); repo_fmt->hash_algo = env_algo; } + repo_set_hash_algo(the_repository, repo_fmt->hash_algo); } int init_db(const char *git_dir, const char *real_git_dir, base-commit: faefdd61ec7c7f6f3c8c9907891465ac9a2a1475 -- gitgitgadget ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH 2020-11-26 19:39 [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH Han-Wen Nienhuys via GitGitGadget @ 2020-12-01 4:59 ` Junio C Hamano 2020-12-03 20:23 ` Han-Wen Nienhuys 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2020-12-01 4:59 UTC (permalink / raw) To: Han-Wen Nienhuys via GitGitGadget; +Cc: git, Han-Wen Nienhuys, Han-Wen Nienhuys "Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Han-Wen Nienhuys <hanwen@google.com> > > The reftable backend needs to know the hash algorithm for writing the > initialization hash table. > > The initial reftable contains a symref HEAD => "main" (or "master"), which is > agnostic to the size of hash value, but this is an exceptional circumstance, and > the reftable library does not cater for this exception. It insists that all > tables in the stack have a consistent format ID for the hash algorithm. > > Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH. Seeing that there is no mention of the_repository in the entire init-db.c file, that the same information is in repo_fmt which is passed to the init-db.c::create_default_files() function, and that create_default_files() is where the initial set of refs is prepared in the current system, it is not clear why this patch is needed (i.e. why we consider the current code that has no mention of the_repository is wrong). Isn't it the matter of passing the hash taken from repo_fmt to the refs API to initialize that part of the repository, instead of relying on half-initialized state in the_repository? It's not like only the hash_algo member is yet to be prepared in the_repository instance at that point in the code. Most of the members are, except for a very few fields initialized by initialize_the_repository(), not filled in the codepath, no? So, this might have been the most convenient way to pass hash_algo down but the patch does not convince me that it is the best way. It may be, but it does not answer questions like "what makes us sure that hash-algo will stay to be the only thing we need to fill early in this codepath?" If the patch _were_ to fill many other members of the_repository to make us pretend as if we did the setup_git_dir() in the repository we just created, and the hash_algo is filled merely as part of it, then it would have been much more convincing that we are moving in the right direction. It would for example mean gitdir/commondir would be filled among other members, and the existing "mkdir()" to manually prepare "refs/" for files backend in init-db.c can probably be removed and instead would be done by calling a new entry-point that initializes the initial set of refs in the refs API. And that entry-point would either rely on the_repository or take a repository instance, and use hash_algo member of that structure to do its thing. So, I dunno. > diff --git a/builtin/init-db.c b/builtin/init-db.c > index 01bc648d41..5c8c67fec6 100644 > --- a/builtin/init-db.c > +++ b/builtin/init-db.c > @@ -391,6 +391,7 @@ static void validate_hash_algorithm(struct repository_format *repo_fmt, int hash > die(_("unknown hash algorithm '%s'"), env); > repo_fmt->hash_algo = env_algo; > } > + repo_set_hash_algo(the_repository, repo_fmt->hash_algo); > } > > int init_db(const char *git_dir, const char *real_git_dir, > > base-commit: faefdd61ec7c7f6f3c8c9907891465ac9a2a1475 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH 2020-12-01 4:59 ` Junio C Hamano @ 2020-12-03 20:23 ` Han-Wen Nienhuys 2020-12-03 23:25 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Han-Wen Nienhuys @ 2020-12-03 20:23 UTC (permalink / raw) To: Junio C Hamano; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys On Tue, Dec 1, 2020 at 6:00 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Han-Wen Nienhuys <hanwen@google.com> > > > > The reftable backend needs to know the hash algorithm for writing the > > initialization hash table. > > > > The initial reftable contains a symref HEAD => "main" (or "master"), which is > > agnostic to the size of hash value, but this is an exceptional circumstance, and > > the reftable library does not cater for this exception. It insists that all > > tables in the stack have a consistent format ID for the hash algorithm. > > > > Call set_repo_hash_algo directly after reading out GIT_DEFAULT_HASH. > > Seeing that there is no mention of the_repository in the entire > init-db.c file, it's not because it's hidden in the call to refs_init_db(). refs_init_db accesses the_repository global variable, because get_main_ref_store() takes a repository (and not repository_format) argument. > Isn't it the matter of passing the hash > taken from repo_fmt to the refs API to initialize that part of the > repository, instead of relying on half-initialized state in > the_repository? The refstore is created in get_main_ref_store(). I can't add a repository_format argument there, because it has a lot of call sites where the repository_format isn't available. If the_repository is only half-initialized at this point in init_db(), then why are we passing it in refs_init_db() just a couple of lines further? At what point the_repository considered initialized? > It's not like only the hash_algo member is yet to > be prepared in the_repository instance at that point in the code. > Most of the members are, except for a very few fields initialized by > initialize_the_repository(), not filled in the codepath, no? > > So, this might have been the most convenient way to pass hash_algo > down but the patch does not convince me that it is the best way. I'm a bit at a loss here; I never learned how to cleanly work with so many global variables, so I'm happy to take your suggestion. -- Han-Wen Nienhuys - Google Munich I work 80%. Don't expect answers from me on Fridays. -- Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschäftsführer: Paul Manicle, Halimah DeLaine Prado ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH 2020-12-03 20:23 ` Han-Wen Nienhuys @ 2020-12-03 23:25 ` Junio C Hamano 2020-12-07 11:52 ` Han-Wen Nienhuys 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2020-12-03 23:25 UTC (permalink / raw) To: Han-Wen Nienhuys; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys Han-Wen Nienhuys <hanwen@google.com> writes: > If the_repository is only half-initialized at this point in init_db(), > then why are we passing it in refs_init_db() just a couple of lines > further? At what point the_repository considered initialized? I would have to say it probably depends on what callees expect. The current implementation of refs_init_db() for files backend may not need anything other than the hash algorithm enum, but many other fields are missing, and they should ideally be populated, no? For example, I see files_ref_store_create() cheats by calling get_common_dir_noenv() to find out where the commondir is, instead of ever asking the repository the ref store belongs to. At least, get_main_ref_store() is told to get the ref store that belongs to the_repository, and it would be the right place to learn relevant pieces of information (for that matter, I am not sure why struct ref_store does not have a pointer to a repository structure; perhaps we are seeing the result of piecemeal evolution, not a designed structure?). > I'm a bit at a loss here; I never learned how to cleanly work with so > many global variables, so I'm happy to take your suggestion. I am only interested in giving a clear direction to future developers where to populate the_repository's members (and nothing else) if their enhancement needs members other than the hash algorithm to be populated, as if it were the_repository initialized in an already working repository (I am not talking about many global variables, whichever you are referring to). One way to do so would probably be to do something like the attached. The patch that started this thread (or the equivalent one in the updated reftable series) may want to initialize a bit more members while at it (looking at how commondir from the_repository is not used by the refs/files-backend.c::files_init_db() to decide where the function creates the refs/heads and refs/tags directories, we probably would need to populate the_repository->commondir before that codepath can be fixed to look at the member, for example). builtin/init-db.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git c/builtin/init-db.c w/builtin/init-db.c index dcc45bef51..2b5c94596f 100644 --- c/builtin/init-db.c +++ w/builtin/init-db.c @@ -438,6 +438,27 @@ int init_db(const char *git_dir, const char *real_git_dir, validate_hash_algorithm(&repo_fmt, hash); + /* + * At this point, the_repository we have in-core does not look + * anything like one that we would see initialized in an already + * working repository after calling setup_git_directory(). + * + * Calling repository.c::initialize_the_repository() may have + * prepared the .index .objects and .parsed_objects members, but + * other members like .gitdir, .commondir, etc. have not been + * initialized. + * + * Many API functions assume they are working with the_repository + * that has sensibly been initialized, but because we haven't + * really read from an existing repository, we need to hand-craft + * the necessary members of the structure to get out of this + * chicken-and-egg situation. + * + * For now, we update the hash algorithm member to what the + * validate_hash_algorithm() call decided for us. + */ + repo_set_hash_algo(the_repository, repo_fmt->hash_algo); + reinit = create_default_files(template_dir, original_git_dir, initial_branch, &repo_fmt, flags & INIT_DB_QUIET); ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH 2020-12-03 23:25 ` Junio C Hamano @ 2020-12-07 11:52 ` Han-Wen Nienhuys 2020-12-07 19:31 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Han-Wen Nienhuys @ 2020-12-07 11:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys On Fri, Dec 4, 2020 at 12:25 AM Junio C Hamano <gitster@pobox.com> wrote: > > Han-Wen Nienhuys <hanwen@google.com> writes: > > > If the_repository is only half-initialized at this point in init_db(), > > then why are we passing it in refs_init_db() just a couple of lines > > further? At what point the_repository considered initialized? > > I would have to say it probably depends on what callees expect. The > current implementation of refs_init_db() for files backend may not > need anything other than the hash algorithm enum, but many other > fields are missing, and they should ideally be populated, no? I think so too, but it looks like a major refactoring by itself, which is only very tangentially related to the reftable effort, so I'd like to punt on it. > For example, I see files_ref_store_create() cheats by calling > get_common_dir_noenv() to find out where the commondir is, instead > of ever asking the repository the ref store belongs to. At least, > get_main_ref_store() is told to get the ref store that belongs to > the_repository, and it would be the right place to learn relevant > pieces of information (for that matter, I am not sure why struct > ref_store does not have a pointer to a repository structure; perhaps > we are seeing the result of piecemeal evolution, not a designed > structure?). > > > I'm a bit at a loss here; I never learned how to cleanly work with so > > many global variables, so I'm happy to take your suggestion. > > I am only interested in giving a clear direction to future > developers where to populate the_repository's members (and nothing > else) if their enhancement needs members other than the hash > algorithm to be populated, as if it were the_repository initialized > in an already working repository (I am not talking about many global > variables, whichever you are referring to). You said earlier that maybe ref_store should hold a link to 'struct repository'. However, "struct repository" doesn't have a documented purpose (there is literally no comment documenting its purpose), so it's hard to tell upfront how to correctly configure the relation between "struct ref_store" and "struct repository". Currently, "struct repository" has a pointer to 'struct ref_store", which makes it suspect for there to be a pointer in the other direction as well. The reason I bring up global variables is that without them, the code would enforce a much more clear relation between different entities. Looking at the definition of repository, there seems to be some overlap in functionality between struct repository, struct repository_format, struct repo_settings, and config_set, and none of these structs have a clearly documented role. Before we change code, it's probably better to spell out what these structs are thought to be. > One way to do so would probably be to do something like the > attached. Thanks, I've amended my patch. -- Han-Wen Nienhuys - Google Munich I work 80%. Don't expect answers from me on Fridays. -- Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschäftsführer: Paul Manicle, Halimah DeLaine Prado ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH 2020-12-07 11:52 ` Han-Wen Nienhuys @ 2020-12-07 19:31 ` Junio C Hamano 0 siblings, 0 replies; 6+ messages in thread From: Junio C Hamano @ 2020-12-07 19:31 UTC (permalink / raw) To: Han-Wen Nienhuys; +Cc: Han-Wen Nienhuys via GitGitGadget, git, Han-Wen Nienhuys Han-Wen Nienhuys <hanwen@google.com> writes: > You said earlier that maybe ref_store should hold a link to 'struct > repository'. However, "struct repository" doesn't have a documented > purpose (there is literally no comment documenting its purpose), so > it's hard to tell upfront how to correctly configure the relation > between "struct ref_store" and "struct repository". Folks, who have been pushing to recurse into submodules in-process, does any of you care to help out? In order to deal with separate repositories inside a process, instead of allowing many API functions to assume that they only have to work in the repository "git" started in, you need to be able to make them to work on a different repository, which would have different configuration files, with working tree locations that are different from that of the initial repository, and certainly would have different set of refs, etc., and the way to do so is by passing an instance of the repository structure. To the API functions that have been converted to take such an instance, working in the repository "git" started in is done by working with the_repository instance. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-12-07 19:37 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-11-26 19:39 [PATCH] init-db: init the_repository->hash_algo early from GIT_DEFAULT_HASH Han-Wen Nienhuys via GitGitGadget 2020-12-01 4:59 ` Junio C Hamano 2020-12-03 20:23 ` Han-Wen Nienhuys 2020-12-03 23:25 ` Junio C Hamano 2020-12-07 11:52 ` Han-Wen Nienhuys 2020-12-07 19:31 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).