[PATCH] implement submodule config cache for lookup of submodule names

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] implement submodule config cache for lookup of submodule names
@ 2014-03-10 21:24 Heiko Voigt
  2014-03-11 21:58 ` Jeff King
  2014-03-12  2:28 ` Jonathan Nieder
  0 siblings, 2 replies; 6+ messages in thread
From: Heiko Voigt @ 2014-03-10 21:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jens Lehmann, Jeff King, Jonathan Nieder

This submodule configuration cache allows us to lazily read .gitmodules
configurations by commit into a runtime cache which can then be used to
easily lookup values from it. Currently only the values for path or name
are stored but it can be extended for any value needed.

It is expected that .gitmodules files do not change often between
commits. Thats why we lookup the .gitmodules sha1 from a commit and then
either lookup an already parsed configuration or parse and cache an
unknown one for each sha1. The cache is lazily build on demand for each
requested commit.

This cache can be used for all purposes which need knowledge about
submodule configurations. Example use cases are:

 * Recursive submodule checkout needs lookup a submodule name from its
   path when a submodule first appears. This needs be done before this
   configuration exists in the worktree.

 * The implementation of submodule support for 'git archive' needs to
   lookup the submodule name to generate the archive when given a
   revision that is not checked out.

 * 'git fetch' when given the --recurse-submodules=on-demand option (or
   configuration) needs to lookup submodule names by path from the
   database rather than reading from the worktree. For new submodule it
   needs to lookup the name from its path to allow cloning new
   submodules into the .git folder so they can be checked out without
   any network interaction when the user does a checkout of that
   revision.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---

This was originally part of my submodule fetch series[1] of moved or new
submodules. There has been an RFC patch separating this[2] to use it
independently for git-archive. This updated patch now uses the new
hashmap implementation from Karsten Blees.

I have also moved all functions into the new submodule-config-cache
module. I am not completely satisfied with the naming since it is quite
long. If someone comes up with some different naming I am open for it.
Maybe simply submodule-config (submodule_config prefix for functions)?

Next I am planning to extend this so configuration cache so it will
return the local configuration (with the usual .gitmodules,
/etc/gitconfig, .git/config, ... overlay of variables) when you pass it
null_sha1 for gitmodule or commit sha1. That way we can give this
configuration cache some use and implement its usage for the current
configuration variables in submodule.c. There we could get rid of the
global string_lists and only have one global submodule_config_cache
structure from which values can be read.

Let me know what you think.

Cheers Heiko

[1] http://article.gmane.org/gmane.comp.version-control.git/217018
[2] http://article.gmane.org/gmane.comp.version-control.git/239094

 .gitignore                                         |   1 +
 .../technical/api-submodule-config-cache.txt       |  39 ++++
 Makefile                                           |   2 +
 submodule-config-cache.c                           | 256 +++++++++++++++++++++
 submodule-config-cache.h                           |  26 +++
 t/t7410-submodule-config-cache.sh                  |  74 ++++++
 test-submodule-config-cache.c                      |  50 ++++
 7 files changed, 448 insertions(+)
 create mode 100644 Documentation/technical/api-submodule-config-cache.txt
 create mode 100644 submodule-config-cache.c
 create mode 100644 submodule-config-cache.h
 create mode 100755 t/t7410-submodule-config-cache.sh
 create mode 100644 test-submodule-config-cache.c

diff --git a/.gitignore b/.gitignore
index dc600f9..ed286e4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -198,6 +198,7 @@
 /test-sha1
 /test-sigchain
 /test-string-list
+/test-submodule-config-cache
 /test-subprocess
 /test-svn-fe
 /test-urlmatch-normalization
diff --git a/Documentation/technical/api-submodule-config-cache.txt b/Documentation/technical/api-submodule-config-cache.txt
new file mode 100644
index 0000000..e3c80e8
--- /dev/null
+++ b/Documentation/technical/api-submodule-config-cache.txt
@@ -0,0 +1,39 @@
+submodule config cache API
+==========================
+
+The submodule config cache API allows to lazily read submodule
+configurations from multiple revisions into a cache that can then be
+used for subsequent lookups of submodule configurations either by its
+path or name.
+
+Data Structures
+---------------
+
+`struct submodule_config_cache`::
+
+	The configuration cache which stores everything needed for
+	lookup. All functions for lookup or insertion operate
+	on this structure.
+
+`struct submodule_config`::
+
+	This structure stores the configuration for one submodule for a
+	certain revision. It is returned by the lookup functions.
+
+Functions
+---------
+
+`void submodule_config_cache_init(struct submodule_config_cache *cache)`::
+`void submodule_config_cache_free(struct submodule_config_cache *cache)`::
+
+	Use these to initialize before and free a cache after your
+	lookups.
+
+`struct submodule_config *get_submodule_config_from_path(
+			struct submodule_config_cache *cache,
+			const unsigned char *commit_sha1,
+			const char *path)::
+
+	Lookup a submodules configuration by its commit_sha1 and path.
+
+For an example usage see test-submodule-config-cache.c.
diff --git a/Makefile b/Makefile
index d4ce53a..d2459bd 100644
--- a/Makefile
+++ b/Makefile
@@ -572,6 +572,7 @@ TEST_PROGRAMS_NEED_X += test-scrap-cache-tree
 TEST_PROGRAMS_NEED_X += test-sha1
 TEST_PROGRAMS_NEED_X += test-sigchain
 TEST_PROGRAMS_NEED_X += test-string-list
+TEST_PROGRAMS_NEED_X += test-submodule-config-cache
 TEST_PROGRAMS_NEED_X += test-subprocess
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-urlmatch-normalization
@@ -880,6 +881,7 @@ LIB_OBJS += strbuf.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
 LIB_OBJS += submodule.o
+LIB_OBJS += submodule-config-cache.o
 LIB_OBJS += symlinks.o
 LIB_OBJS += tag.o
 LIB_OBJS += trace.o
diff --git a/submodule-config-cache.c b/submodule-config-cache.c
new file mode 100644
index 0000000..5a7ffea
--- /dev/null
+++ b/submodule-config-cache.c
@@ -0,0 +1,256 @@
+#include "cache.h"
+#include "submodule-config-cache.h"
+#include "strbuf.h"
+
+struct submodule_config_entry {
+	struct hashmap_entry ent;
+	struct submodule_config *config;
+};
+
+static int submodule_config_path_cmp(const struct submodule_config_entry *a,
+				     const struct submodule_config_entry *b,
+				     const void *unused)
+{
+	int ret;
+	if ((ret = strcmp(a->config->path.buf, b->config->path.buf)))
+		return ret;
+	if ((ret = hashcmp(a->config->gitmodule_sha1, b->config->gitmodule_sha1)))
+		return ret;
+
+	return 0;
+}
+
+static int submodule_config_name_cmp(const struct submodule_config_entry *a,
+				     const struct submodule_config_entry *b,
+				     const void *unused)
+{
+	int ret;
+	if ((ret = strcmp(a->config->name.buf, b->config->name.buf)))
+		return ret;
+	if ((ret = hashcmp(a->config->gitmodule_sha1, b->config->gitmodule_sha1)))
+		return ret;
+
+	return 0;
+}
+
+void submodule_config_cache_init(struct submodule_config_cache *cache)
+{
+	hashmap_init(&cache->for_path, (hashmap_cmp_fn) submodule_config_path_cmp, 0);
+	hashmap_init(&cache->for_name, (hashmap_cmp_fn) submodule_config_name_cmp, 0);
+}
+
+static int free_one_submodule_config(struct submodule_config_entry *entry)
+{
+	strbuf_release(&entry->config->path);
+	strbuf_release(&entry->config->name);
+	free(entry->config);
+
+	return 0;
+}
+
+void submodule_config_cache_free(struct submodule_config_cache *cache)
+{
+	struct hashmap_iter iter;
+	struct submodule_config_entry *entry;
+
+	/*
+	 * NOTE: we iterate over the name hash here since
+	 * submodule_config entries are allocated by their gitmodule
+	 * sha1 and submodule name.
+	 */
+	hashmap_iter_init(&cache->for_name, &iter);
+	while ((entry = hashmap_iter_next(&iter)))
+		free_one_submodule_config(entry);
+
+	hashmap_free(&cache->for_path, 1);
+	hashmap_free(&cache->for_name, 1);
+}
+
+static unsigned int hash_sha1_string(const unsigned char *sha1, const char *string)
+{
+	return memhash(sha1, 20) + strhash(string);
+}
+
+static void submodule_config_cache_update_path(struct submodule_config_cache *cache,
+		struct submodule_config *config)
+{
+	unsigned int hash = hash_sha1_string(config->gitmodule_sha1,
+					     config->path.buf);
+	struct submodule_config_entry *e = xmalloc(sizeof(*e));
+	hashmap_entry_init(e, hash);
+	e->config = config;
+	hashmap_put(&cache->for_path, e);
+}
+
+static void submodule_config_cache_insert(struct submodule_config_cache *cache, struct submodule_config *config)
+{
+	unsigned int hash = hash_sha1_string(config->gitmodule_sha1,
+					     config->name.buf);
+	struct submodule_config_entry *e = xmalloc(sizeof(*e));
+	hashmap_entry_init(e, hash);
+	e->config = config;
+	hashmap_add(&cache->for_name, e);
+}
+
+static struct submodule_config *submodule_config_cache_lookup_path(struct submodule_config_cache *cache,
+	const unsigned char *gitmodule_sha1, const char *path)
+{
+	struct submodule_config_entry *entry;
+	unsigned int hash = hash_sha1_string(gitmodule_sha1, path);
+	struct submodule_config_entry key;
+	struct submodule_config key_config;
+
+	hashcpy(key_config.gitmodule_sha1, gitmodule_sha1);
+	key_config.path.buf = (char *) path;
+
+	hashmap_entry_init(&key, hash);
+	key.config = &key_config;
+
+	entry = hashmap_get(&cache->for_path, &key, NULL);
+	if (entry)
+		return entry->config;
+	return NULL;
+}
+
+static struct submodule_config *submodule_config_cache_lookup_name(struct submodule_config_cache *cache,
+	const unsigned char *gitmodule_sha1, const char *name)
+{
+	struct submodule_config_entry *entry;
+	unsigned int hash = hash_sha1_string(gitmodule_sha1, name);
+	struct submodule_config_entry key;
+	struct submodule_config key_config;
+
+	hashcpy(key_config.gitmodule_sha1, gitmodule_sha1);
+	key_config.name.buf = (char *) name;
+
+	hashmap_entry_init(&key, hash);
+	key.config = &key_config;
+
+	entry = hashmap_get(&cache->for_name, &key, NULL);
+	if (entry)
+		return entry->config;
+	return NULL;
+}
+
+struct parse_submodule_config_parameter {
+	unsigned char *gitmodule_sha1;
+	struct submodule_config_cache *cache;
+};
+
+static int name_and_item_from_var(const char *var, struct strbuf *name, struct strbuf *item)
+{
+	/* find the name and add it */
+	strbuf_addstr(name, var + strlen("submodule."));
+	char *end = strrchr(name->buf, '.');
+	if (!end) {
+		strbuf_release(name);
+		return 0;
+	}
+	*end = '\0';
+	if (((end + 1) - name->buf) < name->len)
+		strbuf_addstr(item, end + 1);
+
+	return 1;
+}
+
+static struct submodule_config *lookup_or_create_by_name(struct submodule_config_cache *cache,
+		unsigned char *gitmodule_sha1, const char *name)
+{
+	struct submodule_config *config;
+	config = submodule_config_cache_lookup_name(cache, gitmodule_sha1, name);
+	if (config)
+		return config;
+
+	config = xmalloc(sizeof(*config));
+
+	strbuf_init(&config->name, 0);
+	strbuf_addstr(&config->name, name);
+
+	strbuf_init(&config->path, 0);
+
+	hashcpy(config->gitmodule_sha1, gitmodule_sha1);
+
+	submodule_config_cache_insert(cache, config);
+
+	return config;
+}
+
+static void warn_multiple_config(struct submodule_config *config, const char *option)
+{
+	warning("%s:.gitmodules, multiple configurations found for submodule.%s.%s. "
+			"Skipping second one!", sha1_to_hex(config->gitmodule_sha1),
+			option, config->name.buf);
+}
+
+static int parse_submodule_config_into_cache(const char *var, const char *value, void *data)
+{
+	struct parse_submodule_config_parameter *me = data;
+	struct submodule_config *submodule_config;
+	struct strbuf name = STRBUF_INIT, item = STRBUF_INIT;
+
+	/* We only read submodule.<name> entries */
+	if (prefixcmp(var, "submodule."))
+		return 0;
+
+	if (!name_and_item_from_var(var, &name, &item))
+		return 0;
+
+	submodule_config = lookup_or_create_by_name(me->cache, me->gitmodule_sha1, name.buf);
+
+	if (!suffixcmp(var, ".path")) {
+		if (*submodule_config->path.buf != '\0') {
+			warn_multiple_config(submodule_config, "path");
+			return 0;
+		}
+		strbuf_addstr(&submodule_config->path, value);
+		submodule_config_cache_update_path(me->cache, submodule_config);
+	}
+
+	strbuf_release(&name);
+	strbuf_release(&item);
+
+	return 0;
+}
+
+struct submodule_config *submodule_config_from_path(struct submodule_config_cache *cache,
+		const unsigned char *commit_sha1, const char *path)
+{
+	struct strbuf rev = STRBUF_INIT;
+	unsigned long config_size;
+	char *config;
+	unsigned char sha1[20];
+	enum object_type type;
+	struct submodule_config *submodule_config = NULL;
+	struct parse_submodule_config_parameter parameter;
+
+
+	strbuf_addf(&rev, "%s:.gitmodules", sha1_to_hex(commit_sha1));
+	if (get_sha1(rev.buf, sha1) < 0)
+		goto free_rev;
+
+	submodule_config = submodule_config_cache_lookup_path(cache, sha1, path);
+	if (submodule_config)
+		goto free_rev;
+
+	config = read_sha1_file(sha1, &type, &config_size);
+	if (!config)
+		goto free_rev;
+
+	if (type != OBJ_BLOB) {
+		free(config);
+		goto free_rev;
+	}
+
+	/* fill the submodule config into the cache */
+	parameter.cache = cache;
+	parameter.gitmodule_sha1 = sha1;
+	git_config_from_buf(parse_submodule_config_into_cache, rev.buf,
+			config, config_size, &parameter);
+	free(config);
+
+	submodule_config = submodule_config_cache_lookup_path(cache, sha1, path);
+
+free_rev:
+	strbuf_release(&rev);
+	return submodule_config;
+}
diff --git a/submodule-config-cache.h b/submodule-config-cache.h
new file mode 100644
index 0000000..18524b0
--- /dev/null
+++ b/submodule-config-cache.h
@@ -0,0 +1,26 @@
+#ifndef SUBMODULE_CONFIG_CACHE_H
+#define SUBMODULE_CONFIG_CACHE_H
+
+#include "hashmap.h"
+#include "strbuf.h"
+
+struct submodule_config_cache {
+	struct hashmap for_path;
+	struct hashmap for_name;
+};
+
+/* one submodule_config_cache entry */
+struct submodule_config {
+	struct strbuf path;
+	struct strbuf name;
+	unsigned char gitmodule_sha1[20];
+};
+
+void submodule_config_cache_init(struct submodule_config_cache *cache);
+void submodule_config_cache_free(struct submodule_config_cache *cache);
+
+struct submodule_config *submodule_config_from_path(
+		struct submodule_config_cache *cache,
+		const unsigned char *commit_sha1, const char *path);
+
+#endif /* SUBMODULE_CONFIG_CACHE_H */
diff --git a/t/t7410-submodule-config-cache.sh b/t/t7410-submodule-config-cache.sh
new file mode 100755
index 0000000..9193e9d
--- /dev/null
+++ b/t/t7410-submodule-config-cache.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+#
+# Copyright (c) 2014 Heiko Voigt
+#
+
+test_description='Test submodules config cache infrastructure
+
+This test verifies that parsing .gitmodules configuration directly
+from the database works.
+'
+
+TEST_NO_CREATE_REPO=1
+. ./test-lib.sh
+
+test_expect_success 'submodule config cache setup' '
+	mkdir submodule &&
+	(cd submodule &&
+		git init
+		echo a >a &&
+		git add . &&
+		git commit -ma
+	) &&
+	mkdir super &&
+	(cd super &&
+		git init &&
+		git submodule add ../submodule &&
+		git submodule add ../submodule a &&
+		git commit -m "add as submodule and as a" &&
+		git mv a b &&
+		git commit -m "move a to b"
+	)
+'
+
+cat >super/expect <<EOF
+Submodule name: 'a' for path 'a'
+Submodule name: 'a' for path 'b'
+Submodule name: 'submodule' for path 'submodule'
+Submodule name: 'submodule' for path 'submodule'
+EOF
+
+test_expect_success 'test parsing of submodule config' '
+	(cd super &&
+		test-submodule-config-cache \
+			HEAD^ a \
+			HEAD b \
+			HEAD^ submodule \
+			HEAD submodule \
+				>actual &&
+		test_cmp expect actual
+	)
+'
+
+cat >super/expect_error <<EOF
+Submodule name: 'a' for path 'b'
+Submodule name: 'submodule' for path 'submodule'
+EOF
+
+test_expect_success 'error in one submodule config lets continue' '
+	(cd super &&
+		cp .gitmodules .gitmodules.bak &&
+		echo "	value = \"" >>.gitmodules &&
+		git add .gitmodules &&
+		mv .gitmodules.bak .gitmodules &&
+		git commit -m "add error" &&
+		test-submodule-config-cache \
+			HEAD b \
+			HEAD submodule \
+				>actual &&
+		test_cmp expect_error actual &&
+		git checkout .gitmodules
+	)
+'
+
+test_done
diff --git a/test-submodule-config-cache.c b/test-submodule-config-cache.c
new file mode 100644
index 0000000..b893d94
--- /dev/null
+++ b/test-submodule-config-cache.c
@@ -0,0 +1,50 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "cache.h"
+#include "submodule-config-cache.h"
+
+static void die_usage(int argc, char **argv, const char *msg)
+{
+	fprintf(stderr, "%s\n", msg);
+	fprintf(stderr, "Usage: %s [<commit> <submodulepath>] ...\n", argv[0]);
+	exit(1);
+}
+
+int main(int argc, char **argv)
+{
+	char **arg = argv;
+	struct submodule_config_cache submodule_config_cache;
+
+	if ((argc - 1) % 2 != 0)
+		die_usage(argc, argv, "Wrong number of arguments.");
+
+	submodule_config_cache_init(&submodule_config_cache);
+
+	arg++;
+	while (*arg) {
+		unsigned char commit_sha1[20];
+		struct submodule_config *submodule_config;
+		const char *commit;
+		const char *path;
+
+		commit = arg[0];
+		path = arg[1];
+
+		if (get_sha1(commit, commit_sha1) < 0)
+			die_usage(argc, argv, "Commit not found.");
+
+		submodule_config = submodule_config_from_path(&submodule_config_cache,
+				commit_sha1, path);
+		if (!submodule_config)
+			die_usage(argc, argv, "Submodule config not found.");
+
+		printf("Submodule name: '%s' for path '%s'\n", submodule_config->name.buf, path);
+		arg += 2;
+	}
+
+	submodule_config_cache_free(&submodule_config_cache);
+
+	return 0;
+}
-- 
1.9.0.158.gd3fc797

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] implement submodule config cache for lookup of submodule names
  2014-03-10 21:24 [PATCH] implement submodule config cache for lookup of submodule names Heiko Voigt
@ 2014-03-11 21:58 ` Jeff King
  2014-03-12 17:07   ` Heiko Voigt
  2014-03-12  2:28 ` Jonathan Nieder
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2014-03-11 21:58 UTC (permalink / raw)
  To: Heiko Voigt; +Cc: Junio C Hamano, git, Jens Lehmann, Jonathan Nieder

On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:

> I have also moved all functions into the new submodule-config-cache
> module. I am not completely satisfied with the naming since it is quite
> long. If someone comes up with some different naming I am open for it.
> Maybe simply submodule-config (submodule_config prefix for functions)?

Since the cache is totally internal to the submodule-config code, I do
not know that you even need to have a nice submodule-config-cache API.
Can it just be a singleton?

That is bad design in a sense (it becomes harder later if you ever want
to pull submodule config from two sources simultaneously), but it
matches many other subsystems in git which cache behind the scenes.

I also suspect you could call submodule_config simply "submodule", and
let it be a stand-in for the submodule (for now, only data from the
config, but potentially you could keep other data on it, too).

So with all that, the entry point into your code is just:

  const struct submodule *submodule_from_path(const char *path);

and the caching magically happens behind-the-scenes.

But take all of this with a giant grain of salt, as I am not too
familiar with the needs of the callers.

> +/* one submodule_config_cache entry */
> +struct submodule_config {
> +	struct strbuf path;
> +	struct strbuf name;
> +	unsigned char gitmodule_sha1[20];
> +};

Do these strings need changed after they are written once? If not, you
may want to just make these bare pointers (you can still use strbufs to
create them, and then strbuf_detach at the end). That may just be a matter of
taste, though.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] implement submodule config cache for lookup of submodule names
  2014-03-10 21:24 [PATCH] implement submodule config cache for lookup of submodule names Heiko Voigt
  2014-03-11 21:58 ` Jeff King
@ 2014-03-12  2:28 ` Jonathan Nieder
  2014-03-12 23:58   ` Heiko Voigt
  1 sibling, 1 reply; 6+ messages in thread
From: Jonathan Nieder @ 2014-03-12  2:28 UTC (permalink / raw)
  To: Heiko Voigt; +Cc: Junio C Hamano, git, Jens Lehmann, Jeff King

Hi,

Some quick thoughts.

Heiko Voigt wrote:

> This submodule configuration cache allows us to lazily read .gitmodules
> configurations by commit into a runtime cache which can then be used to
> easily lookup values from it. Currently only the values for path or name
> are stored but it can be extended for any value needed.

Makes sense.

[...]
> Next I am planning to extend this so configuration cache so it will
> return the local configuration (with the usual .gitmodules,
> /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
> null_sha1 for gitmodule or commit sha1. That way we can give this
> configuration cache some use and implement its usage for the current
> configuration variables in submodule.c.

Yay!

I wonder whether local configuration should also override information
from commits at times.

[...]
> --- /dev/null
> +++ b/Documentation/technical/api-submodule-config-cache.txt
> @@ -0,0 +1,39 @@
> +submodule config cache API
> +==========================

Most API documents in Documentation/technical have a section explaining
general usage --- e.g. (from api-revision-walking),

	Calling sequence
	----------------

	The walking API has a given calling sequence: first you need to
	initialize a rev_info structure, then add revisions to control what kind
	of revision list do you want to get, finally you can iterate over the
	revision list.

Or (from api-string-list):

	The caller:

	. Allocates and clears a `struct string_list` variable.

	. Initializes the members. You might want to set the flag `strdup_strings`
	  if the strings should be strdup()ed. For example, this is necessary
[...]
	. Can remove items not matching a criterion from a sorted or unsorted
	  list using `filter_string_list`, or remove empty strings using
	  `string_list_remove_empty_items`.

	. Finally it should free the list using `string_list_clear`.

This makes it easier to get started by looking at the documentation alone
without having to mimic an example.

Another thought: it's tempting to call this the 'submodule config API' and
treat the (optional?) memoization as an implementation detail instead of
reminding the caller of it too much.  But I haven't thought that through
completely.

[...]
> +`struct submodule_config *get_submodule_config_from_path(
> +			struct submodule_config_cache *cache,
> +			const unsigned char *commit_sha1,
> +			const char *path)::
> +
> +	Lookup a submodules configuration by its commit_sha1 and path.

The cache just always grows until it's freed, right?  Would it make
sense to allow cached configurations to be explicitly evicted when the
caller is done with them some day?  (Just curious, not a complaint.
Might be worth mentioning in the overview how the cache is expected to
be used, though.)

[...]
> --- /dev/null
> +++ b/submodule-config-cache.h
> @@ -0,0 +1,26 @@
> +#ifndef SUBMODULE_CONFIG_CACHE_H
> +#define SUBMODULE_CONFIG_CACHE_H
> +
> +#include "hashmap.h"
> +#include "strbuf.h"
> +
> +struct submodule_config_cache {
> +	struct hashmap for_path;
> +	struct hashmap for_name;
> +};

Could use comments (e.g., saying what keys each maps to what values).
Would callers ever look at the hashmaps directly or are they only
supposed to be accessed using helper functions that take a whole
struct?

[...]
> +
> +/* one submodule_config_cache entry */
> +struct submodule_config {
> +	struct strbuf path;
> +	struct strbuf name;
> +	unsigned char gitmodule_sha1[20];

Could also use comments.  What does the gitmodule_sha1 field represent?

[...]
> --- /dev/null
> +++ b/submodule-config-cache.c
> @@ -0,0 +1,256 @@
> +#include "cache.h"
> +#include "submodule-config-cache.h"
> +#include "strbuf.h"
> +
> +struct submodule_config_entry {
> +	struct hashmap_entry ent;
> +	struct submodule_config *config;
> +};
> +
> +static int submodule_config_path_cmp(const struct submodule_config_entry *a,
> +				     const struct submodule_config_entry *b,
> +				     const void *unused)
> +{
> +	int ret;
> +	if ((ret = strcmp(a->config->path.buf, b->config->path.buf)))
> +		return ret;

Style: avoid assignments in conditionals:

	ret = strcmp(...);
	if (ret)
		return ret;

	return hashcmp(...);

The actual return value from strcmp isn't important since
hashmap_cmp_fn is only used to check for equality.  Perhaps as a
separate change it would make sense to make it a hashmap_equality_fn
some day:

	return !strcmp(...) && !hashcmp(...);

This checks both the path and gitmodule_sha1, so I guess it's for
looking up submodule names.

[...]
> +static int submodule_config_name_cmp(const struct submodule_config_entry *a,
> +				     const struct submodule_config_entry *b,
> +				     const void *unused)

Likewise.

[...]
> +void submodule_config_cache_init(struct submodule_config_cache *cache)
> +{
> +	hashmap_init(&cache->for_path, (hashmap_cmp_fn) submodule_config_path_cmp, 0);
> +	hashmap_init(&cache->for_name, (hashmap_cmp_fn) submodule_config_name_cmp, 0);

Long lines.

I don't like the casts (they make it hard to catch if e.g. the number
of arguments expected for a hashmap_cmp_fn changes) but it's the style
used elsewhere.   Ok.

[...]
> +void submodule_config_cache_free(struct submodule_config_cache *cache)
> +{
> +	struct hashmap_iter iter;
> +	struct submodule_config_entry *entry;
> +
> +	/*
> +	 * NOTE: we iterate over the name hash here since
> +	 * submodule_config entries are allocated by their gitmodule
> +	 * sha1 and submodule name.
> +	 */

No need to say 'NOTE' --- it's just as clear without that word that
this is a comment.

Not sure I understand the comment.  Is the point that you only have
to iterate over one of the hashmaps, or is there something more
subtle going on?

[...]
> +	hashmap_free(&cache->for_path, 1);
> +	hashmap_free(&cache->for_name, 1);

Not your fault: this contextless '1' is unhelpful.  hashmap_free could
take a flag HASHMAP_FREE_ENTRIES.

[...]
> +static unsigned int hash_sha1_string(const unsigned char *sha1, const char *string)
> +{
> +	return memhash(sha1, 20) + strhash(string);
> +}

Feels a bit unconventional.  I can't find a strong reason to mind.

When Boost's

	size_t seed = 0;
	hash_combine(seed, x);
	hash_combine(seed, y);

does

	hash = hash_value(x) + 0x9e3779b9
	hash ^= hash_value(y) + 0x9e3779b9 + (seed<<6) + (seed>>2);

I suspect some of the complication is because they expect to be mixing
more values in a loop.

> +
> +static void submodule_config_cache_update_path(struct submodule_config_cache *cache,
> +		struct submodule_config *config)
> +{
> +	unsigned int hash = hash_sha1_string(config->gitmodule_sha1,
> +					     config->path.buf);
> +	struct submodule_config_entry *e = xmalloc(sizeof(*e));
> +	hashmap_entry_init(e, hash);
> +	e->config = config;
> +	hashmap_put(&cache->for_path, e);
> +}

Naming nit: maybe ..._put for consistency with the hashmap API?

[...]
> +static void submodule_config_cache_insert(struct submodule_config_cache *cache, struct submodule_config *config)

Likewise, maybe _add?

[...]
> +static struct submodule_config *submodule_config_cache_lookup_path(struct submodule_config_cache *cache,
> +	const unsigned char *gitmodule_sha1, const char *path)
> +{
> +	struct submodule_config_entry *entry;
> +	unsigned int hash = hash_sha1_string(gitmodule_sha1, path);
> +	struct submodule_config_entry key;
> +	struct submodule_config key_config;
> +
> +	hashcpy(key_config.gitmodule_sha1, gitmodule_sha1);
> +	key_config.path.buf = (char *) path;

An odd use of strbuf.

My first reaction was "Shouldn't this use strbuf_addstr?".  That way
it would be clear who has responsibility to free it and the alloc and
len fields would be right.

But looking more closely, it seems that key_config.path.buf is just
being used to transmit a \0-terminated string.  Would making struct
submodule_config::path a plain "const char *" work?  Or perhaps struct
submodule_config_entry should be a tagged union or similar to make
both the variant that owns its name and path and the variant that
borrows buffers possible and comparable.

[...]
> +struct parse_submodule_config_parameter {
> +	unsigned char *gitmodule_sha1;
> +	struct submodule_config_cache *cache;
> +};

What is this struct used for?  A comment or moving it closer to where
it is used could help.

[...]
> +static void warn_multiple_config(struct submodule_config *config, const char *option)
> +{
> +	warning("%s:.gitmodules, multiple configurations found for submodule.%s.%s. "
> +			"Skipping second one!", sha1_to_hex(config->gitmodule_sha1),
> +			option, config->name.buf);

Ah, so gitmodule_sha1 is a commit id?

I think a name like "commit" for the field would have left me less
confused (and a comment could explain that what the hash values store
is information gleaned from .gitmodules in that commit).

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Re: [PATCH] implement submodule config cache for lookup of submodule names
  2014-03-11 21:58 ` Jeff King
@ 2014-03-12 17:07   ` Heiko Voigt
  0 siblings, 0 replies; 6+ messages in thread
From: Heiko Voigt @ 2014-03-12 17:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Jens Lehmann, Jonathan Nieder

On Tue, Mar 11, 2014 at 05:58:08PM -0400, Jeff King wrote:
> On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:
> 
> > I have also moved all functions into the new submodule-config-cache
> > module. I am not completely satisfied with the naming since it is quite
> > long. If someone comes up with some different naming I am open for it.
> > Maybe simply submodule-config (submodule_config prefix for functions)?
> 
> Since the cache is totally internal to the submodule-config code, I do
> not know that you even need to have a nice submodule-config-cache API.
> Can it just be a singleton?
> 
> That is bad design in a sense (it becomes harder later if you ever want
> to pull submodule config from two sources simultaneously), but it
> matches many other subsystems in git which cache behind the scenes.
> 
> I also suspect you could call submodule_config simply "submodule", and
> let it be a stand-in for the submodule (for now, only data from the
> config, but potentially you could keep other data on it, too).
> 
> So with all that, the entry point into your code is just:
> 
>   const struct submodule *submodule_from_path(const char *path);
> 
> and the caching magically happens behind-the-scenes.

Actually since we also need to define the revision from which we request
these submodule values that would become:

   const struct submodule *submodule_from_path(const unsigned char *commit_sha1,
   					const char *path);

Since the configuration for submodules for a submodule are identified by
a unique commit sha1 and its unique path (or unique name) I think it is
feasible to make it a singleton. That would also simplify using it from
the existing config parsing code.

To be future proof I can internally keep the object oriented approach
always passing on the submodule_config_cache pointer. That way it would
be easy to expose to the outside in case we later need multiple caches.

So I currently I do not see any downside of making it a singleton to the
outside and would go with that.

> > +/* one submodule_config_cache entry */
> > +struct submodule_config {
> > +	struct strbuf path;
> > +	struct strbuf name;
> > +	unsigned char gitmodule_sha1[20];
> > +};
> 
> Do these strings need changed after they are written once? If not, you
> may want to just make these bare pointers (you can still use strbufs to
> create them, and then strbuf_detach at the end). That may just be a matter of
> taste, though.

No they do not need to be changed after parsing, since every path,
name mapping should be unique in one .gitmodule file. And I think it
actually would make the code more clear in one instance where I directly
set the buf pointer which Jonathan mentioned. There it is needed only
for the hashmap lookup.

Cheers Heiko

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Re: [PATCH] implement submodule config cache for lookup of submodule names
  2014-03-12  2:28 ` Jonathan Nieder
@ 2014-03-12 23:58   ` Heiko Voigt
  2014-03-13  0:46     ` Jonathan Nieder
  0 siblings, 1 reply; 6+ messages in thread
From: Heiko Voigt @ 2014-03-12 23:58 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Junio C Hamano, git, Jens Lehmann, Jeff King

Hi,

On Tue, Mar 11, 2014 at 07:28:52PM -0700, Jonathan Nieder wrote:
> Heiko Voigt wrote:
> 
> > This submodule configuration cache allows us to lazily read .gitmodules
> > configurations by commit into a runtime cache which can then be used to
> > easily lookup values from it. Currently only the values for path or name
> > are stored but it can be extended for any value needed.
> 
> Makes sense.
> 
> [...]
> > Next I am planning to extend this so configuration cache so it will
> > return the local configuration (with the usual .gitmodules,
> > /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
> > null_sha1 for gitmodule or commit sha1. That way we can give this
> > configuration cache some use and implement its usage for the current
> > configuration variables in submodule.c.
> 
> Yay!
> 
> I wonder whether local configuration should also override information
> from commits at times.

Yeah but for that I plan a different code path that solves conflicts
and/or extend missing values. I think its best to keep the submodule
configurations by commit as simple as possible. But we will see once I
get to the point where we need this.

> [...]
> > --- /dev/null
> > +++ b/Documentation/technical/api-submodule-config-cache.txt
> > @@ -0,0 +1,39 @@
> > +submodule config cache API
> > +==========================
> 
> Most API documents in Documentation/technical have a section explaining
> general usage --- e.g. (from api-revision-walking),
> 
> 	Calling sequence
> 	----------------
> 
> 	The walking API has a given calling sequence: first you need to
> 	initialize a rev_info structure, then add revisions to control what kind
> 	of revision list do you want to get, finally you can iterate over the
> 	revision list.
> 
> Or (from api-string-list):
> 
> 	The caller:
> 
> 	. Allocates and clears a `struct string_list` variable.
> 
> 	. Initializes the members. You might want to set the flag `strdup_strings`
> 	  if the strings should be strdup()ed. For example, this is necessary
> [...]
> 	. Can remove items not matching a criterion from a sorted or unsorted
> 	  list using `filter_string_list`, or remove empty strings using
> 	  `string_list_remove_empty_items`.
> 
> 	. Finally it should free the list using `string_list_clear`.
> 
> This makes it easier to get started by looking at the documentation alone
> without having to mimic an example.

True, will extend that in the next iteration.

> Another thought: it's tempting to call this the 'submodule config API' and
> treat the (optional?) memoization as an implementation detail instead of
> reminding the caller of it too much.  But I haven't thought that through
> completely.

Thats the same point Jeff mentioned and I think that will simplify many
things. As I mentioned in the answer to Jeff I will keep passing around
the cache pointer internally but expose only functions without it to the
outside to be future proof but also easy to use for the current use
case.

> [...]
> > +`struct submodule_config *get_submodule_config_from_path(
> > +			struct submodule_config_cache *cache,
> > +			const unsigned char *commit_sha1,
> > +			const char *path)::
> > +
> > +	Lookup a submodules configuration by its commit_sha1 and path.
> 
> The cache just always grows until it's freed, right?  Would it make
> sense to allow cached configurations to be explicitly evicted when the
> caller is done with them some day?  (Just curious, not a complaint.
> Might be worth mentioning in the overview how the cache is expected to
> be used, though.)

Yes it only grows at the moment. Not sure whether we need to remove
configurations. Currently the only use case that comes to my mind would
be if it grows to big to be kept in memory, but at the moment that seems
far away for me, so I would leave that for a future extension. It will
be hard to know whether someone is done with a certain .gitmodule file
since we cache it by its sha1 expecting it to be the same over many
revisions.

> [...]
> > --- /dev/null
> > +++ b/submodule-config-cache.h
> > @@ -0,0 +1,26 @@
> > +#ifndef SUBMODULE_CONFIG_CACHE_H
> > +#define SUBMODULE_CONFIG_CACHE_H
> > +
> > +#include "hashmap.h"
> > +#include "strbuf.h"
> > +
> > +struct submodule_config_cache {
> > +	struct hashmap for_path;
> > +	struct hashmap for_name;
> > +};
> 
> Could use comments (e.g., saying what keys each maps to what values).
> Would callers ever look at the hashmaps directly or are they only
> supposed to be accessed using helper functions that take a whole
> struct?

Users should only use the helper functions, but I will hide this in the
submodule-config module. Will add some comment as well.

> [...]
> > +
> > +/* one submodule_config_cache entry */
> > +struct submodule_config {
> > +	struct strbuf path;
> > +	struct strbuf name;
> > +	unsigned char gitmodule_sha1[20];
> 
> Could also use comments.  What does the gitmodule_sha1 field represent?

Thats the sha1 of the parsed .gitmodule file blob.

> [...]
> > --- /dev/null
> > +++ b/submodule-config-cache.c
> > @@ -0,0 +1,256 @@
> > +#include "cache.h"
> > +#include "submodule-config-cache.h"
> > +#include "strbuf.h"
> > +
> > +struct submodule_config_entry {
> > +	struct hashmap_entry ent;
> > +	struct submodule_config *config;
> > +};
> > +
> > +static int submodule_config_path_cmp(const struct submodule_config_entry *a,
> > +				     const struct submodule_config_entry *b,
> > +				     const void *unused)
> > +{
> > +	int ret;
> > +	if ((ret = strcmp(a->config->path.buf, b->config->path.buf)))
> > +		return ret;
> 
> Style: avoid assignments in conditionals:
> 
> 	ret = strcmp(...);
> 	if (ret)
> 		return ret;
> 
> 	return hashcmp(...);
> 
> The actual return value from strcmp isn't important since
> hashmap_cmp_fn is only used to check for equality.

I know but since its so similar I used the return value anyway.

> Perhaps as a
> separate change it would make sense to make it a hashmap_equality_fn
> some day:
> 
> 	return !strcmp(...) && !hashcmp(...);
> 
> This checks both the path and gitmodule_sha1, so I guess it's for
> looking up submodule names.

Yes, will change it that way.

> [...]
> > +static int submodule_config_name_cmp(const struct submodule_config_entry *a,
> > +				     const struct submodule_config_entry *b,
> > +				     const void *unused)
> 
> Likewise.

Ok

> [...]
> > +void submodule_config_cache_init(struct submodule_config_cache *cache)
> > +{
> > +	hashmap_init(&cache->for_path, (hashmap_cmp_fn) submodule_config_path_cmp, 0);
> > +	hashmap_init(&cache->for_name, (hashmap_cmp_fn) submodule_config_name_cmp, 0);
> 
> Long lines.
> 
> I don't like the casts (they make it hard to catch if e.g. the number
> of arguments expected for a hashmap_cmp_fn changes) but it's the style
> used elsewhere.   Ok.

Me neither but we got to cast somewhere, either in the cmp functions or
here. I don't know what better, so I used the style suggested in the
documentation.

> [...]
> > +void submodule_config_cache_free(struct submodule_config_cache *cache)
> > +{
> > +	struct hashmap_iter iter;
> > +	struct submodule_config_entry *entry;
> > +
> > +	/*
> > +	 * NOTE: we iterate over the name hash here since
> > +	 * submodule_config entries are allocated by their gitmodule
> > +	 * sha1 and submodule name.
> > +	 */
> 
> No need to say 'NOTE' --- it's just as clear without that word that
> this is a comment.

Now looking at it from that perspective, you are right: Yes its a note
why does it also have to us that :-)

> Not sure I understand the comment.  Is the point that you only have
> to iterate over one of the hashmaps, or is there something more
> subtle going on?

No its only purpose is to explain why we only need to iterate over one
hashmap. In case thats clear to everyone I can also delete it.

> [...]
> > +	hashmap_free(&cache->for_path, 1);
> > +	hashmap_free(&cache->for_name, 1);
> 
> Not your fault: this contextless '1' is unhelpful.  hashmap_free could
> take a flag HASHMAP_FREE_ENTRIES.

We have a similar int in other places as well. See for example
string_list_clear().

> [...]
> > +static unsigned int hash_sha1_string(const unsigned char *sha1, const char *string)
> > +{
> > +	return memhash(sha1, 20) + strhash(string);
> > +}
> 
> Feels a bit unconventional.  I can't find a strong reason to mind.

Well I did not think much about this. I simply thought: hash -> kind of
random value. Adding the two together is as good as anything (even
overflow does not matter).

> When Boost's
> 
> 	size_t seed = 0;
> 	hash_combine(seed, x);
> 	hash_combine(seed, y);
> 
> does
> 
> 	hash = hash_value(x) + 0x9e3779b9
> 	hash ^= hash_value(y) + 0x9e3779b9 + (seed<<6) + (seed>>2);
> 
> I suspect some of the complication is because they expect to be mixing
> more values in a loop.

I am fine with a switch to something different. We could use classic XOR
in case that feels better.

> > +
> > +static void submodule_config_cache_update_path(struct submodule_config_cache *cache,
> > +		struct submodule_config *config)
> > +{
> > +	unsigned int hash = hash_sha1_string(config->gitmodule_sha1,
> > +					     config->path.buf);
> > +	struct submodule_config_entry *e = xmalloc(sizeof(*e));
> > +	hashmap_entry_init(e, hash);
> > +	e->config = config;
> > +	hashmap_put(&cache->for_path, e);
> > +}
> 
> Naming nit: maybe ..._put for consistency with the hashmap API?

Fine with me, will change.

> [...]
> > +static void submodule_config_cache_insert(struct submodule_config_cache *cache, struct submodule_config *config)
> 
> Likewise, maybe _add?

Thats fine by me as well. The _insert I got from string_list I think.

> [...]
> > +static struct submodule_config *submodule_config_cache_lookup_path(struct submodule_config_cache *cache,
> > +	const unsigned char *gitmodule_sha1, const char *path)
> > +{
> > +	struct submodule_config_entry *entry;
> > +	unsigned int hash = hash_sha1_string(gitmodule_sha1, path);
> > +	struct submodule_config_entry key;
> > +	struct submodule_config key_config;
> > +
> > +	hashcpy(key_config.gitmodule_sha1, gitmodule_sha1);
> > +	key_config.path.buf = (char *) path;
> 
> An odd use of strbuf.

Yeah, I first designed the struct submodule_config when the old hashmap
implementation was around. In the implementation I found that I have to
pass in the complete key somehow. When I ported to hashmap this was the
"hack" so we do not do an unnecessary copying of strings. But as Jeff
also pointed out that the entries in submodule_config can be simple
const char *'s I will change to that and here we can simply set the
pointer.

> My first reaction was "Shouldn't this use strbuf_addstr?".  That way
> it would be clear who has responsibility to free it and the alloc and
> len fields would be right.
> 
> But looking more closely, it seems that key_config.path.buf is just
> being used to transmit a \0-terminated string.  Would making struct
> submodule_config::path a plain "const char *" work?  Or perhaps struct
> submodule_config_entry should be a tagged union or similar to make
> both the variant that owns its name and path and the variant that
> borrows buffers possible and comparable.

Yep "const char *" it will be.

> [...]
> > +struct parse_submodule_config_parameter {
> > +	unsigned char *gitmodule_sha1;
> > +	struct submodule_config_cache *cache;
> > +};
> 
> What is this struct used for?  A comment or moving it closer to where
> it is used could help.

This is used to pass these parameters into the git_config_from_buf
infrastructure. Will move it directly above the callback where it is
used.

> [...]
> > +static void warn_multiple_config(struct submodule_config *config, const char *option)
> > +{
> > +	warning("%s:.gitmodules, multiple configurations found for submodule.%s.%s. "
> > +			"Skipping second one!", sha1_to_hex(config->gitmodule_sha1),
> > +			option, config->name.buf);
> 
> Ah, so gitmodule_sha1 is a commit id?

No, this output is a bug. gitmodule_sha1 is actually the sha1 of the
.gitmodule blob we read. Thanks for noticing will fix. Should I also add
a comment to the gitmodule_sha1 field to explain what it is?

> I think a name like "commit" for the field would have left me less
> confused (and a comment could explain that what the hash values store
> is information gleaned from .gitmodules in that commit).

I understand your confusion because of the misleading output. But with
the clarification does the name make sense now?

Thanks and Cheers Heiko

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Re: [PATCH] implement submodule config cache for lookup of submodule names
  2014-03-12 23:58   ` Heiko Voigt
@ 2014-03-13  0:46     ` Jonathan Nieder
  0 siblings, 0 replies; 6+ messages in thread
From: Jonathan Nieder @ 2014-03-13  0:46 UTC (permalink / raw)
  To: Heiko Voigt; +Cc: Junio C Hamano, git, Jens Lehmann, Jeff King

Heiko Voigt wrote:
> On Tue, Mar 11, 2014 at 07:28:52PM -0700, Jonathan Nieder wrote:
>> Heiko Voigt wrote:

>>> +static unsigned int hash_sha1_string(const unsigned char *sha1, const char *string)
>>> +{
>>> +	return memhash(sha1, 20) + strhash(string);
>>> +}
>>
>> Feels a bit unconventional.  I can't find a strong reason to mind.
>
> Well I did not think much about this. I simply thought: hash -> kind of
> random value. Adding the two together is as good as anything (even
> overflow does not matter).
[...]
> I am fine with a switch to something different. We could use classic XOR
> in case that feels better.

Either + or ^ is fine with me (yeah, '^' is what I expected so '+'
forced me to think for a few seconds).  I don't think we have to worry
much about hostile people making repos that force git to spend a long
time dealing with hash collisions, so anything more complicated is
probably overkill. :)

[...]
>> [...]
>>> +static void warn_multiple_config(struct submodule_config *config, const char *option)
>>> +{
>>> +	warning("%s:.gitmodules, multiple configurations found for submodule.%s.%s. "
>>> +			"Skipping second one!", sha1_to_hex(config->gitmodule_sha1),
>>> +			option, config->name.buf);
>>
>> Ah, so gitmodule_sha1 is a commit id?
>
> No, this output is a bug. gitmodule_sha1 is actually the sha1 of the
> .gitmodule blob we read. Thanks for noticing will fix. Should I also add
> a comment to the gitmodule_sha1 field to explain what it is?
[...]
>                                                                   with
> the clarification does the name make sense now?

Yep.  Suggested fixes:

 - call it gitmodules_sha1 instead of gitmodule_sha1 (it's the blob
   name for .gitmodules, not the name of a module)

 - add a comment where the field is declared (this would make it clear
   that it's a blob name instead of e.g. just the SHA-1 of the text)

Thanks for your thoughtfulness.
Jonathan

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-03-13  0:46 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-10 21:24 [PATCH] implement submodule config cache for lookup of submodule names Heiko Voigt
2014-03-11 21:58 ` Jeff King
2014-03-12 17:07   ` Heiko Voigt
2014-03-12  2:28 ` Jonathan Nieder
2014-03-12 23:58   ` Heiko Voigt
2014-03-13  0:46     ` Jonathan Nieder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).