git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v3 0/9] Rewrite packfile reuse code
@ 2019-11-15 14:15 Christian Couder
  2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

This patch series is rewriting the code that tries to reuse existing
packfiles.

The code in this patch series was written by GitHub, and Peff nicely
provided it in the following discussion:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

The first versions of this patch series were also discussed:

V2: https://public-inbox.org/git/20191019103531.23274-1-chriscool@tuxfamily.org/
V1: https://public-inbox.org/git/20190913130226.7449-1-chriscool@tuxfamily.org/

Thanks to the reviewers!

According to Peff this new code is a lot smarter than what it
replaces. It allows "holes" in the chunks of packfile to be reused,
and skips over them. It rewrites OFS_DELTA offsets as it goes to
account for the holes. So it's basically a linear walk over the
packfile, but with the important distinction that we don't add those
objects to the object_entry array, which makes them very lightweight
(especially in memory use, but they also aren't considered bases for
finding new deltas, etc). It seems like a good compromise between the
cost to serve a clone and the quality of the resulting packfile.

Changes since the previous patch series are the following:

  - Rebased onto current master (d9f6f3b619, The first batch post 2.24
    cycle, 2019-11-10)

  - Remove a paragraph in the commit message of patch 3/9 as suggested
    by Jonathan Tan.

  - Improve commit message of patch 9/9 as suggested by Jonathan Tan.

  - Renamed fields of struct reused_chunk in patch 9/9 as suggested by
    Jonathan Tan.

  - Added a few comments in patch 9/9 as suggested by Jonathan Tan.

It could be a good idea if Peff could answer some of the comments made
by Jonathan Tan about patch 9/9.

I have put Peff as the author of all the commits.

Jeff King (9):
  builtin/pack-objects: report reused packfile objects
  packfile: expose get_delta_base()
  ewah/bitmap: introduce bitmap_word_alloc()
  pack-bitmap: don't rely on bitmap_git->reuse_objects
  pack-bitmap: introduce bitmap_walk_contains()
  csum-file: introduce hashfile_total()
  pack-objects: introduce pack.allowPackReuse
  builtin/pack-objects: introduce obj_is_packed()
  pack-objects: improve partial packfile reuse

 Documentation/config/pack.txt |   4 +
 builtin/pack-objects.c        | 243 +++++++++++++++++++++++++++-------
 csum-file.h                   |   9 ++
 ewah/bitmap.c                 |  13 +-
 ewah/ewok.h                   |   1 +
 pack-bitmap.c                 | 178 ++++++++++++++++++-------
 pack-bitmap.h                 |   6 +-
 packfile.c                    |  10 +-
 packfile.h                    |   3 +
 9 files changed, 357 insertions(+), 110 deletions(-)

-- 
2.24.0-rc1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  6:24   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan, James Ramsay

From: Jeff King <peff@peff.net>

To see when packfile reuse kicks in or not, it is useful to
show reused packfile objects statistics in the output of
upload-pack.

Helped-by: James Ramsay <james@jramsay.com.au>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5876583220..f2c2703090 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3509,7 +3509,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (progress)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
-			     " reused %"PRIu32" (delta %"PRIu32")"),
-			   written, written_delta, reused, reused_delta);
+			     " reused %"PRIu32" (delta %"PRIu32"),"
+			     " pack-reused %"PRIu32),
+			   written, written_delta, reused, reused_delta,
+			   reuse_packfile_objects);
 	return 0;
 }
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 2/9] packfile: expose get_delta_base()
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
  2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  6:26   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

In a following commit get_delta_base() will be used outside
packfile.c, so let's make it non static and declare it in
packfile.h.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 packfile.c | 10 +++++-----
 packfile.h |  3 +++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 355066de17..81e66847bf 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1173,11 +1173,11 @@ const struct packed_git *has_packed_and_bad(struct repository *r,
 	return NULL;
 }
 
-static off_t get_delta_base(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t *curpos,
-				    enum object_type type,
-				    off_t delta_obj_offset)
+off_t get_delta_base(struct packed_git *p,
+		     struct pack_window **w_curs,
+		     off_t *curpos,
+		     enum object_type type,
+		     off_t delta_obj_offset)
 {
 	unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
 	off_t base_offset;
diff --git a/packfile.h b/packfile.h
index fc7904ec81..ec536a4ae5 100644
--- a/packfile.h
+++ b/packfile.h
@@ -151,6 +151,9 @@ void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum object
 unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
+off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
+		     off_t *curpos, enum object_type type,
+		     off_t delta_obj_offset);
 
 void release_pack_memory(size_t);
 
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
  2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
  2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  6:28   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

In a following commit we will need to allocate a variable
number of bitmap words, instead of always 32, so let's add
bitmap_word_alloc() for this purpose.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 ewah/bitmap.c | 13 +++++++++----
 ewah/ewok.h   |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 52f1178db4..b5fed9621f 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -22,21 +22,26 @@
 #define EWAH_MASK(x) ((eword_t)1 << (x % BITS_IN_EWORD))
 #define EWAH_BLOCK(x) (x / BITS_IN_EWORD)
 
-struct bitmap *bitmap_new(void)
+struct bitmap *bitmap_word_alloc(size_t word_alloc)
 {
 	struct bitmap *bitmap = xmalloc(sizeof(struct bitmap));
-	bitmap->words = xcalloc(32, sizeof(eword_t));
-	bitmap->word_alloc = 32;
+	bitmap->words = xcalloc(word_alloc, sizeof(eword_t));
+	bitmap->word_alloc = word_alloc;
 	return bitmap;
 }
 
+struct bitmap *bitmap_new(void)
+{
+	return bitmap_word_alloc(32);
+}
+
 void bitmap_set(struct bitmap *self, size_t pos)
 {
 	size_t block = EWAH_BLOCK(pos);
 
 	if (block >= self->word_alloc) {
 		size_t old_size = self->word_alloc;
-		self->word_alloc = block * 2;
+		self->word_alloc = block ? block * 2 : 1;
 		REALLOC_ARRAY(self->words, self->word_alloc);
 		memset(self->words + old_size, 0x0,
 			(self->word_alloc - old_size) * sizeof(eword_t));
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 84b2a29faa..1b98b57c8b 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -172,6 +172,7 @@ struct bitmap {
 };
 
 struct bitmap *bitmap_new(void);
+struct bitmap *bitmap_word_alloc(size_t word_alloc);
 void bitmap_set(struct bitmap *self, size_t pos);
 int bitmap_get(struct bitmap *self, size_t pos);
 void bitmap_reset(struct bitmap *self);
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (2 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  6:47   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will no longer compute bitmap_git->reuse_objects in a
following commit, so we cannot rely on it anymore to
terminate the loop early; we have to iterate to the end.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index e07c798879..016d0319fc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -622,7 +622,7 @@ static void show_objects_for_type(
 	enum object_type object_type,
 	show_reachable_fn show_reach)
 {
-	size_t pos = 0, i = 0;
+	size_t i = 0;
 	uint32_t offset;
 
 	struct ewah_iterator it;
@@ -630,13 +630,15 @@ static void show_objects_for_type(
 
 	struct bitmap *objects = bitmap_git->result;
 
-	if (bitmap_git->reuse_objects == bitmap_git->pack->num_objects)
-		return;
-
 	ewah_iterator_init(&it, type_filter);
 
-	while (i < objects->word_alloc && ewah_iterator_next(&filter, &it)) {
+	for (i = 0; i < objects->word_alloc &&
+			ewah_iterator_next(&filter, &it); i++) {
 		eword_t word = objects->words[i] & filter;
+		size_t pos = (i * BITS_IN_EWORD);
+
+		if (!word)
+			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			struct object_id oid;
@@ -648,9 +650,6 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			if (pos + offset < bitmap_git->reuse_objects)
-				continue;
-
 			entry = &bitmap_git->pack->revindex[pos + offset];
 			nth_packed_object_oid(&oid, bitmap_git->pack, entry->nr);
 
@@ -659,9 +658,6 @@ static void show_objects_for_type(
 
 			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
 		}
-
-		pos += BITS_IN_EWORD;
-		i++;
 	}
 }
 
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (3 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  7:06   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will use this helper function in a following commit to
tell us if an object is packed.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 12 ++++++++++++
 pack-bitmap.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 016d0319fc..8a51302a1a 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -826,6 +826,18 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	return 0;
 }
 
+int bitmap_walk_contains(struct bitmap_index *bitmap_git,
+			 struct bitmap *bitmap, const struct object_id *oid)
+{
+	int idx;
+
+	if (!bitmap)
+		return 0;
+
+	idx = bitmap_position(bitmap_git, oid);
+	return idx >= 0 && bitmap_get(bitmap, idx);
+}
+
 void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
 				 show_reachable_fn show_reachable)
 {
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 466c5afa09..6ab6033dbe 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -3,6 +3,7 @@
 
 #include "ewah/ewok.h"
 #include "khash.h"
+#include "pack.h"
 #include "pack-objects.h"
 
 struct commit;
@@ -53,6 +54,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
+int bitmap_walk_contains(struct bitmap_index *,
+			 struct bitmap *bitmap, const struct object_id *oid);
 
 /*
  * After a traversal has been performed by prepare_bitmap_walk(), this can be
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 6/9] csum-file: introduce hashfile_total()
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (4 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  7:07   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will need this helper function in a following commit
to give us total number of bytes fed to the hashfile so far.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 csum-file.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/csum-file.h b/csum-file.h
index a98b1eee53..f9cbd317fb 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -42,6 +42,15 @@ void hashflush(struct hashfile *f);
 void crc32_begin(struct hashfile *);
 uint32_t crc32_end(struct hashfile *);
 
+/*
+ * Returns the total number of bytes fed to the hashfile so far (including ones
+ * that have not been written out to the descriptor yet).
+ */
+static inline off_t hashfile_total(struct hashfile *f)
+{
+	return f->total + f->offset;
+}
+
 static inline void hashwrite_u8(struct hashfile *f, uint8_t data)
 {
 	hashwrite(f, &data, sizeof(data));
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (5 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  7:14   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

Let's make it possible to configure if we want pack reuse or not.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/pack.txt | 4 ++++
 builtin/pack-objects.c        | 8 +++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 1d66f0c992..58323a351f 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -27,6 +27,10 @@ Note that changing the compression level will not automatically recompress
 all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
+pack.allowPackReuse::
+	When true, which is the default, Git will try to reuse parts
+	of existing packfiles when preparing new packfiles.
+
 pack.island::
 	An extended regular expression configuring a set of delta
 	islands. See "DELTA ISLANDS" in linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f2c2703090..4fcfcf6097 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -96,6 +96,7 @@ static off_t reuse_packfile_offset;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
+static int allow_pack_reuse = 1;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -2699,6 +2700,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.allowpackreuse")) {
+		allow_pack_reuse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
@@ -3030,7 +3035,8 @@ static void loosen_unused_packed_objects(void)
  */
 static int pack_options_allow_reuse(void)
 {
-	return pack_to_stdout &&
+	return allow_pack_reuse &&
+	       pack_to_stdout &&
 	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed()
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (6 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  7:14   ` Jeff King
  2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
  2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

Let's refactor the way we check if an object is packed by
introducing obj_is_packed(). This function is now a simple
wrapper around packlist_find(), but it will evolve in a
following commit.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4fcfcf6097..08898331ef 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2553,6 +2553,11 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 	free(p);
 }
 
+static int obj_is_packed(const struct object_id *oid)
+{
+	return !!packlist_find(&to_pack, oid);
+}
+
 static void add_tag_chain(const struct object_id *oid)
 {
 	struct tag *tag;
@@ -2564,7 +2569,7 @@ static void add_tag_chain(const struct object_id *oid)
 	 * it was included via bitmaps, we would not have parsed it
 	 * previously).
 	 */
-	if (packlist_find(&to_pack, oid))
+	if (obj_is_packed(oid))
 		return;
 
 	tag = lookup_tag(the_repository, oid);
@@ -2588,7 +2593,7 @@ static int add_ref_tag(const char *path, const struct object_id *oid, int flag,
 
 	if (starts_with(path, "refs/tags/") && /* is a tag? */
 	    !peel_ref(path, &peeled)    && /* peelable? */
-	    packlist_find(&to_pack, &peeled))      /* object packed? */
+	    obj_is_packed(&peeled)) /* object packed? */
 		add_tag_chain(oid);
 	return 0;
 }
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 9/9] pack-objects: improve partial packfile reuse
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (7 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
@ 2019-11-15 14:15 ` Christian Couder
  2019-12-09  8:11   ` Jeff King
  2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
  9 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-11-15 14:15 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

The old code to reuse deltas from an existing packfile
just tried to dump a whole segment of the pack verbatim.
That's faster than the traditional way of actually adding
objects to the packing list, but it didn't kick in very
often. This new code is really going for a middle ground:
do _some_ per-object work, but way less than we'd
traditionally do.

For instance, packing torvalds/linux on GitHub servers
just now reused 6.5M objects, but only needed ~50k
chunks.

To implement this, we store the chunks of the packfile
that we reuse in a dynamic array of `struct reused_chunk`,
and we use a reuse_packfile_bitmap to speed up reusing
parts of packfiles.

The dynamic array of `struct reused_chunk` is useful
because we need to know the accumulated offset due to
missing objects. So without the array we'd end up having
to walk over the revindex for that set of objects. The
array is basically caching those accumulated offsets
(for the parts we _do_ include), so we don't have to
compute them repeatedly.

Additional checks are added in have_duplicate_entry()
and obj_is_packed() to avoid duplicate objects in the
reuse bitmap. It was probably buggy to not have such a
check before.

If a client both asks for a tag by sha1 and specifies
"include-tag", we may end up including the tag in the reuse
bitmap (due to the first thing), and then later adding it to
the packlist (due to the second). This results in duplicate
objects in the pack, which git chokes on. We should notice
that we are already including it when doing the include-tag
portion, and avoid adding it to the packlist.

The simplest place to fix this is right in add_ref_tag,
where we could avoid peeling the tag at all if we know that
we are already including it. However, I've pushed the check
instead into have_duplicate_entry(). This fixes not only
this case, but also means that we cannot have any similar
problems lurking in other code.

No tests, because git does not actually exhibit this "ask
for it and also include-tag" behavior. We do one or the
other on clone, depending on whether --single-branch is set.
However, libgit2 does both.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
 pack-bitmap.c          | 150 ++++++++++++++++++++--------
 pack-bitmap.h          |   3 +-
 3 files changed, 288 insertions(+), 87 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08898331ef..64ab033923 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -92,7 +92,7 @@ static struct progress *progress_state;
 
 static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
-static off_t reuse_packfile_offset;
+static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
@@ -785,57 +785,185 @@ static struct object_entry **compute_write_order(void)
 	return wo;
 }
 
-static off_t write_reused_pack(struct hashfile *f)
+
+/*
+ * A reused set of objects. All objects in a chunk have the same
+ * relative position in the original packfile and the generated
+ * packfile.
+ */
+
+static struct reused_chunk {
+	/* The offset of the first object of this chunk in the original
+	 * packfile. */
+	off_t original;
+	/* The offset of the first object of this chunk in the generated
+	 * packfile minus "original". */
+	off_t difference;
+} *reused_chunks;
+static int reused_chunks_nr;
+static int reused_chunks_alloc;
+
+static void record_reused_object(off_t where, off_t offset)
 {
-	unsigned char buffer[8192];
-	off_t to_write, total;
-	int fd;
+	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].difference == offset)
+		return;
 
-	if (!is_pack_valid(reuse_packfile))
-		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
+	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
+		   reused_chunks_alloc);
+	reused_chunks[reused_chunks_nr].original = where;
+	reused_chunks[reused_chunks_nr].difference = offset;
+	reused_chunks_nr++;
+}
 
-	fd = git_open(reuse_packfile->pack_name);
-	if (fd < 0)
-		die_errno(_("unable to open packfile for reuse: %s"),
-			  reuse_packfile->pack_name);
+/*
+ * Binary search to find the chunk that "where" is in. Note
+ * that we're not looking for an exact match, just the first
+ * chunk that contains it (which implicitly ends at the start
+ * of the next chunk.
+ */
+static off_t find_reused_offset(off_t where)
+{
+	int lo = 0, hi = reused_chunks_nr;
+	while (lo < hi) {
+		int mi = lo + ((hi - lo) / 2);
+		if (where == reused_chunks[mi].original)
+			return reused_chunks[mi].difference;
+		if (where < reused_chunks[mi].original)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
 
-	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
-		die_errno(_("unable to seek in reused packfile"));
+	/*
+	 * The first chunk starts at zero, so we can't have gone below
+	 * there.
+	 */
+	assert(lo);
+	return reused_chunks[lo-1].difference;
+}
+
+static void write_reused_pack_one(size_t pos, struct hashfile *out,
+				  struct pack_window **w_curs)
+{
+	off_t offset, next, cur;
+	enum object_type type;
+	unsigned long size;
 
-	if (reuse_packfile_offset < 0)
-		reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
+	offset = reuse_packfile->revindex[pos].offset;
+	next = reuse_packfile->revindex[pos + 1].offset;
 
-	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
+	record_reused_object(offset, offset - hashfile_total(out));
 
-	while (to_write) {
-		int read_pack = xread(fd, buffer, sizeof(buffer));
+	cur = offset;
+	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
+	assert(type >= 0);
 
-		if (read_pack <= 0)
-			die_errno(_("unable to read from reused packfile"));
+	if (type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		off_t fixup;
+
+		unsigned char header[MAX_PACK_OBJECT_HEADER];
+		unsigned len;
+
+		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
+		assert(base_offset != 0);
+
+		/* Convert to REF_DELTA if we must... */
+		if (!allow_ofs_delta) {
+			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			const unsigned char *base_sha1 =
+				nth_packed_object_sha1(reuse_packfile,
+						       reuse_packfile->revindex[base_pos].nr);
+
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_REF_DELTA, size);
+			hashwrite(out, header, len);
+			hashwrite(out, base_sha1, 20);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
 
-		if (read_pack > to_write)
-			read_pack = to_write;
+		/* Otherwise see if we need to rewrite the offset... */
+		fixup = find_reused_offset(offset) -
+			find_reused_offset(base_offset);
+		if (fixup) {
+			unsigned char ofs_header[10];
+			unsigned i, ofs_len;
+			off_t ofs = offset - base_offset - fixup;
 
-		hashwrite(f, buffer, read_pack);
-		to_write -= read_pack;
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_OFS_DELTA, size);
+
+			i = sizeof(ofs_header) - 1;
+			ofs_header[i] = ofs & 127;
+			while (ofs >>= 7)
+				ofs_header[--i] = 128 | (--ofs & 127);
+
+			ofs_len = sizeof(ofs_header) - i;
+
+			hashwrite(out, header, len);
+			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
+
+		/* ...otherwise we have no fixup, and can write it verbatim */
+	}
+
+	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
+}
+
+static size_t write_reused_pack_verbatim(struct hashfile *out,
+					 struct pack_window **w_curs)
+{
+	size_t pos = 0;
+
+	while (pos < reuse_packfile_bitmap->word_alloc &&
+			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+		pos++;
+
+	if (pos) {
+		off_t to_write;
+
+		written = (pos * BITS_IN_EWORD);
+		to_write = reuse_packfile->revindex[written].offset
+			- sizeof(struct pack_header);
+
+		/* We're recording one chunk, not one object. */
+		record_reused_object(sizeof(struct pack_header), 0);
+		hashflush(out);
+		copy_pack_data(out, reuse_packfile, w_curs,
+			sizeof(struct pack_header), to_write);
 
-		/*
-		 * We don't know the actual number of objects written,
-		 * only how many bytes written, how many bytes total, and
-		 * how many objects total. So we can fake it by pretending all
-		 * objects we are writing are the same size. This gives us a
-		 * smooth progress meter, and at the end it matches the true
-		 * answer.
-		 */
-		written = reuse_packfile_objects *
-				(((double)(total - to_write)) / total);
 		display_progress(progress_state, written);
 	}
+	return pos;
+}
 
-	close(fd);
-	written = reuse_packfile_objects;
-	display_progress(progress_state, written);
-	return reuse_packfile_offset - sizeof(struct pack_header);
+static void write_reused_pack(struct hashfile *f)
+{
+	size_t i = 0;
+	uint32_t offset;
+	struct pack_window *w_curs = NULL;
+
+	if (allow_ofs_delta)
+		i = write_reused_pack_verbatim(f, &w_curs);
+
+	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
+		eword_t word = reuse_packfile_bitmap->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
+
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			write_reused_pack_one(pos + offset, f, &w_curs);
+			display_progress(progress_state, ++written);
+		}
+	}
+
+	unuse_pack(&w_curs);
 }
 
 static const char no_split_warning[] = N_(
@@ -868,11 +996,9 @@ static void write_pack_file(void)
 		offset = write_pack_header(f, nr_remaining);
 
 		if (reuse_packfile) {
-			off_t packfile_size;
 			assert(pack_to_stdout);
-
-			packfile_size = write_reused_pack(f);
-			offset += packfile_size;
+			write_reused_pack(f);
+			offset = hashfile_total(f);
 		}
 
 		nr_written = 0;
@@ -1001,6 +1127,10 @@ static int have_duplicate_entry(const struct object_id *oid,
 {
 	struct object_entry *entry;
 
+	if (reuse_packfile_bitmap &&
+	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
+		return 1;
+
 	entry = packlist_find(&to_pack, oid);
 	if (!entry)
 		return 0;
@@ -2555,7 +2685,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 
 static int obj_is_packed(const struct object_id *oid)
 {
-	return !!packlist_find(&to_pack, oid);
+	return packlist_find(&to_pack, oid) ||
+		(reuse_packfile_bitmap &&
+		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
 }
 
 static void add_tag_chain(const struct object_id *oid)
@@ -2661,6 +2793,7 @@ static void prepare_pack(int window, int depth)
 
 	if (nr_deltas && n > 1) {
 		unsigned nr_done = 0;
+
 		if (progress)
 			progress_state = start_progress(_("Compressing objects"),
 							nr_deltas);
@@ -3042,7 +3175,6 @@ static int pack_options_allow_reuse(void)
 {
 	return allow_pack_reuse &&
 	       pack_to_stdout &&
-	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
 	       (!local || !have_non_local_packs) &&
@@ -3059,7 +3191,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 			bitmap_git,
 			&reuse_packfile,
 			&reuse_packfile_objects,
-			&reuse_packfile_offset)) {
+			&reuse_packfile_bitmap)) {
 		assert(reuse_packfile_objects);
 		nr_result += reuse_packfile_objects;
 		display_progress(progress_state, nr_result);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8a51302a1a..cbfc544411 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -326,6 +326,13 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 	munmap(bitmap_git->map, bitmap_git->map_size);
 	bitmap_git->map = NULL;
 	bitmap_git->map_size = 0;
+
+	kh_destroy_oid_map(bitmap_git->bitmaps);
+	bitmap_git->bitmaps = NULL;
+
+	kh_destroy_oid_pos(bitmap_git->ext_index.positions);
+	bitmap_git->ext_index.positions = NULL;
+
 	return -1;
 }
 
@@ -764,65 +771,126 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 	return NULL;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       off_t *up_to)
+static void try_partial_reuse(struct bitmap_index *bitmap_git,
+			      size_t pos,
+			      struct bitmap *reuse,
+			      struct pack_window **w_curs)
 {
+	struct revindex_entry *revidx;
+	off_t offset;
+	enum object_type type;
+	unsigned long size;
+
+	if (pos >= bitmap_git->pack->num_objects)
+		return; /* not actually in the pack */
+
+	revidx = &bitmap_git->pack->revindex[pos];
+	offset = revidx->offset;
+	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (type < 0)
+		return; /* broken packfile, punt */
+
+	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		int base_pos;
+
+		/*
+		 * Find the position of the base object so we can look it up
+		 * in our bitmaps. If we can't come up with an offset, or if
+		 * that offset is not in the revidx, the pack is corrupt.
+		 * There's nothing we can do, so just punt on this object,
+		 * and the normal slow path will complain about it in
+		 * more detail.
+		 */
+		base_offset = get_delta_base(bitmap_git->pack, w_curs,
+					     &offset, type, revidx->offset);
+		if (!base_offset)
+			return;
+		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
+		if (base_pos < 0)
+			return;
+
+		/*
+		 * We assume delta dependencies always point backwards. This
+		 * lets us do a single pass, and is basically always true
+		 * due to the way OFS_DELTAs work. You would not typically
+		 * find REF_DELTA in a bitmapped pack, since we only bitmap
+		 * packs we write fresh, and OFS_DELTA is the default). But
+		 * let's double check to make sure the pack wasn't written with
+		 * odd parameters.
+		 */
+		if (base_pos >= pos)
+			return;
+
+		/*
+		 * And finally, if we're not sending the base as part of our
+		 * reuse chunk, then don't send this object either. The base
+		 * would come after us, along with other objects not
+		 * necessarily in the pack, which means we'd need to convert
+		 * to REF_DELTA on the fly. Better to just let the normal
+		 * object_entry code path handle it.
+		 */
+		if (!bitmap_get(reuse, base_pos))
+			return;
+	}
+
 	/*
-	 * Reuse the packfile content if we need more than
-	 * 90% of its objects
+	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	static const double REUSE_PERCENT = 0.9;
+	bitmap_set(reuse, pos);
+}
 
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
 	struct bitmap *result = bitmap_git->result;
-	uint32_t reuse_threshold;
-	uint32_t i, reuse_objects = 0;
+	struct bitmap *reuse;
+	struct pack_window *w_curs = NULL;
+	size_t i = 0;
+	uint32_t offset;
 
 	assert(result);
 
-	for (i = 0; i < result->word_alloc; ++i) {
-		if (result->words[i] != (eword_t)~0) {
-			reuse_objects += ewah_bit_ctz64(~result->words[i]);
-			break;
-		}
-
-		reuse_objects += BITS_IN_EWORD;
-	}
+	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
+		i++;
 
-#ifdef GIT_BITMAP_DEBUG
-	{
-		const unsigned char *sha1;
-		struct revindex_entry *entry;
+	/* Don't mark objects not in the packfile */
+	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
+		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
 
-		entry = &bitmap_git->reverse_index->revindex[reuse_objects];
-		sha1 = nth_packed_object_sha1(bitmap_git->pack, entry->nr);
+	reuse = bitmap_word_alloc(i);
+	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
-		fprintf(stderr, "Failed to reuse at %d (%016llx)\n",
-			reuse_objects, result->words[i]);
-		fprintf(stderr, " %s\n", hash_to_hex(sha1));
-	}
-#endif
+	for (; i < result->word_alloc; ++i) {
+		eword_t word = result->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
 
-	if (!reuse_objects)
-		return -1;
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
 
-	if (reuse_objects >= bitmap_git->pack->num_objects) {
-		bitmap_git->reuse_objects = *entries = bitmap_git->pack->num_objects;
-		*up_to = -1; /* reuse the full pack */
-		*packfile = bitmap_git->pack;
-		return 0;
+			offset += ewah_bit_ctz64(word >> offset);
+			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+		}
 	}
 
-	reuse_threshold = bitmap_popcount(bitmap_git->result) * REUSE_PERCENT;
+	unuse_pack(&w_curs);
 
-	if (reuse_objects < reuse_threshold)
+	*entries = bitmap_popcount(reuse);
+	if (!*entries) {
+		bitmap_free(reuse);
 		return -1;
+	}
 
-	bitmap_git->reuse_objects = *entries = reuse_objects;
-	*up_to = bitmap_git->pack->revindex[reuse_objects].offset;
-	*packfile = bitmap_git->pack;
-
+	/*
+	 * Drop any reused objects from the result, since they will not
+	 * need to be handled separately.
+	 */
+	bitmap_and_not(result, reuse);
+	*packfile_out = bitmap_git->pack;
+	*reuse_out = reuse;
 	return 0;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 6ab6033dbe..bcd03b8993 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ void test_bitmap_walk(struct rev_info *revs);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
-				       uint32_t *entries, off_t *up_to);
+				       uint32_t *entries,
+				       struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.24.0-rc1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
                   ` (8 preceding siblings ...)
  2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
@ 2019-11-15 18:03 ` Jonathan Tan
  2019-11-25  6:30   ` Junio C Hamano
  9 siblings, 1 reply; 42+ messages in thread
From: Jonathan Tan @ 2019-11-15 18:03 UTC (permalink / raw)
  To: christian.couder; +Cc: git, gitster, peff, chriscool, ramsay, jonathantanmy

> It could be a good idea if Peff could answer some of the comments made
> by Jonathan Tan about patch 9/9.
> 
> I have put Peff as the author of all the commits.

Thanks. I think the series looks mostly good except for the questions I
raised in patch 9/9, so I'll wait for Peff to respond too.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
@ 2019-11-25  6:30   ` Junio C Hamano
  2019-11-25  6:36     ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2019-11-25  6:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: christian.couder, git, peff, chriscool, ramsay

Jonathan Tan <jonathantanmy@google.com> writes:

>> It could be a good idea if Peff could answer some of the comments made
>> by Jonathan Tan about patch 9/9.
>> 
>> I have put Peff as the author of all the commits.
>
> Thanks. I think the series looks mostly good except for the questions I
> raised in patch 9/9, so I'll wait for Peff to respond too.

Hmph, the round before this one has been in 'next' for quite a
while, so should I eject it before waiting for Peff to respond
before queuing this one?

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-11-25  6:30   ` Junio C Hamano
@ 2019-11-25  6:36     ` Junio C Hamano
  2019-12-06 21:42       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2019-11-25  6:36 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: christian.couder, git, peff, chriscool, ramsay

Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>>> It could be a good idea if Peff could answer some of the comments made
>>> by Jonathan Tan about patch 9/9.
>>> 
>>> I have put Peff as the author of all the commits.
>>
>> Thanks. I think the series looks mostly good except for the questions I
>> raised in patch 9/9, so I'll wait for Peff to respond too.
>
> Hmph, the round before this one has been in 'next' for quite a
> while, so should I eject it before waiting for Peff to respond
> before queuing this one?

After rebasing these v3 patches on top of the base of the one in
'next', the only difference seems to be the log message of 3/9 and
the contents of 9/9.  I guess I'll mark the topic as "on hold" for
now before doing anything, as I am officially taking a time-off most
of this week ;-)

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-11-25  6:36     ` Junio C Hamano
@ 2019-12-06 21:42       ` Junio C Hamano
  2019-12-07 10:12         ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2019-12-06 21:42 UTC (permalink / raw)
  To: peff; +Cc: Jonathan Tan, christian.couder, git, chriscool, ramsay

Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>>> It could be a good idea if Peff could answer some of the comments made
>>>> by Jonathan Tan about patch 9/9.
>>>> 
>>>> I have put Peff as the author of all the commits.
>>>
>>> Thanks. I think the series looks mostly good except for the questions I
>>> raised in patch 9/9, so I'll wait for Peff to respond too.
>>
>> Hmph, the round before this one has been in 'next' for quite a
>> while, so should I eject it before waiting for Peff to respond
>> before queuing this one?
>
> After rebasing these v3 patches on top of the base of the one in
> 'next', the only difference seems to be the log message of 3/9 and
> the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> now before doing anything, as I am officially taking a time-off most
> of this week ;-)

So..., that week has passed---anything new?

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-06 21:42       ` Junio C Hamano
@ 2019-12-07 10:12         ` Christian Couder
  2019-12-07 20:47           ` Johannes Schindelin
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-12-07 10:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Jonathan Tan, git, Christian Couder, Ramsay Jones

On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> >> Jonathan Tan <jonathantanmy@google.com> writes:
> >>
> >>>> It could be a good idea if Peff could answer some of the comments made
> >>>> by Jonathan Tan about patch 9/9.
> >>>>
> >>>> I have put Peff as the author of all the commits.
> >>>
> >>> Thanks. I think the series looks mostly good except for the questions I
> >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> >>
> >> Hmph, the round before this one has been in 'next' for quite a
> >> while, so should I eject it before waiting for Peff to respond
> >> before queuing this one?
> >
> > After rebasing these v3 patches on top of the base of the one in
> > 'next', the only difference seems to be the log message of 3/9 and
> > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > now before doing anything, as I am officially taking a time-off most
> > of this week ;-)
>
> So..., that week has passed---anything new?

Unfortunately, no.

If you want I can send an incremental change on the content of 9/9 on
top of what's in next. Otherwise I can't see what I could do on this.

Peff, could you tell us if you might have time to take a look at this soon?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-07 10:12         ` Christian Couder
@ 2019-12-07 20:47           ` Johannes Schindelin
  2019-12-08  7:53             ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Johannes Schindelin @ 2019-12-07 20:47 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, Jeff King, Jonathan Tan, git, Christian Couder,
	Ramsay Jones

Hi Chris,

On Sat, 7 Dec 2019, Christian Couder wrote:

> On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > Junio C Hamano <gitster@pobox.com> writes:
> > >
> > >> Jonathan Tan <jonathantanmy@google.com> writes:
> > >>
> > >>>> It could be a good idea if Peff could answer some of the comments made
> > >>>> by Jonathan Tan about patch 9/9.
> > >>>>
> > >>>> I have put Peff as the author of all the commits.
> > >>>
> > >>> Thanks. I think the series looks mostly good except for the questions I
> > >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> > >>
> > >> Hmph, the round before this one has been in 'next' for quite a
> > >> while, so should I eject it before waiting for Peff to respond
> > >> before queuing this one?
> > >
> > > After rebasing these v3 patches on top of the base of the one in
> > > 'next', the only difference seems to be the log message of 3/9 and
> > > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > > now before doing anything, as I am officially taking a time-off most
> > > of this week ;-)
> >
> > So..., that week has passed---anything new?
>
> Unfortunately, no.
>
> If you want I can send an incremental change on the content of 9/9 on
> top of what's in next. Otherwise I can't see what I could do on this.
>
> Peff, could you tell us if you might have time to take a look at this soon?

Chris, correct me if I am wrong, but was it not your decision to
contribute these patches? Are you saying that you do not understand them
well enough to drive this patch series forward (e.g. address all reviews
and questions) and are basically trying to force Peff to contribute them
instead?

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-07 20:47           ` Johannes Schindelin
@ 2019-12-08  7:53             ` Christian Couder
  2019-12-08  8:54               ` Johannes Schindelin
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-12-08  7:53 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Jeff King, Jonathan Tan, git, Christian Couder,
	Ramsay Jones

Hi Johannes,

On Sat, Dec 7, 2019 at 9:47 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Sat, 7 Dec 2019, Christian Couder wrote:
>
> > On Fri, Dec 6, 2019 at 10:42 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Junio C Hamano <gitster@pobox.com> writes:
> > >
> > > > Junio C Hamano <gitster@pobox.com> writes:
> > > >
> > > >> Jonathan Tan <jonathantanmy@google.com> writes:
> > > >>
> > > >>>> It could be a good idea if Peff could answer some of the comments made
> > > >>>> by Jonathan Tan about patch 9/9.
> > > >>>>
> > > >>>> I have put Peff as the author of all the commits.
> > > >>>
> > > >>> Thanks. I think the series looks mostly good except for the questions I
> > > >>> raised in patch 9/9, so I'll wait for Peff to respond too.
> > > >>
> > > >> Hmph, the round before this one has been in 'next' for quite a
> > > >> while, so should I eject it before waiting for Peff to respond
> > > >> before queuing this one?
> > > >
> > > > After rebasing these v3 patches on top of the base of the one in
> > > > 'next', the only difference seems to be the log message of 3/9 and
> > > > the contents of 9/9.  I guess I'll mark the topic as "on hold" for
> > > > now before doing anything, as I am officially taking a time-off most
> > > > of this week ;-)
> > >
> > > So..., that week has passed---anything new?
> >
> > Unfortunately, no.
> >
> > If you want I can send an incremental change on the content of 9/9 on
> > top of what's in next. Otherwise I can't see what I could do on this.
> >
> > Peff, could you tell us if you might have time to take a look at this soon?
>
> Chris, correct me if I am wrong, but was it not your decision to
> contribute these patches?

Please take a look at:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

and Peff's response to James Ramsay's email.

Peff wrote:

> It's been on my todo list to upstream for a while, but I've dragged my
> feet on it because there's a lot of cleanup/polishing from the original
> patches (they were never very clean in the first place, and we've merged
> a dozen or more times with upstream since then, so the updates are
> spread across a bunch of merge commits).

and then:

> Yeah, I think we should work on getting our changes (including those
> stats) into upstream.

So actually I thought that I was helping Peff on this, though I know
of course that it's also helping GitLab and everyone else. That's why
I put Peff as the author of the patches.

> Are you saying that you do not understand them
> well enough to drive this patch series forward (e.g. address all reviews
> and questions) and are basically trying to force Peff to contribute them
> instead?

Yeah, I don't understand them well enough to answer Jonathan Tan's questions.

But no I am not trying to force Peff. I am trying to work with him.
When he said he thought we should work on getting the change into
upstream, I just thought he meant it and would be willing to help.





> Ciao,
> Johannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-08  7:53             ` Christian Couder
@ 2019-12-08  8:54               ` Johannes Schindelin
  2019-12-08 10:26                 ` Christian Couder
  2019-12-09  6:18                 ` Jeff King
  0 siblings, 2 replies; 42+ messages in thread
From: Johannes Schindelin @ 2019-12-08  8:54 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, Jeff King, Jonathan Tan, git, Christian Couder,
	Ramsay Jones

Hi Chris,

On Sun, 8 Dec 2019, Christian Couder wrote:

> Peff wrote:
>
> > It's been on my todo list to upstream for a while, but I've dragged my
> > feet on it because there's a lot of cleanup/polishing from the original
> > patches (they were never very clean in the first place, and we've merged
> > a dozen or more times with upstream since then, so the updates are
> > spread across a bunch of merge commits).
>
> and then:
>
> > Yeah, I think we should work on getting our changes (including those
> > stats) into upstream.
>
> So actually I thought that I was helping Peff on this, though I know
> of course that it's also helping GitLab and everyone else.

In my experience, sending the initial set of patches is the easiest part
of contributing patches, by far. The most involved part of the process is
to react to reviewer comments and to prepare new iterations.

You can see this challenge in action in all the Git for Windows
patches/patch series I am "upstreaming".

So actually I think that you are doing a disservice to Peff: if he had
time for that tedious part of the patch contribution, I am sure it would
have been no burden at all to send the initial set of patches.

> That's why I put Peff as the author of the patches.

No, that is not the reason. You might think that that is the reason, but
the real reason why Peff is marked as the author of those patches is that
he really authored those patches.

In light of what you said, I don't think that it is a good idea to go
forward by leaning even further on Peff. From his activity on the Git
mailing list, I deduce that he is not exactly in need of even more work.

Instead, I think that if you truly want to push these patches forward, you
will have to dig deeper yourself, and answer Jonathan Tan's questions, and
possibly adjust the patches accordingly and send a new iteration.

I perceive it as very unfair toward Peff that this has not yet happened.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-08  8:54               ` Johannes Schindelin
@ 2019-12-08 10:26                 ` Christian Couder
  2019-12-08 10:45                   ` Johannes Schindelin
  2019-12-09  6:18                 ` Jeff King
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-12-08 10:26 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Jeff King, Jonathan Tan, git, Christian Couder,
	Ramsay Jones

Hi Dscho,

On Sun, Dec 8, 2019 at 9:54 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> On Sun, 8 Dec 2019, Christian Couder wrote:
>
> > Peff wrote:
> >
> > > It's been on my todo list to upstream for a while, but I've dragged my
> > > feet on it because there's a lot of cleanup/polishing from the original
> > > patches (they were never very clean in the first place, and we've merged
> > > a dozen or more times with upstream since then, so the updates are
> > > spread across a bunch of merge commits).
> >
> > and then:
> >
> > > Yeah, I think we should work on getting our changes (including those
> > > stats) into upstream.
> >
> > So actually I thought that I was helping Peff on this, though I know
> > of course that it's also helping GitLab and everyone else.
>
> In my experience, sending the initial set of patches is the easiest part
> of contributing patches, by far. The most involved part of the process is
> to react to reviewer comments and to prepare new iterations.
>
> You can see this challenge in action in all the Git for Windows
> patches/patch series I am "upstreaming".
>
> So actually I think that you are doing a disservice to Peff: if he had
> time for that tedious part of the patch contribution, I am sure it would
> have been no burden at all to send the initial set of patches.

I think Peff can say by himself what he thinks about my work when I
rework the raw patches he sends to help get them upstreamed.

It's not the first time that I have done that and every time I have
done it, I think he has found it useful. Even this time he also wrote
that my work has been useful.

> > That's why I put Peff as the author of the patches.
>
> No, that is not the reason. You might think that that is the reason, but
> the real reason why Peff is marked as the author of those patches is that
> he really authored those patches.

That doesn't contradict at all what I am saying. I am saying that I
kept Peff as the author because I am just helping him, which means
that I actually acknowledge that he really authored those patches, no?

> In light of what you said, I don't think that it is a good idea to go
> forward by leaning even further on Peff. From his activity on the Git
> mailing list, I deduce that he is not exactly in need of even more work.

I think it's ok to ping people, even many times, when they have said
that they want to work on something but for some reason don't do it.
That's what Junio did by the way too. Junio just pinged everyone
involved, and then I pinged Peff specifically as I think the part of
the work left is more his than mine.

If Peff had said that he doesn't want to work on this any more then I
wouldn't ping him, and I would perhaps try something else, like just
ask for only the first patch (1/9) to be merged.

> Instead, I think that if you truly want to push these patches forward, you
> will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> possibly adjust the patches accordingly and send a new iteration.

I think that it's ok to ping Peff until he says that he doesn't or
cannot for some reason work on this anymore. This shouldn't be a big
burden for him to say that, no?

> I perceive it as very unfair toward Peff that this has not yet happened.

I perceive it as unfair to me that you think that I have to do a lot
of work on this when Peff hasn't even said that he doesn't want to, or
cannot, answer Jonathan's question.

Best,
Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-08 10:26                 ` Christian Couder
@ 2019-12-08 10:45                   ` Johannes Schindelin
  0 siblings, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2019-12-08 10:45 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, Jeff King, Jonathan Tan, git, Christian Couder,
	Ramsay Jones

Hi Chris,

On Sun, 8 Dec 2019, Christian Couder wrote:

> I perceive it as unfair to me that you think that I have to do a lot
> of work on this when Peff hasn't even said that he doesn't want to, or
> cannot, answer Jonathan's question.

Well, you have time enough to send lengthy replies on a Sunday morning
(while Peff apparently did not even have time to say that he lacks the
time to work on this).

So there,
Johannes

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-08  8:54               ` Johannes Schindelin
  2019-12-08 10:26                 ` Christian Couder
@ 2019-12-09  6:18                 ` Jeff King
  2019-12-09  9:28                   ` Johannes Schindelin
                                     ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Jeff King @ 2019-12-09  6:18 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Christian Couder, Junio C Hamano, Jonathan Tan, git,
	Christian Couder, Ramsay Jones

On Sun, Dec 08, 2019 at 09:54:01AM +0100, Johannes Schindelin wrote:

> > That's why I put Peff as the author of the patches.
> 
> No, that is not the reason. You might think that that is the reason, but
> the real reason why Peff is marked as the author of those patches is that
> he really authored those patches.
> 
> In light of what you said, I don't think that it is a good idea to go
> forward by leaning even further on Peff. From his activity on the Git
> mailing list, I deduce that he is not exactly in need of even more work.
> 
> Instead, I think that if you truly want to push these patches forward, you
> will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> possibly adjust the patches accordingly and send a new iteration.
> 
> I perceive it as very unfair toward Peff that this has not yet happened.

To be clear, I am not bothered by this. And in fact I feel bad that I
promised Christian that I take a careful look at the patches again, but
haven't gotten around to it (for an embarrassingly long time now).

Now I would _love_ if somebody else dug into the topic enough to
understand all of the ins and outs, and whether what they're doing is
sane (or could be done better). But barring that, these patches have
been battle-tested for many years on GitHub's servers, so even if we
just take them as-is I hope it would be an improvement.

Fortunately I have some other work to do that I would like very much to
procrastinate on, so let me see if that can summon the willpower for me
to review these.

> Well, you have time enough to send lengthy replies on a Sunday morning
> (while Peff apparently did not even have time to say that he lacks the
> time to work on this).

One tricky thing here is that I leave messages or subthreads that I
intend to act on in my incoming Git mbox. And of course as time goes on,
those get pushed further back in the pile. But when new messages arrive,
mutt attaches them to the old threads, and I sometimes don't see them
(until I go back and sift through the pile).

I wish there was a good way to have mutt remain in threaded mode, but
sort the threads by recent activity. Setting sort_aux=last-date kind of
works, but last time I tried it, I got annoyed that it did funny things
with the order of patches within a thread (if somebody replies to patch
3/5, and then 2/5, it will pull 3/5 down as "more recent").

Dscho, you may feel free to roll your eyes and mutter under your breath
about email if you wish. ;)

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects
  2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
@ 2019-12-09  6:24   ` Jeff King
  2019-12-11 13:48     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  6:24 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan,
	James Ramsay

On Fri, Nov 15, 2019 at 03:15:33PM +0100, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> To see when packfile reuse kicks in or not, it is useful to
> show reused packfile objects statistics in the output of
> upload-pack.

Yep, I think this one makes sense. I would be a bit curious to see how
often reuse actually kicks in for you with the current code. Back when I
worked on this topic in 2015, it was kicking in for us almost-never (at
least for big repos like torvalds/linux that we were interested in
measuring), but that's because we were packing all of the forks together
(and still do). The earlier heuristics made no sense for that case.

I don't know if it's really worth spending a lot of time trying to
collect data, though.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/9] packfile: expose get_delta_base()
  2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
@ 2019-12-09  6:26   ` Jeff King
  0 siblings, 0 replies; 42+ messages in thread
From: Jeff King @ 2019-12-09  6:26 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:34PM +0100, Christian Couder wrote:

> In a following commit get_delta_base() will be used outside
> packfile.c, so let's make it non static and declare it in
> packfile.h.

Yep, obvious and hopefully uncontroversial.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
@ 2019-12-09  6:28   ` Jeff King
  2019-12-11 13:50     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  6:28 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:35PM +0100, Christian Couder wrote:

> In a following commit we will need to allocate a variable
> number of bitmap words, instead of always 32, so let's add
> bitmap_word_alloc() for this purpose.

This name is much better than the horrible "bitmap_new2()" it was called
in the original. ;)

I wonder if "new_with_world_alloc" or "new_with_size" would make it more
obvious that this is also a constructor, though.

>  void bitmap_set(struct bitmap *self, size_t pos)
>  {
>  	size_t block = EWAH_BLOCK(pos);
>  
>  	if (block >= self->word_alloc) {
>  		size_t old_size = self->word_alloc;
> -		self->word_alloc = block * 2;
> +		self->word_alloc = block ? block * 2 : 1;

Since this hunk caused so much confusion, maybe worth calling it out in
the commit message. Something like:

  Note that we have to adjust the block growth in bitmap_set(), since
  a caller could now use an initial size of "0" (we don't plan to do
  that, but it doesn't hurt to be defensive).

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
@ 2019-12-09  6:47   ` Jeff King
  2019-12-13 13:26     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  6:47 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:36PM +0100, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> We will no longer compute bitmap_git->reuse_objects in a
> following commit, so we cannot rely on it anymore to
> terminate the loop early; we have to iterate to the end.

Hmm. In my original work from 2015 (which you never saw as individual
commits), this came somewhere in the middle, after moving to per-object
reuse.

I think by dropping these hunks now, the state of the code at this point
would mean that we might write the objects twice. We'd mark them as
"reused" and send them as part of write_reused_pack(). But we'd also
send them to pack-objects via the show_reachable_fn callback, and it
would add them to the usual packing list.

And indeed, t5310.10 fails at this point in the series with:

  Cloning into bare repository 'clone.git'...
  remote: Enumerating objects: 331, done.        
  remote: Counting objects: 100% (331/331), done.        
  remote: Compressing objects: 100% (111/111), done.        
  remote: Total 662 (delta 108), reused 331 (delta 108), pack-reused 331        
  Receiving objects: 100% (662/662), 53.14 KiB | 17.71 MiB/s, done.
  Resolving deltas: 100% (216/216), done.
  fatal: The same object 00c1d3730931e66eb08dabe3a3c9fa16621d728a appears twice in the pack
  fatal: index-pack failed

and then starts working again after the final patch.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
@ 2019-12-09  7:06   ` Jeff King
  2019-12-13 13:27     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  7:06 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:37PM +0100, Christian Couder wrote:

> We will use this helper function in a following commit to
> tell us if an object is packed.

Yeah, makes sense. This is eventually used in have_duplicate_entry() in
pack-objects, to check whether an object is already mentioned in
reuse_packfile_bitmap. And that's the part that would fix the test
failures from the previous commit.

But of course we don't yet have reuse_packfile_bitmap; that comes later.

> +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
> +			 struct bitmap *bitmap, const struct object_id *oid)
> +{
> +	int idx;
> +
> +	if (!bitmap)
> +		return 0;
> +
> +	idx = bitmap_position(bitmap_git, oid);
> +	return idx >= 0 && bitmap_get(bitmap, idx);
> +}

This is really a factoring out of code in
bitmap_has_oid_in_uninteresting(). So I think you could simplify that
like:

diff --git a/pack-bitmap.c b/pack-bitmap.c
index cbfc544411..f5749d0ab3 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1194,16 +1194,6 @@ void free_bitmap_index(struct bitmap_index *b)
 int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 				    const struct object_id *oid)
 {
-	int pos;
-
-	if (!bitmap_git)
-		return 0; /* no bitmap loaded */
-	if (!bitmap_git->haves)
-		return 0; /* walk had no "haves" */
-
-	pos = bitmap_position_packfile(bitmap_git, oid);
-	if (pos < 0)
-		return 0;
-
-	return bitmap_get(bitmap_git->haves, pos);
+	return bitmap_git &&
+	       bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
 }

One curiosity is that bitmap_has_oid_in_uninteresting() only uses
bitmap_position_packfile(), not bitmap_position(). So it wouldn't find
objects which weren't in the bitmapped packfile (i.e., ones where we
extended the bitmap to handle loose objects, or objects in other packs).

That seems like a bug in the current code to me. I suspect nobody
noticed because the only effect would be that sometimes we fail to
notice that we could reuse a delta against such an object (which isn't
incorrect, just suboptimal). I don't think p5311 would show this,
though, because it simulates a server that is fully packed.

I think it's probably still worth doing this as a preparatory patch,
though:

diff --git a/pack-bitmap.c b/pack-bitmap.c
index e07c798879..6df22e7291 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1125,7 +1125,7 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 	if (!bitmap_git->haves)
 		return 0; /* walk had no "haves" */
 
-	pos = bitmap_position_packfile(bitmap_git, oid);
+	pos = bitmap_position(bitmap_git, oid);
 	if (pos < 0)
 		return 0;
 

-Peff

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 6/9] csum-file: introduce hashfile_total()
  2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
@ 2019-12-09  7:07   ` Jeff King
  0 siblings, 0 replies; 42+ messages in thread
From: Jeff King @ 2019-12-09  7:07 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:38PM +0100, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> We will need this helper function in a following commit
> to give us total number of bytes fed to the hashfile so far.

Yep, makes sense.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse
  2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
@ 2019-12-09  7:14   ` Jeff King
  2019-12-13 13:27     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  7:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:39PM +0100, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> Let's make it possible to configure if we want pack reuse or not.

...because? :)

I mostly used it for debugging and performance testing. I don't think
there should be any big downside to using it that would cause people to
want to turn it off. But it _might_ cause larger packs, because we
wouldn't consider the reused objects as bases for finding new deltas.

I think the documentation could mention this. And maybe make it a bit
more clear what "reuse" means.

So maybe:

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 58323a351f..0dac580581 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -28,8 +28,11 @@ all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
 pack.allowPackReuse::
-	When true, which is the default, Git will try to reuse parts
-	of existing packfiles when preparing new packfiles.
+	When true, and when reachability bitmaps are enabled,
+	pack-objects will try to send parts of the bitmapped packfile
+	verbatim. This can reduce memory and CPU usage to serve fetches,
+	but might result in sending a slightly larger pack. Defaults to
+	true.
 
 pack.island::
 	An extended regular expression configuring a set of delta

-Peff

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed()
  2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
@ 2019-12-09  7:14   ` Jeff King
  0 siblings, 0 replies; 42+ messages in thread
From: Jeff King @ 2019-12-09  7:14 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:40PM +0100, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> Let's refactor the way we check if an object is packed by
> introducing obj_is_packed(). This function is now a simple
> wrapper around packlist_find(), but it will evolve in a
> following commit.

Yep, this is a good refactor.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
  2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
@ 2019-12-09  8:11   ` Jeff King
  2019-12-18 11:26     ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-09  8:11 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:

> The old code to reuse deltas from an existing packfile
> just tried to dump a whole segment of the pack verbatim.
> That's faster than the traditional way of actually adding
> objects to the packing list, but it didn't kick in very
> often. This new code is really going for a middle ground:
> do _some_ per-object work, but way less than we'd
> traditionally do.

This is a good intro to the general problem, but...

> For instance, packing torvalds/linux on GitHub servers
> just now reused 6.5M objects, but only needed ~50k
> chunks.

What the heck is a chunk? :)

I know, because I wrote the patches, but I think we need to sketch out
the solution a bit for the reader.

I think the flow here is:

  - we want to do some per-object work (i.e., your intro above)

  - the general strategy is to make a bitmap of objects from the
    packfile we'll include, and then iterate over it, writing out each
    object exactly as it is in our on-disk pack, but _not_ adding it to
    our packlist (which costs memory, and increases the search space for
    deltas)

  - one complication is that if we're omitting some objects, we can't
    set a delta against a base that we're not sending. So we have to
    check each object in try_partial_reuse() to make sure we have its
    delta (actually, that third big comment in that function is
    incomplete, I think; it talks about sending the object later, not as
    part of the reused pack, but we might not send it at all!).

  - another complication is that when we omit parts of the packfile,
    that screws up delta base offsets. So to handle that, we have to
    keep track of which chunks we send (which is the bits you included
    below about chunks)

  - and then we can talk about performance; worst case we might have
    interleaved objects that we are sending or not sending, and we'd
    have as many chunks as objects. But in practice we send big chunks,
    so that's where the 6.5M objects vs 50k chunks comes in.

> Additional checks are added in have_duplicate_entry()
> and obj_is_packed() to avoid duplicate objects in the
> reuse bitmap. It was probably buggy to not have such a
> check before.
> 
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
> 
> The simplest place to fix this is right in add_ref_tag,
> where we could avoid peeling the tag at all if we know that
> we are already including it. However, I've pushed the check
> instead into have_duplicate_entry(). This fixes not only
> this case, but also means that we cannot have any similar
> problems lurking in other code.

I think this part could go in its own commit. If we introduce
reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
at the beginning of the file with the existing code, then we can
introduce the extra checks in have_duplicate_entry() on top of that.

And then it would be safe to do the cleanup in show_objects_from_type()
that triggers the test failure in patch 4.

>  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
>  pack-bitmap.c          | 150 ++++++++++++++++++++--------
>  pack-bitmap.h          |   3 +-
>  3 files changed, 288 insertions(+), 87 deletions(-)

It might be worth having a perf test here. The case this is helping the
most, I think, is when somebody cloning wants all of the objects from
the beginning of the pack, but then there's a bunch of _other_ stuff.

That could be unreachable objects kept by "repack -k", or maybe objects
reachable outside of heads and tags. Or objects that are part of other
delta islands. This last is the most plausible, really, because we'll
put all of the root-island objects at the beginning of the pack. So
maybe a good perf test would be to take an existing repository add a
bunch of new commits in refs/foo/, and then repack with delta islands
such that refs/heads and refs/tags are in one (root) island, but
refs/foo is in another.

The old code should fail to do the pack-reuse thing, but we should get
pretty good reuse with the new code (and less CPU and peak RAM,
hopefully, though the perf suite doesn't measure RAM directly).


Answering some questions from Jonathan's response in the last round
(some of the quotes are his, some are yours):

> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
all (and I think the commit message should make that more clear, as I
described above).

> >> No tests, because git does not actually exhibit this "ask
> >> for it and also include-tag" behavior. We do one or the
> >> other on clone, depending on whether --single-branch is set.
> >> However, libgit2 does both.
>
> So my wild guess sould be that maybe the reason is that some of this
> code was shared (or intended to be shared) with libgit2?

No, the question is how the client behaves, and how we on the server
react to it. Git as a client would never ask for both include-tag and
for the tag itself by sha1. But libgit2 will, so a libgit2 client
cloning from a Git server would trigger the bug.

> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

Yes, it's possible. The result of a bitmap walk is larger than a given
pack, because we have to extend it to match whatever objects the caller
asked for. E.g., imagine we have commit A, repack into into a bitmapped
pack, and then somebody pushes up commit B. Now I want to clone, getting
both A and B.

Our bitmap result represents the whole answer, and so must include both
objects. But since "B" isn't in the pack, it doesn't have an assigned
bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
bit (valid only for that walk result!). But of course for pack reuse, we
only care about the objects that were actually in the bitmapped pack.

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

This is the same issue as above. Imagine instead of two objects, imagine
we have 42 words worth. But if only 20 words worth are in the packfile,
then we have to stop there.

Now we _could_ figure this out for each individual object through the
similar logic in try_partial_reuse(). But we try to take an even faster
path for the initial chunk of objects. We don't even walk over them
looking to see if they're deltas or not. We just send that chunk of the
pack verbatim.

I don't have numbers for how often we hit this path versus the
individual try_partial_reuse() path. Possibly the stats could
differentiate that.

Thinking on it more, though, I wonder if there's a bug hiding here. We
know that we can send the whole initial chunk verbatim for OFS_DELTA
objects, because the relative offsets will remain unchanged (i.e., there
are no "holes" that trigger our chunk code). But if there were a
REF_DELTA in that initial chunk, we have no certainty that the base is
being sent.

In practice, this doesn't happen because the objects in question have to
be in a pack with bitmaps, which means it was written ourselves by
git-repack. And we'd never write REF_DELTA objects there.

But I wonder if it's worth being a bit more careful (and what the
performance impact is; it would mean checking the type of every object
in that initial chunk).

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-09  6:18                 ` Jeff King
@ 2019-12-09  9:28                   ` Johannes Schindelin
  2019-12-09 19:00                   ` Junio C Hamano
  2019-12-09 19:05                   ` Junio C Hamano
  2 siblings, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2019-12-09  9:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Christian Couder, Junio C Hamano, Jonathan Tan, git,
	Christian Couder, Ramsay Jones

Hi Peff,

On Mon, 9 Dec 2019, Jeff King wrote:

> On Sun, Dec 08, 2019 at 09:54:01AM +0100, Johannes Schindelin wrote:
>
> > > That's why I put Peff as the author of the patches.
> >
> > No, that is not the reason. You might think that that is the reason, but
> > the real reason why Peff is marked as the author of those patches is that
> > he really authored those patches.
> >
> > In light of what you said, I don't think that it is a good idea to go
> > forward by leaning even further on Peff. From his activity on the Git
> > mailing list, I deduce that he is not exactly in need of even more work.
> >
> > Instead, I think that if you truly want to push these patches forward, you
> > will have to dig deeper yourself, and answer Jonathan Tan's questions, and
> > possibly adjust the patches accordingly and send a new iteration.
> >
> > I perceive it as very unfair toward Peff that this has not yet happened.
>
> To be clear, I am not bothered by this. And in fact I feel bad that I
> promised Christian that I take a careful look at the patches again, but
> haven't gotten around to it (for an embarrassingly long time now).
>
> Now I would _love_ if somebody else dug into the topic enough to
> understand all of the ins and outs, and whether what they're doing is
> sane (or could be done better).

That's what I thought.

When I bring patches to the Git mailing list, it means implicitly not only
that I understand the ins and outs of them, but also that I am fully
prepared to address reviewer comments and send out new, enhanced
iterations.

That holds when I send patch series that include patches authored by
someone else than me. I thought that that is kind of expected, otherwise
there would be no good reason for _me_ to send those patches, right?

> But barring that, these patches have been battle-tested for many years
> on GitHub's servers, so even if we just take them as-is I hope it would
> be an improvement.
>
> Fortunately I have some other work to do that I would like very much to
> procrastinate on, so let me see if that can summon the willpower for me
> to review these.

Heh, I know that feeling.

> > Well, you have time enough to send lengthy replies on a Sunday morning
> > (while Peff apparently did not even have time to say that he lacks the
> > time to work on this).
>
> One tricky thing here is that I leave messages or subthreads that I
> intend to act on in my incoming Git mbox. And of course as time goes on,
> those get pushed further back in the pile. But when new messages arrive,
> mutt attaches them to the old threads, and I sometimes don't see them
> (until I go back and sift through the pile).
>
> I wish there was a good way to have mutt remain in threaded mode, but
> sort the threads by recent activity. Setting sort_aux=last-date kind of
> works, but last time I tried it, I got annoyed that it did funny things
> with the order of patches within a thread (if somebody replies to patch
> 3/5, and then 2/5, it will pull 3/5 down as "more recent").

When I hit such a situation, I usually go on this kind of insane
side-track to figure out whether it would be easy to fix this (it's open
source, after all). Last time I tried such a thing, though, I had to admit
that it was not easy (but I use Alpine, not mutt, out of sheer inability
to adjust my muscle memory).

> Dscho, you may feel free to roll your eyes and mutter under your breath
> about email if you wish. ;)

Done.

;-)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-09  6:18                 ` Jeff King
  2019-12-09  9:28                   ` Johannes Schindelin
@ 2019-12-09 19:00                   ` Junio C Hamano
  2019-12-09 19:05                   ` Junio C Hamano
  2 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2019-12-09 19:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Christian Couder, Jonathan Tan, git,
	Christian Couder, Ramsay Jones

Jeff King <peff@peff.net> writes:

> Now I would _love_ if somebody else dug into the topic enough to
> understand all of the ins and outs, and whether what they're doing is
> sane (or could be done better). But barring that, these patches have
> been battle-tested for many years on GitHub's servers, so even if we
> just take them as-is I hope it would be an improvement.

Sorry; it wasn't my intention to ask accelerating the topic
forward---it was just to see the current status to adjust my
expectations.  And you could have stopped at the end of this
paragraph but ...

> Fortunately I have some other work to do that I would like very much to
> procrastinate on, so let me see if that can summon the willpower for me
> to review these.

... I'd love to see us benefit from this ;-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 0/9] Rewrite packfile reuse code
  2019-12-09  6:18                 ` Jeff King
  2019-12-09  9:28                   ` Johannes Schindelin
  2019-12-09 19:00                   ` Junio C Hamano
@ 2019-12-09 19:05                   ` Junio C Hamano
  2 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2019-12-09 19:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Christian Couder, Jonathan Tan, git,
	Christian Couder, Ramsay Jones

Jeff King <peff@peff.net> writes:

> One tricky thing here is that I leave messages or subthreads that I
> intend to act on in my incoming Git mbox. And of course as time goes on,
> those get pushed further back in the pile. But when new messages arrive,
> mutt attaches them to the old threads, and I sometimes don't see them
> (until I go back and sift through the pile).

This is why I still let a tab in my browser to be squat by GMail
that shows only the traffic sent to this mailing list, as it is very
good at surfacing a thread with new activity even though it is bad
at everything else, including threading.

For real work, picking messages up and responding to them, I read
and handle the list traffic via NNTP interface to public-inbox (and
lore.k.o these days), but keeping GMail purely as a notification
channel has its uses ;-).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects
  2019-12-09  6:24   ` Jeff King
@ 2019-12-11 13:48     ` Christian Couder
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Couder @ 2019-12-11 13:48 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan,
	James Ramsay

On Mon, Dec 9, 2019 at 7:24 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:33PM +0100, Christian Couder wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > To see when packfile reuse kicks in or not, it is useful to
> > show reused packfile objects statistics in the output of
> > upload-pack.
>
> Yep, I think this one makes sense. I would be a bit curious to see how
> often reuse actually kicks in for you with the current code. Back when I
> worked on this topic in 2015, it was kicking in for us almost-never (at
> least for big repos like torvalds/linux that we were interested in
> measuring), but that's because we were packing all of the forks together
> (and still do). The earlier heuristics made no sense for that case.
>
> I don't know if it's really worth spending a lot of time trying to
> collect data, though.

Unfortunately it's not yet easy for me, or people in my team, to
collect data from production servers. I will try to ping people about
that though.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-12-09  6:28   ` Jeff King
@ 2019-12-11 13:50     ` Christian Couder
  2019-12-12  5:45       ` Jeff King
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-12-11 13:50 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Mon, Dec 9, 2019 at 7:28 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:35PM +0100, Christian Couder wrote:
>
> > In a following commit we will need to allocate a variable
> > number of bitmap words, instead of always 32, so let's add
> > bitmap_word_alloc() for this purpose.
>
> This name is much better than the horrible "bitmap_new2()" it was called
> in the original. ;)
>
> I wonder if "new_with_world_alloc" or "new_with_size" would make it more
> obvious that this is also a constructor, though.

bitmap_new_with_word_alloc() feels a bit verbose to me for such a
short function, so for now I kept bitmap_word_alloc().

I think that if we really want to use "new" for constructors then a
better solution would be something like renaming bitmap_new(void) to
bitmap_new_default(void) and bitmap_word_alloc(size_t word_alloc) to
bitmap_new(size_t word_alloc).

> >  void bitmap_set(struct bitmap *self, size_t pos)
> >  {
> >       size_t block = EWAH_BLOCK(pos);
> >
> >       if (block >= self->word_alloc) {
> >               size_t old_size = self->word_alloc;
> > -             self->word_alloc = block * 2;
> > +             self->word_alloc = block ? block * 2 : 1;
>
> Since this hunk caused so much confusion, maybe worth calling it out in
> the commit message. Something like:
>
>   Note that we have to adjust the block growth in bitmap_set(), since
>   a caller could now use an initial size of "0" (we don't plan to do
>   that, but it doesn't hurt to be defensive).

Ok, I added the above as a commit message note in my current version.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-12-11 13:50     ` Christian Couder
@ 2019-12-12  5:45       ` Jeff King
  0 siblings, 0 replies; 42+ messages in thread
From: Jeff King @ 2019-12-12  5:45 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Wed, Dec 11, 2019 at 02:50:40PM +0100, Christian Couder wrote:

> > I wonder if "new_with_world_alloc" or "new_with_size" would make it more
> > obvious that this is also a constructor, though.
> 
> bitmap_new_with_word_alloc() feels a bit verbose to me for such a
> short function, so for now I kept bitmap_word_alloc().

OK, that's fine with me...

> I think that if we really want to use "new" for constructors then a
> better solution would be something like renaming bitmap_new(void) to
> bitmap_new_default(void) and bitmap_word_alloc(size_t word_alloc) to
> bitmap_new(size_t word_alloc).

...but that also would be fine with me.

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-12-09  6:47   ` Jeff King
@ 2019-12-13 13:26     ` Christian Couder
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Couder @ 2019-12-13 13:26 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Mon, Dec 9, 2019 at 7:47 AM Jeff King <peff@peff.net> wrote:


> I think by dropping these hunks now, the state of the code at this point
> would mean that we might write the objects twice. We'd mark them as
> "reused" and send them as part of write_reused_pack(). But we'd also
> send them to pack-objects via the show_reachable_fn callback, and it
> would add them to the usual packing list.
>
> And indeed, t5310.10 fails at this point in the series with:
>
>   Cloning into bare repository 'clone.git'...
>   remote: Enumerating objects: 331, done.
>   remote: Counting objects: 100% (331/331), done.
>   remote: Compressing objects: 100% (111/111), done.
>   remote: Total 662 (delta 108), reused 331 (delta 108), pack-reused 331
>   Receiving objects: 100% (662/662), 53.14 KiB | 17.71 MiB/s, done.
>   Resolving deltas: 100% (216/216), done.
>   fatal: The same object 00c1d3730931e66eb08dabe3a3c9fa16621d728a appears twice in the pack
>   fatal: index-pack failed
>
> and then starts working again after the final patch.

Yeah, I thought I had tested this, but anyway I squashed this patch
into the final patch to avoid failures.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-12-09  7:06   ` Jeff King
@ 2019-12-13 13:27     ` Christian Couder
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Couder @ 2019-12-13 13:27 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Mon, Dec 9, 2019 at 8:06 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:37PM +0100, Christian Couder wrote:

> > +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
> > +                      struct bitmap *bitmap, const struct object_id *oid)
> > +{
> > +     int idx;
> > +
> > +     if (!bitmap)
> > +             return 0;
> > +
> > +     idx = bitmap_position(bitmap_git, oid);
> > +     return idx >= 0 && bitmap_get(bitmap, idx);
> > +}
>
> This is really a factoring out of code in
> bitmap_has_oid_in_uninteresting(). So I think you could simplify that
> like:
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index cbfc544411..f5749d0ab3 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -1194,16 +1194,6 @@ void free_bitmap_index(struct bitmap_index *b)
>  int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
>                                     const struct object_id *oid)
>  {
> -       int pos;
> -
> -       if (!bitmap_git)
> -               return 0; /* no bitmap loaded */
> -       if (!bitmap_git->haves)
> -               return 0; /* walk had no "haves" */
> -
> -       pos = bitmap_position_packfile(bitmap_git, oid);
> -       if (pos < 0)
> -               return 0;
> -
> -       return bitmap_get(bitmap_git->haves, pos);
> +       return bitmap_git &&
> +              bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
>  }

Yeah, nice simplification. I added a patch doing that.

> One curiosity is that bitmap_has_oid_in_uninteresting() only uses
> bitmap_position_packfile(), not bitmap_position(). So it wouldn't find
> objects which weren't in the bitmapped packfile (i.e., ones where we
> extended the bitmap to handle loose objects, or objects in other packs).
>
> That seems like a bug in the current code to me. I suspect nobody
> noticed because the only effect would be that sometimes we fail to
> notice that we could reuse a delta against such an object (which isn't
> incorrect, just suboptimal). I don't think p5311 would show this,
> though, because it simulates a server that is fully packed.
>
> I think it's probably still worth doing this as a preparatory patch,
> though:
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index e07c798879..6df22e7291 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -1125,7 +1125,7 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
>         if (!bitmap_git->haves)
>                 return 0; /* walk had no "haves" */
>
> -       pos = bitmap_position_packfile(bitmap_git, oid);
> +       pos = bitmap_position(bitmap_git, oid);
>         if (pos < 0)
>                 return 0;

Yeah, I agree that it's a good idea to do it in a preparatory patch,
so I added a patch doing that before the one doing the simplification
you suggest above.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse
  2019-12-09  7:14   ` Jeff King
@ 2019-12-13 13:27     ` Christian Couder
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Couder @ 2019-12-13 13:27 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Mon, Dec 9, 2019 at 8:14 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:39PM +0100, Christian Couder wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > Let's make it possible to configure if we want pack reuse or not.
>
> ...because? :)
>
> I mostly used it for debugging and performance testing. I don't think
> there should be any big downside to using it that would cause people to
> want to turn it off. But it _might_ cause larger packs, because we
> wouldn't consider the reused objects as bases for finding new deltas.

Ok, I changed the commit message to:

    Let's make it possible to configure if we want pack reuse or not.

    The main reason it might not be wanted is probably debugging and
    performance testing, though pack reuse _might_ cause larger packs,
    because we wouldn't consider the reused objects as bases for
    finding new deltas.

> I think the documentation could mention this. And maybe make it a bit
> more clear what "reuse" means.
>
> So maybe:
>
> diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
> index 58323a351f..0dac580581 100644
> --- a/Documentation/config/pack.txt
> +++ b/Documentation/config/pack.txt
> @@ -28,8 +28,11 @@ all existing objects. You can force recompression by passing the -F option
>  to linkgit:git-repack[1].
>
>  pack.allowPackReuse::
> -       When true, which is the default, Git will try to reuse parts
> -       of existing packfiles when preparing new packfiles.
> +       When true, and when reachability bitmaps are enabled,
> +       pack-objects will try to send parts of the bitmapped packfile
> +       verbatim. This can reduce memory and CPU usage to serve fetches,
> +       but might result in sending a slightly larger pack. Defaults to
> +       true.

Yeah, great! I use the above in my current version. Thanks!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
  2019-12-09  8:11   ` Jeff King
@ 2019-12-18 11:26     ` Christian Couder
  2019-12-19  0:42       ` Jeff King
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Couder @ 2019-12-18 11:26 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Mon, Dec 9, 2019 at 9:11 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:
>
> > The old code to reuse deltas from an existing packfile
> > just tried to dump a whole segment of the pack verbatim.
> > That's faster than the traditional way of actually adding
> > objects to the packing list, but it didn't kick in very
> > often. This new code is really going for a middle ground:
> > do _some_ per-object work, but way less than we'd
> > traditionally do.
>
> This is a good intro to the general problem, but...
>
> > For instance, packing torvalds/linux on GitHub servers
> > just now reused 6.5M objects, but only needed ~50k
> > chunks.
>
> What the heck is a chunk? :)
>
> I know, because I wrote the patches, but I think we need to sketch out
> the solution a bit for the reader.

Ok with sketching it out.

> I think the flow here is:
>
>   - we want to do some per-object work (i.e., your intro above)
>
>   - the general strategy is to make a bitmap of objects from the
>     packfile we'll include, and then iterate over it, writing out each
>     object exactly as it is in our on-disk pack, but _not_ adding it to
>     our packlist (which costs memory, and increases the search space for
>     deltas)
>
>   - one complication is that if we're omitting some objects, we can't
>     set a delta against a base that we're not sending. So we have to
>     check each object in try_partial_reuse() to make sure we have its
>     delta (actually, that third big comment in that function is
>     incomplete, I think; it talks about sending the object later, not as
>     part of the reused pack, but we might not send it at all!).

Are you talking about this comment:

        /*
         * And finally, if we're not sending the base as part of our
         * reuse chunk, then don't send this object either. The base
         * would come after us, along with other objects not
         * necessarily in the pack, which means we'd need to convert
         * to REF_DELTA on the fly. Better to just let the normal
         * object_entry code path handle it.
         */

?

I don't see how it's talking about sending the object later, so I have
left it as is. Maybe you can check it again in the v4 series I am
going to send.

>   - another complication is that when we omit parts of the packfile,
>     that screws up delta base offsets. So to handle that, we have to
>     keep track of which chunks we send (which is the bits you included
>     below about chunks)
>
>   - and then we can talk about performance; worst case we might have
>     interleaved objects that we are sending or not sending, and we'd
>     have as many chunks as objects. But in practice we send big chunks,
>     so that's where the 6.5M objects vs 50k chunks comes in.

Ok, I have added your points above to the commit message.

> > Additional checks are added in have_duplicate_entry()
> > and obj_is_packed() to avoid duplicate objects in the
> > reuse bitmap. It was probably buggy to not have such a
> > check before.
> >
> > If a client both asks for a tag by sha1 and specifies
> > "include-tag", we may end up including the tag in the reuse
> > bitmap (due to the first thing), and then later adding it to
> > the packlist (due to the second). This results in duplicate
> > objects in the pack, which git chokes on. We should notice
> > that we are already including it when doing the include-tag
> > portion, and avoid adding it to the packlist.
> >
> > The simplest place to fix this is right in add_ref_tag,
> > where we could avoid peeling the tag at all if we know that
> > we are already including it. However, I've pushed the check
> > instead into have_duplicate_entry(). This fixes not only
> > this case, but also means that we cannot have any similar
> > problems lurking in other code.
>
> I think this part could go in its own commit. If we introduce
> reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
> at the beginning of the file with the existing code, then we can
> introduce the extra checks in have_duplicate_entry() on top of that.

Ok, I have extracted this part into its own commit.

> And then it would be safe to do the cleanup in show_objects_from_type()
> that triggers the test failure in patch 4.

Ok, I have eventually moved patch 4 at the end of the series.

> >  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
> >  pack-bitmap.c          | 150 ++++++++++++++++++++--------
> >  pack-bitmap.h          |   3 +-
> >  3 files changed, 288 insertions(+), 87 deletions(-)
>
> It might be worth having a perf test here. The case this is helping the
> most, I think, is when somebody cloning wants all of the objects from
> the beginning of the pack, but then there's a bunch of _other_ stuff.
>
> That could be unreachable objects kept by "repack -k", or maybe objects
> reachable outside of heads and tags. Or objects that are part of other
> delta islands. This last is the most plausible, really, because we'll
> put all of the root-island objects at the beginning of the pack. So
> maybe a good perf test would be to take an existing repository add a
> bunch of new commits in refs/foo/,

Not sure how I could do so. How would you do that?

I think it would be best to add completely new realistic commits that
are changing the main code base.

> and then repack with delta islands
> such that refs/heads and refs/tags are in one (root) island, but
> refs/foo is in another.
>
> The old code should fail to do the pack-reuse thing, but we should get
> pretty good reuse with the new code (and less CPU and peak RAM,
> hopefully, though the perf suite doesn't measure RAM directly).

I haven't added a perf test. I may do it if I get an idea about how to
add new commits in refs/foo/.

> Answering some questions from Jonathan's response in the last round
> (some of the quotes are his, some are yours):
>
> > I still don't understand this part. Going off what's written in the
> > commit message here, it seems to me that the issue is that (1) an object
> > can be added to the reuse bitmap without going through to_pack, and (2)
> > this is done when the client asks for a tag by SHA-1. But all objects
> > when specified by SHA-1 go through to_pack, don't they?
>
> No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
> all (and I think the commit message should make that more clear, as I
> described above).

I used your description above to improve the commit message, so I
guess it now answers his question.

> > >> No tests, because git does not actually exhibit this "ask
> > >> for it and also include-tag" behavior. We do one or the
> > >> other on clone, depending on whether --single-branch is set.
> > >> However, libgit2 does both.
> >
> > So my wild guess sould be that maybe the reason is that some of this
> > code was shared (or intended to be shared) with libgit2?
>
> No, the question is how the client behaves, and how we on the server
> react to it. Git as a client would never ask for both include-tag and
> for the tag itself by sha1. But libgit2 will, so a libgit2 client
> cloning from a Git server would trigger the bug.

Ok, I have made it explicit in the commit message that the bug would
be triggered by a libgit2 client but not a Git client.

> > > +     if (pos >= bitmap_git->pack->num_objects)
> > > +             return; /* not actually in the pack */
> >
> > Is this case possible? try_partial_reuse() is only called when there is
> > a 1 at the bit location.
>
> Yes, it's possible. The result of a bitmap walk is larger than a given
> pack, because we have to extend it to match whatever objects the caller
> asked for. E.g., imagine we have commit A, repack into into a bitmapped
> pack, and then somebody pushes up commit B. Now I want to clone, getting
> both A and B.
>
> Our bitmap result represents the whole answer, and so must include both
> objects. But since "B" isn't in the pack, it doesn't have an assigned
> bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> bit (valid only for that walk result!). But of course for pack reuse, we
> only care about the objects that were actually in the bitmapped pack.

Not sure if these explanations should go into the commit message or a
comment in the code. So I haven't added them for now.

> > > +     /* Don't mark objects not in the packfile */
> > > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
> >
> > Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> > i is 42. Shouldn't we allocate 42 words in our reuse bitmap?
>
> This is the same issue as above. Imagine instead of two objects, imagine
> we have 42 words worth. But if only 20 words worth are in the packfile,
> then we have to stop there.
>
> Now we _could_ figure this out for each individual object through the
> similar logic in try_partial_reuse(). But we try to take an even faster
> path for the initial chunk of objects. We don't even walk over them
> looking to see if they're deltas or not. We just send that chunk of the
> pack verbatim.
>
> I don't have numbers for how often we hit this path versus the
> individual try_partial_reuse() path. Possibly the stats could
> differentiate that.

Also not sure if these explanations should go into the commit message
or a comment in the code. So I haven't added them for now.

> Thinking on it more, though, I wonder if there's a bug hiding here. We
> know that we can send the whole initial chunk verbatim for OFS_DELTA
> objects, because the relative offsets will remain unchanged (i.e., there
> are no "holes" that trigger our chunk code). But if there were a
> REF_DELTA in that initial chunk, we have no certainty that the base is
> being sent.
>
> In practice, this doesn't happen because the objects in question have to
> be in a pack with bitmaps, which means it was written ourselves by
> git-repack. And we'd never write REF_DELTA objects there.
>
> But I wonder if it's worth being a bit more careful (and what the
> performance impact is; it would mean checking the type of every object
> in that initial chunk).

I haven't done anything related to that. Maybe something can be done
in a follow up patch.

Thanks a lot for the review!

Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
  2019-12-18 11:26     ` Christian Couder
@ 2019-12-19  0:42       ` Jeff King
  2020-01-23 22:29         ` Christian Couder
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2019-12-19  0:42 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote:

> >   - one complication is that if we're omitting some objects, we can't
> >     set a delta against a base that we're not sending. So we have to
> >     check each object in try_partial_reuse() to make sure we have its
> >     delta (actually, that third big comment in that function is
> >     incomplete, I think; it talks about sending the object later, not as
> >     part of the reused pack, but we might not send it at all!).
> 
> Are you talking about this comment:
> 
>         /*
>          * And finally, if we're not sending the base as part of our
>          * reuse chunk, then don't send this object either. The base
>          * would come after us, along with other objects not
>          * necessarily in the pack, which means we'd need to convert
>          * to REF_DELTA on the fly. Better to just let the normal
>          * object_entry code path handle it.
>          */
> 
> ?
> 
> I don't see how it's talking about sending the object later, so I have
> left it as is. Maybe you can check it again in the v4 series I am
> going to send.

Yes, that's the comment. It says "the base would come after us". That
could be true, but it also could be that we are not sending the object
at all (in fact, that seems more likely in practice). The outcome is the
same, though: we don't verbatim reuse the delta'd object if its base is
not also being reused.

> > It might be worth having a perf test here. The case this is helping the
> > most, I think, is when somebody cloning wants all of the objects from
> > the beginning of the pack, but then there's a bunch of _other_ stuff.
> >
> > That could be unreachable objects kept by "repack -k", or maybe objects
> > reachable outside of heads and tags. Or objects that are part of other
> > delta islands. This last is the most plausible, really, because we'll
> > put all of the root-island objects at the beginning of the pack. So
> > maybe a good perf test would be to take an existing repository add a
> > bunch of new commits in refs/foo/,
> 
> Not sure how I could do so. How would you do that?
> 
> I think it would be best to add completely new realistic commits that
> are changing the main code base.

I agree that would be the most realistic for the "forked repository
network" case that we originally wrote this for. But I think a more
mundane (and possibly easier to generate) example might be having a
bunch of refs/notes.

So perhaps something like this:

-- >8 --
#!/bin/sh

test_description='verbatim pack reuse with bitmaps'
. ./perf-lib.sh

test_perf_large_repo

test_expect_success 'generate a bunch of notes' '
	# bleh, is there really no way to bulk-add with git-notes?
	{
		test_tick &&
		echo "commit refs/notes/commits" &&
		printf "author %s <%s> %s\\n" \
			"$GIT_AUTHOR_NAME" \
			"$GIT_AUTHOR_EMAIL" \
			"$GIT_AUTHOR_DATE" &&
		printf "committer %s <%s> %s\\n" \
			"$GIT_COMMITTER_NAME" \
			"$GIT_COMMITTER_EMAIL" \
			"$GIT_COMMITTER_DATE" &&
		echo "data <<EOF" &&
		echo "add many commit notes" &&
		echo "EOF" &&
		git rev-list HEAD |
		sed "s#^\\(..\\)\\(..\\)#\1/\2/#" |
		while read note
		do
			echo "M 644 inline $note"
			echo "data <<EOF"
			echo "a unique note for $note"
			echo "EOF"
		done &&
		echo
	} |
	git fast-import --force
'

test_expect_success 'create bitmaps' '
	git repack -adb
'

test_perf 'clone without notes' '
	git for-each-ref --format="%(objectname)" refs/heads/ refs/tags/ |
	git pack-objects --stdout --revs --delta-base-offset >clone.pack
'

test_size 'clone size' '
	wc -c <clone.pack
'

test_done
-- 8< --

From my limited prodding, it doesn't trigger pack-reuse with the current
code, but would after your series. On linux.git, it produces:

Test                          origin              origin/jk/packfile-reuse-cleanup
----------------------------------------------------------------------------------
5312.3: clone without notes   14.16(14.26+4.84)   10.80(10.40+0.44) -23.7%        
5312.4: clone size                       1.4G                1.4G +0.0%           

though I suspect the more interesting savings is in RAM (since we don't
have to allocate a "struct object_entry" for most of the objects). I
don't know how hard it would be to collect that data in the perf suite.

Running the final pack-objects manually with massif, I get peak heap
usage dropping from ~1500MB to ~380MB.

For git.git it seems less impressive (very little CPU saved, and the
resulting pack is slightly larger, perhaps due to not re-considering
some deltas whose on-disk versions had to be thrown away):

Test                          origin            origin/jk/packfile-reuse-cleanup
--------------------------------------------------------------------------------
5312.3: clone without notes   1.22(3.60+0.16)   1.20(3.56+0.16) -1.6%           
5312.4: clone size                      65.3M             67.0M +2.5%           

> > > Is this case possible? try_partial_reuse() is only called when there is
> > > a 1 at the bit location.
> >
> > Yes, it's possible. The result of a bitmap walk is larger than a given
> > pack, because we have to extend it to match whatever objects the caller
> > asked for. E.g., imagine we have commit A, repack into into a bitmapped
> > pack, and then somebody pushes up commit B. Now I want to clone, getting
> > both A and B.
> >
> > Our bitmap result represents the whole answer, and so must include both
> > objects. But since "B" isn't in the pack, it doesn't have an assigned
> > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> > bit (valid only for that walk result!). But of course for pack reuse, we
> > only care about the objects that were actually in the bitmapped pack.
> 
> Not sure if these explanations should go into the commit message or a
> comment in the code. So I haven't added them for now.

I think this is really outside the scope of this series, even, and gets
into how the bitmap traversal works in general. I'm sure the
documentation for that could be a bit better.

> > Thinking on it more, though, I wonder if there's a bug hiding here. We
> > know that we can send the whole initial chunk verbatim for OFS_DELTA
> > objects, because the relative offsets will remain unchanged (i.e., there
> > are no "holes" that trigger our chunk code). But if there were a
> > REF_DELTA in that initial chunk, we have no certainty that the base is
> > being sent.
> >
> > In practice, this doesn't happen because the objects in question have to
> > be in a pack with bitmaps, which means it was written ourselves by
> > git-repack. And we'd never write REF_DELTA objects there.
> >
> > But I wonder if it's worth being a bit more careful (and what the
> > performance impact is; it would mean checking the type of every object
> > in that initial chunk).
> 
> I haven't done anything related to that. Maybe something can be done
> in a follow up patch.

Hmm.  I was thinking the problem was introduced in this series, but the
old code should suffer from this as well. I wondered if it might simply
be that the old code did not trigger often enough for anybody to notice,
but we have been running the code in this series for several years at
GitHub. So probably my reasoning above is sound (but it might still be
worth addressing).

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
  2019-12-19  0:42       ` Jeff King
@ 2020-01-23 22:29         ` Christian Couder
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Couder @ 2020-01-23 22:29 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On Thu, Dec 19, 2019 at 1:42 AM Jeff King <peff@peff.net> wrote:
>
> On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote:
>
> > >   - one complication is that if we're omitting some objects, we can't
> > >     set a delta against a base that we're not sending. So we have to
> > >     check each object in try_partial_reuse() to make sure we have its
> > >     delta (actually, that third big comment in that function is
> > >     incomplete, I think; it talks about sending the object later, not as
> > >     part of the reused pack, but we might not send it at all!).
> >
> > Are you talking about this comment:
> >
> >         /*
> >          * And finally, if we're not sending the base as part of our
> >          * reuse chunk, then don't send this object either. The base
> >          * would come after us, along with other objects not
> >          * necessarily in the pack, which means we'd need to convert
> >          * to REF_DELTA on the fly. Better to just let the normal
> >          * object_entry code path handle it.
> >          */
> >
> > ?
> >
> > I don't see how it's talking about sending the object later, so I have
> > left it as is. Maybe you can check it again in the v4 series I am
> > going to send.
>
> Yes, that's the comment. It says "the base would come after us". That
> could be true, but it also could be that we are not sending the object
> at all (in fact, that seems more likely in practice). The outcome is the
> same, though: we don't verbatim reuse the delta'd object if its base is
> not also being reused.

In the V4 (https://public-inbox.org/git/20191218112547.4974-11-chriscool@tuxfamily.org/)
there is now the following in the commit message:

One complication is that if we're omitting some objects,
we can't set a delta against a base that we're not
sending. So we have to check each object in
try_partial_reuse() to make sure we have its delta.

Would you also want the comment to be improved to something like:

         /*
          * And finally, if we're not sending the base as part of our
          * reuse chunk, then don't send this object either. Because we
          * should not send the object at all. Or because the base
          * would come after us, along with other objects not
          * necessarily in the pack, which means we'd need to convert
          * to REF_DELTA on the fly. Better to just let the normal
          * object_entry code path handle it.
          */

> > > It might be worth having a perf test here. The case this is helping the
> > > most, I think, is when somebody cloning wants all of the objects from
> > > the beginning of the pack, but then there's a bunch of _other_ stuff.
> > >
> > > That could be unreachable objects kept by "repack -k", or maybe objects
> > > reachable outside of heads and tags. Or objects that are part of other
> > > delta islands. This last is the most plausible, really, because we'll
> > > put all of the root-island objects at the beginning of the pack. So
> > > maybe a good perf test would be to take an existing repository add a
> > > bunch of new commits in refs/foo/,
> >
> > Not sure how I could do so. How would you do that?
> >
> > I think it would be best to add completely new realistic commits that
> > are changing the main code base.
>
> I agree that would be the most realistic for the "forked repository
> network" case that we originally wrote this for. But I think a more
> mundane (and possibly easier to generate) example might be having a
> bunch of refs/notes.
>
> So perhaps something like this:

[...]

Ok, I will add this test to my patch series if I need to resend a V5.
Otherwise I will send it separately.

> From my limited prodding, it doesn't trigger pack-reuse with the current
> code, but would after your series. On linux.git, it produces:
>
> Test                          origin              origin/jk/packfile-reuse-cleanup
> ----------------------------------------------------------------------------------
> 5312.3: clone without notes   14.16(14.26+4.84)   10.80(10.40+0.44) -23.7%
> 5312.4: clone size                       1.4G                1.4G +0.0%

Yeah, nice improvement.

> though I suspect the more interesting savings is in RAM (since we don't
> have to allocate a "struct object_entry" for most of the objects). I
> don't know how hard it would be to collect that data in the perf suite.

I don't know either.

> Running the final pack-objects manually with massif, I get peak heap
> usage dropping from ~1500MB to ~380MB.

Very nice!

> For git.git it seems less impressive (very little CPU saved, and the
> resulting pack is slightly larger, perhaps due to not re-considering
> some deltas whose on-disk versions had to be thrown away):
>
> Test                          origin            origin/jk/packfile-reuse-cleanup
> --------------------------------------------------------------------------------
> 5312.3: clone without notes   1.22(3.60+0.16)   1.20(3.56+0.16) -1.6%
> 5312.4: clone size                      65.3M             67.0M +2.5%

Yeah, I think this is acceptable.

> > > > Is this case possible? try_partial_reuse() is only called when there is
> > > > a 1 at the bit location.
> > >
> > > Yes, it's possible. The result of a bitmap walk is larger than a given
> > > pack, because we have to extend it to match whatever objects the caller
> > > asked for. E.g., imagine we have commit A, repack into into a bitmapped
> > > pack, and then somebody pushes up commit B. Now I want to clone, getting
> > > both A and B.
> > >
> > > Our bitmap result represents the whole answer, and so must include both
> > > objects. But since "B" isn't in the pack, it doesn't have an assigned
> > > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> > > bit (valid only for that walk result!). But of course for pack reuse, we
> > > only care about the objects that were actually in the bitmapped pack.
> >
> > Not sure if these explanations should go into the commit message or a
> > comment in the code. So I haven't added them for now.
>
> I think this is really outside the scope of this series, even, and gets
> into how the bitmap traversal works in general. I'm sure the
> documentation for that could be a bit better.

Ok, I will take a look at improving the documentation, either in this
patch series if I resend it, or in a follow up small series.

> > > Thinking on it more, though, I wonder if there's a bug hiding here. We
> > > know that we can send the whole initial chunk verbatim for OFS_DELTA
> > > objects, because the relative offsets will remain unchanged (i.e., there
> > > are no "holes" that trigger our chunk code). But if there were a
> > > REF_DELTA in that initial chunk, we have no certainty that the base is
> > > being sent.
> > >
> > > In practice, this doesn't happen because the objects in question have to
> > > be in a pack with bitmaps, which means it was written ourselves by
> > > git-repack. And we'd never write REF_DELTA objects there.
> > >
> > > But I wonder if it's worth being a bit more careful (and what the
> > > performance impact is; it would mean checking the type of every object
> > > in that initial chunk).
> >
> > I haven't done anything related to that. Maybe something can be done
> > in a follow up patch.
>
> Hmm.  I was thinking the problem was introduced in this series, but the
> old code should suffer from this as well. I wondered if it might simply
> be that the old code did not trigger often enough for anybody to notice,
> but we have been running the code in this series for several years at
> GitHub. So probably my reasoning above is sound (but it might still be
> worth addressing).

Ok, I will take a look.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-01-23 22:29 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
2019-12-09  6:24   ` Jeff King
2019-12-11 13:48     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
2019-12-09  6:26   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-12-09  6:28   ` Jeff King
2019-12-11 13:50     ` Christian Couder
2019-12-12  5:45       ` Jeff King
2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-12-09  6:47   ` Jeff King
2019-12-13 13:26     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-12-09  7:06   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
2019-12-09  7:07   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-12-09  7:14   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-12-09  7:14   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
2019-12-09  8:11   ` Jeff King
2019-12-18 11:26     ` Christian Couder
2019-12-19  0:42       ` Jeff King
2020-01-23 22:29         ` Christian Couder
2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
2019-11-25  6:30   ` Junio C Hamano
2019-11-25  6:36     ` Junio C Hamano
2019-12-06 21:42       ` Junio C Hamano
2019-12-07 10:12         ` Christian Couder
2019-12-07 20:47           ` Johannes Schindelin
2019-12-08  7:53             ` Christian Couder
2019-12-08  8:54               ` Johannes Schindelin
2019-12-08 10:26                 ` Christian Couder
2019-12-08 10:45                   ` Johannes Schindelin
2019-12-09  6:18                 ` Jeff King
2019-12-09  9:28                   ` Johannes Schindelin
2019-12-09 19:00                   ` Junio C Hamano
2019-12-09 19:05                   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).