git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 00/10] Rewrite packfile reuse code
@ 2019-09-13 13:02 Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 01/10] builtin/pack-objects: report reused packfile objects Christian Couder
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

This patch series is rewriting the code that tries to reuse existing
packfiles.

The code in this patch series was written by GitHub and Peff nicely
provided it in the following discussion:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

This is an RFC patch series that mostly for now just tries to split
the code into separate commits. If this split is considered ok, then
commit messages will be improved and some doc will be added
(especially doc for pack.allowPackReuse). Perhaps performance test
results will also be provided.

Most of the changes are in the last patch (10/10) and I haven't found
a good way to split them into several patches. Ideas are welcome. In
each of the other preparatory patches there is a small change that
might make sense separately.

According to Peff this new code is a lot smarter than what it
replaces. It allows "holes" in the chunks of packfile to be reused,
and skips over them. It rewrites OFS_DELTA offsets as it goes to
account for the holes. So it's basically a linear walk over the
packfile, but with the important distinction that we don't add those
objects to the object_entry array, which makes them very lightweight
(especially in memory use, but they also aren't considered bases for
finding new deltas, etc). It seems like a good compromise between the
cost to serve a clone and the quality of the resulting packfile.

I have put Peff as the author of all the commits.

Jeff King (10):
  builtin/pack-objects: report reused packfile objects
  packfile: expose get_delta_base()
  ewah/bitmap: introduce bitmap_word_alloc()
  ewah/bitmap: always allocate 2 more words
  pack-bitmap: don't rely on bitmap_git->reuse_objects
  pack-bitmap: introduce bitmap_walk_contains()
  csum-file: introduce hashfile_total()
  pack-objects: introduce pack.allowPackReuse
  builtin/pack-objects: introduce obj_is_packed()
  pack-objects: improve partial packfile reuse

 builtin/pack-objects.c | 248 +++++++++++++++++++++++++++++++++--------
 csum-file.h            |   9 ++
 ewah/bitmap.c          |  13 ++-
 ewah/ewok.h            |   1 +
 pack-bitmap.c          | 178 ++++++++++++++++++++---------
 pack-bitmap.h          |   6 +-
 packfile.c             |  10 +-
 packfile.h             |   3 +
 8 files changed, 358 insertions(+), 110 deletions(-)

-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH 01/10] builtin/pack-objects: report reused packfile objects
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 02/10] packfile: expose get_delta_base() Christian Couder
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	James Ramsay

From: Jeff King <peff@peff.net>

To see when packfile reuse kicks in or not, it is useful to
show reused packfile objects statistics in the output of
upload-pack.

Helped-by: James Ramsay <james@jramsay.com.au>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 76ce906946..c11b2ea8d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3525,7 +3525,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (progress)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
-			     " reused %"PRIu32" (delta %"PRIu32")"),
-			   written, written_delta, reused, reused_delta);
+			     " reused %"PRIu32" (delta %"PRIu32"),"
+			     " pack-reused %"PRIu32),
+			   written, written_delta, reused, reused_delta,
+			   reuse_packfile_objects);
 	return 0;
 }
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 02/10] packfile: expose get_delta_base()
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 01/10] builtin/pack-objects: report reused packfile objects Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 03/10] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

In a following commit get_delta_base() will be used outside
packfile.c, so let's make it non static and declare it in
packfile.h.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 packfile.c | 10 +++++-----
 packfile.h |  3 +++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index fc43a6c52c..5a6e8d54f1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1191,11 +1191,11 @@ const struct packed_git *has_packed_and_bad(struct repository *r,
 	return NULL;
 }
 
-static off_t get_delta_base(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t *curpos,
-				    enum object_type type,
-				    off_t delta_obj_offset)
+off_t get_delta_base(struct packed_git *p,
+		     struct pack_window **w_curs,
+		     off_t *curpos,
+		     enum object_type type,
+		     off_t delta_obj_offset)
 {
 	unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
 	off_t base_offset;
diff --git a/packfile.h b/packfile.h
index 3e98910bdd..8049202f4c 100644
--- a/packfile.h
+++ b/packfile.h
@@ -151,6 +151,9 @@ void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum object
 unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
+off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
+		     off_t *curpos, enum object_type type,
+		     off_t delta_obj_offset);
 
 void release_pack_memory(size_t);
 
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 03/10] ewah/bitmap: introduce bitmap_word_alloc()
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 01/10] builtin/pack-objects: report reused packfile objects Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 02/10] packfile: expose get_delta_base() Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words Christian Couder
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

In a following patch we will need to allocate a variable
number of bitmap words, instead of always 32, so let's add
bitmap_word_alloc() for this purpose.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 ewah/bitmap.c | 11 ++++++++---
 ewah/ewok.h   |  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 52f1178db4..143dc71419 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -22,14 +22,19 @@
 #define EWAH_MASK(x) ((eword_t)1 << (x % BITS_IN_EWORD))
 #define EWAH_BLOCK(x) (x / BITS_IN_EWORD)
 
-struct bitmap *bitmap_new(void)
+struct bitmap *bitmap_word_alloc(size_t word_alloc)
 {
 	struct bitmap *bitmap = xmalloc(sizeof(struct bitmap));
-	bitmap->words = xcalloc(32, sizeof(eword_t));
-	bitmap->word_alloc = 32;
+	bitmap->words = xcalloc(word_alloc, sizeof(eword_t));
+	bitmap->word_alloc = word_alloc;
 	return bitmap;
 }
 
+struct bitmap *bitmap_new(void)
+{
+	return bitmap_word_alloc(32);
+}
+
 void bitmap_set(struct bitmap *self, size_t pos)
 {
 	size_t block = EWAH_BLOCK(pos);
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 84b2a29faa..1b98b57c8b 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -172,6 +172,7 @@ struct bitmap {
 };
 
 struct bitmap *bitmap_new(void);
+struct bitmap *bitmap_word_alloc(size_t word_alloc);
 void bitmap_set(struct bitmap *self, size_t pos);
 int bitmap_get(struct bitmap *self, size_t pos);
 void bitmap_reset(struct bitmap *self);
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (2 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 03/10] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-10-10 23:40   ` Jonathan Tan
  2019-09-13 13:02 ` [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

In a following patch we will allocate a variable number
of words in some bitmaps. When iterating over the words we
will need a mark to tell us when to stop iterating. Let's
always allocate 2 more words, that will always contain 0,
as that mark.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 ewah/bitmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 143dc71419..eac05485f1 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -41,7 +41,7 @@ void bitmap_set(struct bitmap *self, size_t pos)
 
 	if (block >= self->word_alloc) {
 		size_t old_size = self->word_alloc;
-		self->word_alloc = block * 2;
+		self->word_alloc = (block + 1) * 2;
 		REALLOC_ARRAY(self->words, self->word_alloc);
 		memset(self->words + old_size, 0x0,
 			(self->word_alloc - old_size) * sizeof(eword_t));
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (3 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-10-10 23:44   ` Jonathan Tan
  2019-09-13 13:02 ` [RFC PATCH 06/10] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

As we now allocate 2 more words than necessary for each
bitmap to serve as marks telling us that we can stop
iterating over the words, we don't need to rely on
bitmap_git->reuse_objects to stop iterating over the words.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index ed2befaac6..b2422fed8f 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -622,7 +622,7 @@ static void show_objects_for_type(
 	enum object_type object_type,
 	show_reachable_fn show_reach)
 {
-	size_t pos = 0, i = 0;
+	size_t i = 0;
 	uint32_t offset;
 
 	struct ewah_iterator it;
@@ -630,13 +630,15 @@ static void show_objects_for_type(
 
 	struct bitmap *objects = bitmap_git->result;
 
-	if (bitmap_git->reuse_objects == bitmap_git->pack->num_objects)
-		return;
-
 	ewah_iterator_init(&it, type_filter);
 
-	while (i < objects->word_alloc && ewah_iterator_next(&filter, &it)) {
+	for (i = 0; i < objects->word_alloc &&
+			ewah_iterator_next(&filter, &it); i++) {
 		eword_t word = objects->words[i] & filter;
+		size_t pos = (i * BITS_IN_EWORD);
+
+		if (!word)
+			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			struct object_id oid;
@@ -648,9 +650,6 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			if (pos + offset < bitmap_git->reuse_objects)
-				continue;
-
 			entry = &bitmap_git->pack->revindex[pos + offset];
 			nth_packed_object_oid(&oid, bitmap_git->pack, entry->nr);
 
@@ -659,9 +658,6 @@ static void show_objects_for_type(
 
 			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
 		}
-
-		pos += BITS_IN_EWORD;
-		i++;
 	}
 }
 
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 06/10] pack-bitmap: introduce bitmap_walk_contains()
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (4 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 07/10] csum-file: introduce hashfile_total() Christian Couder
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

We will use this helper function in a following patch to
tell us if an object is packed.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 12 ++++++++++++
 pack-bitmap.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index b2422fed8f..1833971dc7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -828,6 +828,18 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	return 0;
 }
 
+int bitmap_walk_contains(struct bitmap_index *bitmap_git,
+			 struct bitmap *bitmap, const struct object_id *oid)
+{
+	int idx;
+
+	if (!bitmap)
+		return 0;
+
+	idx = bitmap_position(bitmap_git, oid);
+	return idx >= 0 && bitmap_get(bitmap, idx);
+}
+
 void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
 				 show_reachable_fn show_reachable)
 {
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 00de3ec8e4..5425767f0f 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -3,6 +3,7 @@
 
 #include "ewah/ewok.h"
 #include "khash.h"
+#include "pack.h"
 #include "pack-objects.h"
 
 struct commit;
@@ -53,6 +54,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
+int bitmap_walk_contains(struct bitmap_index *,
+			 struct bitmap *bitmap, const struct object_id *oid);
 
 /*
  * After a traversal has been performed by prepare_bitmap_walk(), this can be
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 07/10] csum-file: introduce hashfile_total()
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (5 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 06/10] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse Christian Couder
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

We will need this helper function in a following patch
to give us total number of bytes fed to the hashfile so far.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 csum-file.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/csum-file.h b/csum-file.h
index a98b1eee53..f9cbd317fb 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -42,6 +42,15 @@ void hashflush(struct hashfile *f);
 void crc32_begin(struct hashfile *);
 uint32_t crc32_end(struct hashfile *);
 
+/*
+ * Returns the total number of bytes fed to the hashfile so far (including ones
+ * that have not been written out to the descriptor yet).
+ */
+static inline off_t hashfile_total(struct hashfile *f)
+{
+	return f->total + f->offset;
+}
+
 static inline void hashwrite_u8(struct hashfile *f, uint8_t data)
 {
 	hashwrite(f, &data, sizeof(data));
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (6 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 07/10] csum-file: introduce hashfile_total() Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 21:37   ` Junio C Hamano
  2019-09-13 13:02 ` [RFC PATCH 09/10] builtin/pack-objects: introduce obj_is_packed() Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 10/10] pack-objects: improve partial packfile reuse Christian Couder
  9 siblings, 1 reply; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

Let's make it possible to configure if we want pack reuse or not.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c11b2ea8d4..1664969c97 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -96,6 +96,7 @@ static off_t reuse_packfile_offset;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
+static int allow_pack_reuse = 1;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -2719,6 +2720,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		sparse = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.allowpackreuse")) {
+		allow_pack_reuse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
@@ -3063,7 +3068,8 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (!(bitmap_git = prepare_bitmap_walk(revs)))
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (allow_pack_reuse &&
+	    pack_options_allow_reuse() &&
 	    !reuse_partial_packfile_from_bitmap(
 			bitmap_git,
 			&reuse_packfile,
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 09/10] builtin/pack-objects: introduce obj_is_packed()
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (7 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 13:02 ` [RFC PATCH 10/10] pack-objects: improve partial packfile reuse Christian Couder
  9 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

Let's refactor the way we check if an object is packed by
introducing obj_is_packed(). This function is now a simple
wrapper around packlist_find(), but it will evolve in a
following commit.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1664969c97..5ee0b3092d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2569,6 +2569,11 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 	free(p);
 }
 
+static int obj_is_packed(const struct object_id *oid)
+{
+	return !!packlist_find(&to_pack, oid, NULL);
+}
+
 static void add_tag_chain(const struct object_id *oid)
 {
 	struct tag *tag;
@@ -2580,7 +2585,7 @@ static void add_tag_chain(const struct object_id *oid)
 	 * it was included via bitmaps, we would not have parsed it
 	 * previously).
 	 */
-	if (packlist_find(&to_pack, oid, NULL))
+	if (obj_is_packed(oid))
 		return;
 
 	tag = lookup_tag(the_repository, oid);
@@ -2604,7 +2609,7 @@ static int add_ref_tag(const char *path, const struct object_id *oid, int flag,
 
 	if (starts_with(path, "refs/tags/") && /* is a tag? */
 	    !peel_ref(path, &peeled)    && /* peelable? */
-	    packlist_find(&to_pack, &peeled, NULL))      /* object packed? */
+	    obj_is_packed(&peeled)) /* object packed? */
 		add_tag_chain(oid);
 	return 0;
 }
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
                   ` (8 preceding siblings ...)
  2019-09-13 13:02 ` [RFC PATCH 09/10] builtin/pack-objects: introduce obj_is_packed() Christian Couder
@ 2019-09-13 13:02 ` Christian Couder
  2019-09-13 22:29   ` Junio C Hamano
  2019-10-10 23:59   ` Jonathan Tan
  9 siblings, 2 replies; 31+ messages in thread
From: Christian Couder @ 2019-09-13 13:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

From: Jeff King <peff@peff.net>

Let's store the chunks of the packfile that we reuse in
a dynamic array of `struct reused_chunk`, and let's use
a reuse_packfile_bitmap to speed up reusing parts of
packfiles.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 227 +++++++++++++++++++++++++++++++++--------
 pack-bitmap.c          | 150 +++++++++++++++++++--------
 pack-bitmap.h          |   3 +-
 3 files changed, 293 insertions(+), 87 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee0b3092d..6822ac1026 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -92,7 +92,7 @@ static struct progress *progress_state;
 
 static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
-static off_t reuse_packfile_offset;
+static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
@@ -785,57 +785,189 @@ static struct object_entry **compute_write_order(void)
 	return wo;
 }
 
-static off_t write_reused_pack(struct hashfile *f)
+/*
+ * Record the offsets needed in our reused packfile chunks due to
+ * "gaps" where we omitted some objects.
+ */
+static struct reused_chunk {
+	off_t start;
+	off_t offset;
+} *reused_chunks;
+static int reused_chunks_nr;
+static int reused_chunks_alloc;
+
+static void record_reused_object(off_t where, off_t offset)
 {
-	unsigned char buffer[8192];
-	off_t to_write, total;
-	int fd;
+	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].offset == offset)
+		return;
 
-	if (!is_pack_valid(reuse_packfile))
-		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
+	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
+		   reused_chunks_alloc);
+	reused_chunks[reused_chunks_nr].start = where;
+	reused_chunks[reused_chunks_nr].offset = offset;
+	reused_chunks_nr++;
+}
 
-	fd = git_open(reuse_packfile->pack_name);
-	if (fd < 0)
-		die_errno(_("unable to open packfile for reuse: %s"),
-			  reuse_packfile->pack_name);
+/*
+ * Binary search to find the chunk that "where" is in. Note
+ * that we're not looking for an exact match, just the first
+ * chunk that contains it (which implicitly ends at the start
+ * of the next chunk.
+ */
+static off_t find_reused_offset(off_t where)
+{
+	int lo = 0, hi = reused_chunks_nr;
+	while (lo < hi) {
+		int mi = lo + ((hi - lo) / 2);
+		if (where == reused_chunks[mi].start)
+			return reused_chunks[mi].offset;
+		if (where < reused_chunks[mi].start)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
 
-	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
-		die_errno(_("unable to seek in reused packfile"));
+	/*
+	 * The first chunk starts at zero, so we can't have gone below
+	 * there.
+	 */
+	assert(lo);
+	return reused_chunks[lo-1].offset;
+}
 
-	if (reuse_packfile_offset < 0)
-		reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
+static void write_reused_pack_one(size_t pos, struct hashfile *out,
+				  struct pack_window **w_curs)
+{
+	off_t offset, next, cur;
+	enum object_type type;
+	unsigned long size;
 
-	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
+	offset = reuse_packfile->revindex[pos].offset;
+	next = reuse_packfile->revindex[pos + 1].offset;
 
-	while (to_write) {
-		int read_pack = xread(fd, buffer, sizeof(buffer));
+	record_reused_object(offset, offset - hashfile_total(out));
 
-		if (read_pack <= 0)
-			die_errno(_("unable to read from reused packfile"));
+	cur = offset;
+	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
+	assert(type >= 0);
 
-		if (read_pack > to_write)
-			read_pack = to_write;
+	if (type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		off_t fixup;
+
+		unsigned char header[MAX_PACK_OBJECT_HEADER];
+		unsigned len;
+
+		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
+		assert(base_offset != 0);
+
+		/* Convert to REF_DELTA if we must... */
+		if (!allow_ofs_delta) {
+			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			const unsigned char *base_sha1 =
+				nth_packed_object_sha1(reuse_packfile,
+						       reuse_packfile->revindex[base_pos].nr);
+
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_REF_DELTA, size);
+			hashwrite(out, header, len);
+			hashwrite(out, base_sha1, 20);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
 
-		hashwrite(f, buffer, read_pack);
-		to_write -= read_pack;
+		/* Otherwise see if we need to rewrite the offset... */
+		fixup = find_reused_offset(offset) -
+			find_reused_offset(base_offset);
+		if (fixup) {
+			unsigned char ofs_header[10];
+			unsigned i, ofs_len;
+			off_t ofs = offset - base_offset - fixup;
+
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_OFS_DELTA, size);
+
+			i = sizeof(ofs_header) - 1;
+			ofs_header[i] = ofs & 127;
+			while (ofs >>= 7)
+				ofs_header[--i] = 128 | (--ofs & 127);
+
+			ofs_len = sizeof(ofs_header) - i;
+
+			if (0) {
+				off_t expected_size = cur - offset;
+
+				if (len + ofs_len < expected_size) {
+					unsigned max_pad = (len >= 4) ? 9 : 5;
+					header[len - 1] |= 0x80;
+					while (len < max_pad && len + ofs_len < expected_size)
+						header[len++] = 0x80;
+					header[len - 1] &= 0x7F;
+				}
+			}
+
+			hashwrite(out, header, len);
+			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
+
+		/* ...otherwise we have no fixup, and can write it verbatim */
+	}
+
+	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
+}
+
+static size_t write_reused_pack_verbatim(struct hashfile *out,
+					 struct pack_window **w_curs)
+{
+	size_t pos = 0;
+
+	while (pos < reuse_packfile_bitmap->word_alloc &&
+			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+		pos++;
+
+	if (pos) {
+		off_t to_write;
+
+		written = (pos * BITS_IN_EWORD);
+		to_write = reuse_packfile->revindex[written].offset
+			- sizeof(struct pack_header);
+
+		record_reused_object(sizeof(struct pack_header), 0);
+		hashflush(out);
+		copy_pack_data(out, reuse_packfile, w_curs,
+			sizeof(struct pack_header), to_write);
 
-		/*
-		 * We don't know the actual number of objects written,
-		 * only how many bytes written, how many bytes total, and
-		 * how many objects total. So we can fake it by pretending all
-		 * objects we are writing are the same size. This gives us a
-		 * smooth progress meter, and at the end it matches the true
-		 * answer.
-		 */
-		written = reuse_packfile_objects *
-				(((double)(total - to_write)) / total);
 		display_progress(progress_state, written);
 	}
+	return pos;
+}
+
+static void write_reused_pack(struct hashfile *f)
+{
+	size_t i = 0;
+	uint32_t offset;
+	struct pack_window *w_curs = NULL;
+
+	if (allow_ofs_delta)
+		i = write_reused_pack_verbatim(f, &w_curs);
+
+	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
+		eword_t word = reuse_packfile_bitmap->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
+
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			write_reused_pack_one(pos + offset, f, &w_curs);
+			display_progress(progress_state, ++written);
+		}
+	}
 
-	close(fd);
-	written = reuse_packfile_objects;
-	display_progress(progress_state, written);
-	return reuse_packfile_offset - sizeof(struct pack_header);
+	unuse_pack(&w_curs);
 }
 
 static const char no_split_warning[] = N_(
@@ -868,11 +1000,9 @@ static void write_pack_file(void)
 		offset = write_pack_header(f, nr_remaining);
 
 		if (reuse_packfile) {
-			off_t packfile_size;
 			assert(pack_to_stdout);
-
-			packfile_size = write_reused_pack(f);
-			offset += packfile_size;
+			write_reused_pack(f);
+			offset = hashfile_total(f);
 		}
 
 		nr_written = 0;
@@ -1002,6 +1132,10 @@ static int have_duplicate_entry(const struct object_id *oid,
 {
 	struct object_entry *entry;
 
+	if (reuse_packfile_bitmap &&
+	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
+		return 1;
+
 	entry = packlist_find(&to_pack, oid, index_pos);
 	if (!entry)
 		return 0;
@@ -1192,6 +1326,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
 	create_object_entry(oid, type, pack_name_hash(name),
 			    exclude, name && no_try_delta(name),
 			    index_pos, found_pack, found_offset);
+
 	return 1;
 }
 
@@ -2571,7 +2706,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 
 static int obj_is_packed(const struct object_id *oid)
 {
-	return !!packlist_find(&to_pack, oid, NULL);
+	return packlist_find(&to_pack, oid, NULL) ||
+		(reuse_packfile_bitmap &&
+		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
 }
 
 static void add_tag_chain(const struct object_id *oid)
@@ -2677,6 +2814,7 @@ static void prepare_pack(int window, int depth)
 
 	if (nr_deltas && n > 1) {
 		unsigned nr_done = 0;
+
 		if (progress)
 			progress_state = start_progress(_("Compressing objects"),
 							nr_deltas);
@@ -3061,7 +3199,6 @@ static void loosen_unused_packed_objects(void)
 static int pack_options_allow_reuse(void)
 {
 	return pack_to_stdout &&
-	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
 	       (!local || !have_non_local_packs) &&
@@ -3079,7 +3216,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 			bitmap_git,
 			&reuse_packfile,
 			&reuse_packfile_objects,
-			&reuse_packfile_offset)) {
+			&reuse_packfile_bitmap)) {
 		assert(reuse_packfile_objects);
 		nr_result += reuse_packfile_objects;
 		display_progress(progress_state, nr_result);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1833971dc7..1d4c95ebc1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -326,6 +326,13 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 	munmap(bitmap_git->map, bitmap_git->map_size);
 	bitmap_git->map = NULL;
 	bitmap_git->map_size = 0;
+
+	kh_destroy_oid_map(bitmap_git->bitmaps);
+	bitmap_git->bitmaps = NULL;
+
+	kh_destroy_oid_pos(bitmap_git->ext_index.positions);
+	bitmap_git->ext_index.positions = NULL;
+
 	return -1;
 }
 
@@ -766,65 +773,126 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 	return NULL;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       off_t *up_to)
+static void try_partial_reuse(struct bitmap_index *bitmap_git,
+			      size_t pos,
+			      struct bitmap *reuse,
+			      struct pack_window **w_curs)
 {
+	struct revindex_entry *revidx;
+	off_t offset;
+	enum object_type type;
+	unsigned long size;
+
+	if (pos >= bitmap_git->pack->num_objects)
+		return; /* not actually in the pack */
+
+	revidx = &bitmap_git->pack->revindex[pos];
+	offset = revidx->offset;
+	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (type < 0)
+		return; /* broken packfile, punt */
+
+	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		int base_pos;
+
+		/*
+		 * Find the position of the base object so we can look it up
+		 * in our bitmaps. If we can't come up with an offset, or if
+		 * that offset is not in the revidx, the pack is corrupt.
+		 * There's nothing we can do, so just punt on this object,
+		 * and the normal slow path will complain about it in
+		 * more detail.
+		 */
+		base_offset = get_delta_base(bitmap_git->pack, w_curs,
+					     &offset, type, revidx->offset);
+		if (!base_offset)
+			return;
+		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
+		if (base_pos < 0)
+			return;
+
+		/*
+		 * We assume delta dependencies always point backwards. This
+		 * lets us do a single pass, and is basically always true
+		 * due to the way OFS_DELTAs work. You would not typically
+		 * find REF_DELTA in a bitmapped pack, since we only bitmap
+		 * packs we write fresh, and OFS_DELTA is the default). But
+		 * let's double check to make sure the pack wasn't written with
+		 * odd parameters.
+		 */
+		if (base_pos >= pos)
+			return;
+
+		/*
+		 * And finally, if we're not sending the base as part of our
+		 * reuse chunk, then don't send this object either. The base
+		 * would come after us, along with other objects not
+		 * necessarily in the pack, which means we'd need to convert
+		 * to REF_DELTA on the fly. Better to just let the normal
+		 * object_entry code path handle it.
+		 */
+		if (!bitmap_get(reuse, base_pos))
+			return;
+	}
+
 	/*
-	 * Reuse the packfile content if we need more than
-	 * 90% of its objects
+	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	static const double REUSE_PERCENT = 0.9;
+	bitmap_set(reuse, pos);
+}
 
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
 	struct bitmap *result = bitmap_git->result;
-	uint32_t reuse_threshold;
-	uint32_t i, reuse_objects = 0;
+	struct bitmap *reuse;
+	struct pack_window *w_curs = NULL;
+	size_t i = 0;
+	uint32_t offset;
 
 	assert(result);
 
-	for (i = 0; i < result->word_alloc; ++i) {
-		if (result->words[i] != (eword_t)~0) {
-			reuse_objects += ewah_bit_ctz64(~result->words[i]);
-			break;
-		}
-
-		reuse_objects += BITS_IN_EWORD;
-	}
+	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
+		i++;
 
-#ifdef GIT_BITMAP_DEBUG
-	{
-		const unsigned char *sha1;
-		struct revindex_entry *entry;
+	/* Don't mark objects not in the packfile */
+	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
+		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
 
-		entry = &bitmap_git->reverse_index->revindex[reuse_objects];
-		sha1 = nth_packed_object_sha1(bitmap_git->pack, entry->nr);
+	reuse = bitmap_word_alloc(i);
+	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
-		fprintf(stderr, "Failed to reuse at %d (%016llx)\n",
-			reuse_objects, result->words[i]);
-		fprintf(stderr, " %s\n", hash_to_hex(sha1));
-	}
-#endif
+	for (; i < result->word_alloc; ++i) {
+		eword_t word = result->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
 
-	if (!reuse_objects)
-		return -1;
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
 
-	if (reuse_objects >= bitmap_git->pack->num_objects) {
-		bitmap_git->reuse_objects = *entries = bitmap_git->pack->num_objects;
-		*up_to = -1; /* reuse the full pack */
-		*packfile = bitmap_git->pack;
-		return 0;
+			offset += ewah_bit_ctz64(word >> offset);
+			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+		}
 	}
 
-	reuse_threshold = bitmap_popcount(bitmap_git->result) * REUSE_PERCENT;
+	unuse_pack(&w_curs);
 
-	if (reuse_objects < reuse_threshold)
+	*entries = bitmap_popcount(reuse);
+	if (!*entries) {
+		bitmap_free(reuse);
 		return -1;
+	}
 
-	bitmap_git->reuse_objects = *entries = reuse_objects;
-	*up_to = bitmap_git->pack->revindex[reuse_objects].offset;
-	*packfile = bitmap_git->pack;
-
+	/*
+	 * Drop any reused objects from the result, since they will not
+	 * need to be handled separately.
+	 */
+	bitmap_and_not(result, reuse);
+	*packfile_out = bitmap_git->pack;
+	*reuse_out = reuse;
 	return 0;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 5425767f0f..7af7335f2e 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ void test_bitmap_walk(struct rev_info *revs);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
-				       uint32_t *entries, off_t *up_to);
+				       uint32_t *entries,
+				       struct bitmap **bitmap);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.23.0.46.gd213b4aca1.dirty


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse
  2019-09-13 13:02 ` [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse Christian Couder
@ 2019-09-13 21:37   ` Junio C Hamano
  0 siblings, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2019-09-13 21:37 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder, Ramsay Jones

Christian Couder <christian.couder@gmail.com> writes:

> From: Jeff King <peff@peff.net>
>
> Let's make it possible to configure if we want pack reuse or not.
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  builtin/pack-objects.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index c11b2ea8d4..1664969c97 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -96,6 +96,7 @@ static off_t reuse_packfile_offset;
>  
>  static int use_bitmap_index_default = 1;
>  static int use_bitmap_index = -1;
> +static int allow_pack_reuse = 1;
>  static enum {
>  	WRITE_BITMAP_FALSE = 0,
>  	WRITE_BITMAP_QUIET,
> @@ -2719,6 +2720,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  		sparse = git_config_bool(k, v);
>  		return 0;
>  	}
> +	if (!strcmp(k, "pack.allowpackreuse")) {
> +		allow_pack_reuse = git_config_bool(k, v);
> +		return 0;
> +	}
>  	if (!strcmp(k, "pack.threads")) {
>  		delta_search_threads = git_config_int(k, v);
>  		if (delta_search_threads < 0)
> @@ -3063,7 +3068,8 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>  	if (!(bitmap_git = prepare_bitmap_walk(revs)))
>  		return -1;
>  
> -	if (pack_options_allow_reuse() &&
> +	if (allow_pack_reuse &&
> +	    pack_options_allow_reuse() &&

It somehow looks strange to have this code check for both
allow_pack_reuse and pack_options_allow_reuse().  I would have
expected that the referene to the new variable to be contained in
the existing helper function, so that any future code that wants to
ask "are we allowed to reuse?" would get the same answer from the
helper consistently, without having to check the new variable.

>  	    !reuse_partial_packfile_from_bitmap(
>  			bitmap_git,
>  			&reuse_packfile,

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-13 13:02 ` [RFC PATCH 10/10] pack-objects: improve partial packfile reuse Christian Couder
@ 2019-09-13 22:29   ` Junio C Hamano
  2019-09-14  2:02     ` Jeff King
  2019-10-10 23:59   ` Jonathan Tan
  1 sibling, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2019-09-13 22:29 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder, Ramsay Jones

Christian Couder <christian.couder@gmail.com> writes:

> +/*
> + * Record the offsets needed in our reused packfile chunks due to
> + * "gaps" where we omitted some objects.
> + */

The meaning of 'start' and 'offset' is unclear from the first
reading.  Is it "starting offset" and "for how many bytes the region
lasts"?  If so, 'offset', which is usually a location (unless you
always measure from the beginning, in which case you could say it
names the byte-size of a region), may be a misnomer (side note: I'll
pretend that I haven't realized the 'offset' may be the end offset
of a region for now---this is a good illustration why a better
comment should be here anyway ;-).

> +static struct reused_chunk {
> +	off_t start;
> +	off_t offset;
> +} *reused_chunks;
> +static int reused_chunks_nr;
> +static int reused_chunks_alloc;
> +
> +static void record_reused_object(off_t where, off_t offset)

And here, 'start' is called 'where'; either is a good word for a
location; we'd want to pick either one to be consistent, perhaps?

>  {
> -	unsigned char buffer[8192];
> -	off_t to_write, total;
> -	int fd;
> +	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].offset == offset)
> +		return;

The reason why the above code works is because this function will
always be called in the ascending order of the 'offset'?

Hmmm, perhaps 'offset' is not a region-size after all.  Is it the
end offset, as opposed to 'start' which is the starting offset, and
the two offsets sandwitch a region?

> -	if (!is_pack_valid(reuse_packfile))
> -		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
> +	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
> +		   reused_chunks_alloc);
> +	reused_chunks[reused_chunks_nr].start = where;
> +	reused_chunks[reused_chunks_nr].offset = offset;
> +	reused_chunks_nr++;
> +}
>  
> -	fd = git_open(reuse_packfile->pack_name);
> -	if (fd < 0)
> -		die_errno(_("unable to open packfile for reuse: %s"),
> -			  reuse_packfile->pack_name);
> +/*
> + * Binary search to find the chunk that "where" is in. Note
> + * that we're not looking for an exact match, just the first
> + * chunk that contains it (which implicitly ends at the start
> + * of the next chunk.
> + */
> +static off_t find_reused_offset(off_t where)
> +{
> +	int lo = 0, hi = reused_chunks_nr;
> +	while (lo < hi) {
> +		int mi = lo + ((hi - lo) / 2);
> +		if (where == reused_chunks[mi].start)
> +			return reused_chunks[mi].offset;
> +		if (where < reused_chunks[mi].start)
> +			hi = mi;
> +		else
> +			lo = mi + 1;
> +	}
>  
> -	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
> -		die_errno(_("unable to seek in reused packfile"));
> +	/*
> +	 * The first chunk starts at zero, so we can't have gone below
> +	 * there.
> +	 */
> +	assert(lo);
> +	return reused_chunks[lo-1].offset;
> +}

This comment has nothing to do with the change, but the way the
patch is presented is quite hard to follow, in that the preimage or
the common context lines do not help understand what the new code is
doing at all ;-)

I'll come back to the remainder of the patch later.  Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-13 22:29   ` Junio C Hamano
@ 2019-09-14  2:02     ` Jeff King
  2019-09-14  3:06       ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2019-09-14  2:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git, Christian Couder, Ramsay Jones

On Fri, Sep 13, 2019 at 03:29:00PM -0700, Junio C Hamano wrote:

> This comment has nothing to do with the change, but the way the
> patch is presented is quite hard to follow, in that the preimage or
> the common context lines do not help understand what the new code is
> doing at all ;-)
> 
> I'll come back to the remainder of the patch later.  Thanks.

I applaud Christian's effort to tease it out into separate patches. I
think I may need to take a pass and try to explain what's going on in
some of them (which will definitely involve paging in some ancient work
mentally, but hopefully between my memory and general familiarity with
the area, I can massage it into something a bit more understandable).

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-14  2:02     ` Jeff King
@ 2019-09-14  3:06       ` Junio C Hamano
  2019-10-02 15:57         ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2019-09-14  3:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git, Christian Couder, Ramsay Jones

Jeff King <peff@peff.net> writes:

> On Fri, Sep 13, 2019 at 03:29:00PM -0700, Junio C Hamano wrote:
>
>> This comment has nothing to do with the change, but the way the
>> patch is presented is quite hard to follow, in that the preimage or
>> the common context lines do not help understand what the new code is
>> doing at all ;-)
>> 
>> I'll come back to the remainder of the patch later.  Thanks.
>
> I applaud Christian's effort to tease it out into separate patches.

Ah, no question about it.  I have a suspicion that 10/10 alone may
still be a bit too large a ball of wax, but with all the earlier
preparatory steps are bite-sized and trivial to see how they are
correct.

The "way the patch is presented" comment was not at all about what
Christian did, but was about what the diff machinery computed when
comparing the 9th step Christian created and the final step.  In its
attempt to find and align common context lines, it ended up finding
blank lines and almost nothing else in the earlier part of the
patch, not just making it harder to read the new helper function
(i.e. the best way to read record_reused_object(), for example, is
to look only at '+' and ' ' lines, because the '-' lines are
irrelevant), it also made it hard to see what got discarded.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-14  3:06       ` Junio C Hamano
@ 2019-10-02 15:57         ` Jeff King
  2019-10-03  2:06           ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2019-10-02 15:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git, Christian Couder, Ramsay Jones

On Fri, Sep 13, 2019 at 08:06:01PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Fri, Sep 13, 2019 at 03:29:00PM -0700, Junio C Hamano wrote:
> >
> >> This comment has nothing to do with the change, but the way the
> >> patch is presented is quite hard to follow, in that the preimage or
> >> the common context lines do not help understand what the new code is
> >> doing at all ;-)
> >> 
> >> I'll come back to the remainder of the patch later.  Thanks.
> >
> > I applaud Christian's effort to tease it out into separate patches.
> 
> Ah, no question about it.  I have a suspicion that 10/10 alone may
> still be a bit too large a ball of wax, but with all the earlier
> preparatory steps are bite-sized and trivial to see how they are
> correct.
> 
> The "way the patch is presented" comment was not at all about what
> Christian did, but was about what the diff machinery computed when
> comparing the 9th step Christian created and the final step.  In its
> attempt to find and align common context lines, it ended up finding
> blank lines and almost nothing else in the earlier part of the
> patch, not just making it harder to read the new helper function
> (i.e. the best way to read record_reused_object(), for example, is
> to look only at '+' and ' ' lines, because the '-' lines are
> irrelevant), it also made it hard to see what got discarded.

Hmm, I see the early parts of this graduated to 'next'. I'm not sure
everything there is completely correct, though. E.g. I'm not sure of the
reasoning in df75281e78 (ewah/bitmap: always allocate 2 more words,
2019-09-13).

I think the "block+1" there is actually because "block" might be "0".
Prior to 2820ed171a (ewah/bitmap: introduce bitmap_word_alloc(),
2019-09-13) from the same series, that could never be the case because
we know that we always start with at least 32 allocated words. But after
that we _could_ start with an empty word array, and allocating "block *
2" would not make forward progress.

And then the "2 more words" thing is used as justification in the next
patch, 04a32d357f (pack-bitmap: don't rely on bitmap_git->reuse_objects,
2019-09-13). I won't say that there isn't some subtle dependency there,
but I certainly don't remember any logic like that at all. ;) So I think
it might bear a little more scrutiny.

I'm sorry for being so slow on giving it a more careful review. I was
traveling for work, then playing catch-up, and am now going on vacation.
So it might be a little while yet.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-02 15:57         ` Jeff King
@ 2019-10-03  2:06           ` Junio C Hamano
  2019-10-03  6:55             ` Christian Couder
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2019-10-03  2:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git, Christian Couder, Ramsay Jones

Jeff King <peff@peff.net> writes:

> Hmm, I see the early parts of this graduated to 'next'. I'm not sure
> everything there is completely correct, though. E.g. I'm not sure of the
> reasoning in df75281e78 (ewah/bitmap: always allocate 2 more words,
> 2019-09-13).
> ...
> I'm sorry for being so slow on giving it a more careful review. I was
> traveling for work, then playing catch-up, and am now going on vacation.
> So it might be a little while yet.

Thanks for a status update.  I do not mind moving this topic much
slower than other topics at all (if somebody is actively working on
it, I do not even mind reverting the merge and requeuing an updated
series, but I do not think that is the case here).  It would give me
much more confidence in the topic if we can collectively promise
ourselves that we'll give it a good review before we let it graduate
to 'master'.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-03  2:06           ` Junio C Hamano
@ 2019-10-03  6:55             ` Christian Couder
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-10-03  6:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Christian Couder, Ramsay Jones

On Thu, Oct 3, 2019 at 4:06 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > Hmm, I see the early parts of this graduated to 'next'. I'm not sure
> > everything there is completely correct, though. E.g. I'm not sure of the
> > reasoning in df75281e78 (ewah/bitmap: always allocate 2 more words,
> > 2019-09-13).

Yeah, when I prepared the series I wondered why we allocate 2 more
words instead of just 1 more, but I forgot to ask that when sending
it.

> > I'm sorry for being so slow on giving it a more careful review. I was
> > traveling for work, then playing catch-up, and am now going on vacation.
> > So it might be a little while yet.
>
> Thanks for a status update.  I do not mind moving this topic much
> slower than other topics at all (if somebody is actively working on
> it, I do not even mind reverting the merge and requeuing an updated
> series, but I do not think that is the case here).

I think the series requires at least documenting pack.allowPackReuse
which is introduced in d35b73c5e9 (pack-objects: introduce
pack.allowPackReuse, 2019-09-13). I was planning to send an additional
patch to do that, but if you prefer I can add the documentation to the
same commit that introduce the config variable and resend everything.

> It would give me
> much more confidence in the topic if we can collectively promise
> ourselves that we'll give it a good review before we let it graduate
> to 'master'.

Yeah, a review from Peff could be especially insightful as the code
comes from GitHub.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words
  2019-09-13 13:02 ` [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words Christian Couder
@ 2019-10-10 23:40   ` Jonathan Tan
  2019-10-11  7:49     ` Christian Couder
  0 siblings, 1 reply; 31+ messages in thread
From: Jonathan Tan @ 2019-10-10 23:40 UTC (permalink / raw)
  To: christian.couder; +Cc: git, gitster, peff, chriscool, ramsay, Jonathan Tan

> From: Jeff King <peff@peff.net>
> 
> In a following patch we will allocate a variable number
> of words in some bitmaps. When iterating over the words we
> will need a mark to tell us when to stop iterating. Let's
> always allocate 2 more words, that will always contain 0,
> as that mark.

[snip]

>  	if (block >= self->word_alloc) {
>  		size_t old_size = self->word_alloc;
> -		self->word_alloc = block * 2;
> +		self->word_alloc = (block + 1) * 2;
>  		REALLOC_ARRAY(self->words, self->word_alloc);
>  		memset(self->words + old_size, 0x0,
>  			(self->word_alloc - old_size) * sizeof(eword_t));

This patch set was mentioned as needing more thorough review in "What's
Cooking" [1], so I thought I'd give it a try. As Peff said [2], the
justification in the commit message looks incorrect. He suggests that it
is most likely because "block" might be 0 (which is possible because a
previous patch eliminated the minimum of 32), which makes sense to me.

In any case, the next patch does not use 0 as a sentinel mark. Iteration
stops when word_alloc is reached anyway, and since this is a regular
bitmap, 0 is a valid word and cannot be used as a sentinel. (Maybe 0 is
a valid word in a compressed EWAH bitmap too...not sure about that.)

I think this should be squashed with patch 3, adding to that commit
message "since word_alloc might be 0, we need to change the growth
function". (Or just make the minimum word_alloc be 1 or 32 or something
positive, if that's possible.)

[1] https://public-inbox.org/git/xmqq36g5444k.fsf@gitster-ct.c.googlers.com/
[2] https://public-inbox.org/git/20191002155721.GD6116@sigill.intra.peff.net/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-09-13 13:02 ` [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
@ 2019-10-10 23:44   ` Jonathan Tan
  2019-10-11  7:50     ` Christian Couder
  0 siblings, 1 reply; 31+ messages in thread
From: Jonathan Tan @ 2019-10-10 23:44 UTC (permalink / raw)
  To: christian.couder; +Cc: git, gitster, peff, chriscool, ramsay, Jonathan Tan

> As we now allocate 2 more words than necessary for each
> bitmap to serve as marks telling us that we can stop
> iterating over the words, we don't need to rely on
> bitmap_git->reuse_objects to stop iterating over the words.

As Peff states [1], this justification is probably incorrect as well.
The actual justification seems to be that we will no longer compute
reuse_objects (in a future patch), so we cannot rely on it anymore to
terminate the loop early; we have to iterate to the end.

[1] https://public-inbox.org/git/20191002155721.GD6116@sigill.intra.peff.net/

> @@ -622,7 +622,7 @@ static void show_objects_for_type(
>  	enum object_type object_type,
>  	show_reachable_fn show_reach)
>  {
> -	size_t pos = 0, i = 0;
> +	size_t i = 0;
>  	uint32_t offset;
>  
>  	struct ewah_iterator it;
> @@ -630,13 +630,15 @@ static void show_objects_for_type(
>  
>  	struct bitmap *objects = bitmap_git->result;
>  
> -	if (bitmap_git->reuse_objects == bitmap_git->pack->num_objects)
> -		return;
> -
>  	ewah_iterator_init(&it, type_filter);
>  
> -	while (i < objects->word_alloc && ewah_iterator_next(&filter, &it)) {
> +	for (i = 0; i < objects->word_alloc &&
> +			ewah_iterator_next(&filter, &it); i++) {
>  		eword_t word = objects->words[i] & filter;
> +		size_t pos = (i * BITS_IN_EWORD);
> +
> +		if (!word)
> +			continue;

Here, iteration is not terminated when we see a 0. We just proceed to
the next one.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-09-13 13:02 ` [RFC PATCH 10/10] pack-objects: improve partial packfile reuse Christian Couder
  2019-09-13 22:29   ` Junio C Hamano
@ 2019-10-10 23:59   ` Jonathan Tan
  2019-10-11  7:39     ` Christian Couder
  2019-10-11 18:01     ` Jeff King
  1 sibling, 2 replies; 31+ messages in thread
From: Jonathan Tan @ 2019-10-10 23:59 UTC (permalink / raw)
  To: christian.couder; +Cc: git, gitster, peff, chriscool, ramsay, Jonathan Tan

I'm going to start with pack-bitmap.h, then builtin/pack-objects.c.

>  int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
>  				       struct packed_git **packfile,
> -				       uint32_t *entries, off_t *up_to);
> +				       uint32_t *entries,
> +				       struct bitmap **bitmap);

Makes sense. The existing code determines if the given packfile is
suitable, and if yes, the span of the packfile to use. With this patch,
we resolve to use the given packfile, and want to compute which objects
to use, storing the computation result in an uncompressed bitmap.
(Resolving to use the given packfile is not that much different from
existing behavior, as far as I know, since we currently only consider
one packfile at most anyway.)

So replacing the out param makes sense, although a more descriptive name
than "bitmap" would be nice.

Skipping pack-bitmaps.c for now.

OK, builtin/pack-objects.c.

> +/*
> + * Record the offsets needed in our reused packfile chunks due to
> + * "gaps" where we omitted some objects.
> + */
> +static struct reused_chunk {
> +	off_t start;
> +	off_t offset;
> +} *reused_chunks;
> +static int reused_chunks_nr;
> +static int reused_chunks_alloc;

This makes sense - offsets may be different when we omit objects from
the packfile. I think this can be computed by calculating the number of
zero bits between the current object's index and the nth object prior
(where n is the offset) in the bitmap resulting from
reuse_partial_packfile_from_bitmap() above, thus eliminating the need
for this array, but I haven't tested it.

> +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> +				  struct pack_window **w_curs)
> +{
> +	off_t offset, next, cur;
> +	enum object_type type;
> +	unsigned long size;
>  
> +	offset = reuse_packfile->revindex[pos].offset;
> +	next = reuse_packfile->revindex[pos + 1].offset;
>  
> +	record_reused_object(offset, offset - hashfile_total(out));
>  
> +	cur = offset;
> +	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> +	assert(type >= 0);
>  
> +	if (type == OBJ_OFS_DELTA) {
> +		off_t base_offset;
> +		off_t fixup;
> +
> +		unsigned char header[MAX_PACK_OBJECT_HEADER];
> +		unsigned len;
> +
> +		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
> +		assert(base_offset != 0);
> +
> +		/* Convert to REF_DELTA if we must... */
> +		if (!allow_ofs_delta) {
> +			int base_pos = find_revindex_position(reuse_packfile, base_offset);
> +			const unsigned char *base_sha1 =
> +				nth_packed_object_sha1(reuse_packfile,
> +						       reuse_packfile->revindex[base_pos].nr);
> +
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_REF_DELTA, size);
> +			hashwrite(out, header, len);
> +			hashwrite(out, base_sha1, 20);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
>  
> +		/* Otherwise see if we need to rewrite the offset... */
> +		fixup = find_reused_offset(offset) -
> +			find_reused_offset(base_offset);
> +		if (fixup) {
> +			unsigned char ofs_header[10];
> +			unsigned i, ofs_len;
> +			off_t ofs = offset - base_offset - fixup;
> +
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_OFS_DELTA, size);
> +
> +			i = sizeof(ofs_header) - 1;
> +			ofs_header[i] = ofs & 127;
> +			while (ofs >>= 7)
> +				ofs_header[--i] = 128 | (--ofs & 127);
> +
> +			ofs_len = sizeof(ofs_header) - i;
> +
> +			if (0) {
> +				off_t expected_size = cur - offset;
> +
> +				if (len + ofs_len < expected_size) {
> +					unsigned max_pad = (len >= 4) ? 9 : 5;
> +					header[len - 1] |= 0x80;
> +					while (len < max_pad && len + ofs_len < expected_size)
> +						header[len++] = 0x80;
> +					header[len - 1] &= 0x7F;
> +				}
> +			}

This if(0) should be deleted.

> +
> +			hashwrite(out, header, len);
> +			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
> +
> +		/* ...otherwise we have no fixup, and can write it verbatim */
> +	}
> +
> +	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
> +}

I didn't look into detail, but this looks like it's writing a single
object. In particular, it can convert OFS into REF, something that the
existing code cannot do.

> +static size_t write_reused_pack_verbatim(struct hashfile *out,
> +					 struct pack_window **w_curs)
> +{
> +	size_t pos = 0;
> +
> +	while (pos < reuse_packfile_bitmap->word_alloc &&
> +			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> +		pos++;
> +
> +	if (pos) {
> +		off_t to_write;
> +
> +		written = (pos * BITS_IN_EWORD);
> +		to_write = reuse_packfile->revindex[written].offset
> +			- sizeof(struct pack_header);
> +
> +		record_reused_object(sizeof(struct pack_header), 0);
> +		hashflush(out);
> +		copy_pack_data(out, reuse_packfile, w_curs,
> +			sizeof(struct pack_header), to_write);
>  
>  		display_progress(progress_state, written);
>  	}
> +	return pos;
> +}

I presume this optimization is needed for the case where we are using
*all* objects of a packfile, as is typical during a clone.

> +static void write_reused_pack(struct hashfile *f)
> +{
> +	size_t i = 0;
> +	uint32_t offset;
> +	struct pack_window *w_curs = NULL;
> +
> +	if (allow_ofs_delta)
> +		i = write_reused_pack_verbatim(f, &w_curs);
> +
> +	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
> +		eword_t word = reuse_packfile_bitmap->words[i];
> +		size_t pos = (i * BITS_IN_EWORD);
> +
> +		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
> +			if ((word >> offset) == 0)
> +				break;
> +
> +			offset += ewah_bit_ctz64(word >> offset);
> +			write_reused_pack_one(pos + offset, f, &w_curs);
> +			display_progress(progress_state, ++written);
> +		}
> +	}
>  
> +	unuse_pack(&w_curs);
>  }

I didn't look into detail, but it looks like a straightforward loop.

> @@ -1002,6 +1132,10 @@ static int have_duplicate_entry(const struct object_id *oid,
>  {
>  	struct object_entry *entry;
>  
> +	if (reuse_packfile_bitmap &&
> +	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> +		return 1;

Hmm...why did we previously not need to check the reuse information, but
we do now? I gave the code a cursory glance but couldn't find the
answer.

> @@ -2571,7 +2706,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
>  
>  static int obj_is_packed(const struct object_id *oid)
>  {
> -	return !!packlist_find(&to_pack, oid, NULL);
> +	return packlist_find(&to_pack, oid, NULL) ||
> +		(reuse_packfile_bitmap &&
> +		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));

Same question here - why do we need to check the reuse information?

> @@ -3061,7 +3199,6 @@ static void loosen_unused_packed_objects(void)
>  static int pack_options_allow_reuse(void)
>  {
>  	return pack_to_stdout &&
> -	       allow_ofs_delta &&
>  	       !ignore_packed_keep_on_disk &&
>  	       !ignore_packed_keep_in_core &&
>  	       (!local || !have_non_local_packs) &&

Relaxing of the allow_ofs_delta condition - makes sense given (as above)
we can convert OFS to REF.

Overall I think that this makes sense, except for my unanswered
questions (and the presence of reused_chunk).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-10 23:59   ` Jonathan Tan
@ 2019-10-11  7:39     ` Christian Couder
  2019-10-11 18:01     ` Jeff King
  1 sibling, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-10-11  7:39 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

On Fri, Oct 11, 2019 at 1:59 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> I'm going to start with pack-bitmap.h, then builtin/pack-objects.c.
>
> >  int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
> >                                      struct packed_git **packfile,
> > -                                    uint32_t *entries, off_t *up_to);
> > +                                    uint32_t *entries,
> > +                                    struct bitmap **bitmap);
>
> Makes sense. The existing code determines if the given packfile is
> suitable, and if yes, the span of the packfile to use. With this patch,
> we resolve to use the given packfile, and want to compute which objects
> to use, storing the computation result in an uncompressed bitmap.
> (Resolving to use the given packfile is not that much different from
> existing behavior, as far as I know, since we currently only consider
> one packfile at most anyway.)
>
> So replacing the out param makes sense, although a more descriptive name
> than "bitmap" would be nice.

Yeah, in pack-bitmap.c this argument is called "reuse_out", so the
same name could be used in pack-bitmap.c too. I changed that on my
current version.

> > +/*
> > + * Record the offsets needed in our reused packfile chunks due to
> > + * "gaps" where we omitted some objects.
> > + */
> > +static struct reused_chunk {
> > +     off_t start;
> > +     off_t offset;
> > +} *reused_chunks;
> > +static int reused_chunks_nr;
> > +static int reused_chunks_alloc;
>
> This makes sense - offsets may be different when we omit objects from
> the packfile. I think this can be computed by calculating the number of
> zero bits between the current object's index and the nth object prior
> (where n is the offset) in the bitmap resulting from
> reuse_partial_packfile_from_bitmap() above, thus eliminating the need
> for this array, but I haven't tested it.

Thanks for the idea. I will let others comment on that though before
maybe trying to implement it. I think it could come as a further
optimization in a following patch if it makes things faster or reduce
memory usage.

> > +                     if (0) {
> > +                             off_t expected_size = cur - offset;
> > +
> > +                             if (len + ofs_len < expected_size) {
> > +                                     unsigned max_pad = (len >= 4) ? 9 : 5;
> > +                                     header[len - 1] |= 0x80;
> > +                                     while (len < max_pad && len + ofs_len < expected_size)
> > +                                             header[len++] = 0x80;
> > +                                     header[len - 1] &= 0x7F;
> > +                             }
> > +                     }
>
> This if(0) should be deleted.

Yeah, good eyes. I removed it on my current version.

> > @@ -1002,6 +1132,10 @@ static int have_duplicate_entry(const struct object_id *oid,
> >  {
> >       struct object_entry *entry;
> >
> > +     if (reuse_packfile_bitmap &&
> > +         bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> > +             return 1;
>
> Hmm...why did we previously not need to check the reuse information, but
> we do now? I gave the code a cursory glance but couldn't find the
> answer.
>
> > @@ -2571,7 +2706,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
> >
> >  static int obj_is_packed(const struct object_id *oid)
> >  {
> > -     return !!packlist_find(&to_pack, oid, NULL);
> > +     return packlist_find(&to_pack, oid, NULL) ||
> > +             (reuse_packfile_bitmap &&
> > +              bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
>
> Same question here - why do we need to check the reuse information?

Maybe a reuse bitmap makes it cheap enough to check that information?

Thank you for the review,
Christian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words
  2019-10-10 23:40   ` Jonathan Tan
@ 2019-10-11  7:49     ` Christian Couder
  2019-10-11 18:05       ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Couder @ 2019-10-11  7:49 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

On Fri, Oct 11, 2019 at 1:40 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > In a following patch we will allocate a variable number
> > of words in some bitmaps. When iterating over the words we
> > will need a mark to tell us when to stop iterating. Let's
> > always allocate 2 more words, that will always contain 0,
> > as that mark.
>
> [snip]
>
> >       if (block >= self->word_alloc) {
> >               size_t old_size = self->word_alloc;
> > -             self->word_alloc = block * 2;
> > +             self->word_alloc = (block + 1) * 2;
> >               REALLOC_ARRAY(self->words, self->word_alloc);
> >               memset(self->words + old_size, 0x0,
> >                       (self->word_alloc - old_size) * sizeof(eword_t));
>
> This patch set was mentioned as needing more thorough review in "What's
> Cooking" [1], so I thought I'd give it a try.

Thanks!

> As Peff said [2], the
> justification in the commit message looks incorrect. He suggests that it
> is most likely because "block" might be 0 (which is possible because a
> previous patch eliminated the minimum of 32), which makes sense to me.

Ok I will try to come up with a better justification, though Peff said
that he would took another look at this series and I'd rather wait
until he has done that.

> In any case, the next patch does not use 0 as a sentinel mark. Iteration
> stops when word_alloc is reached anyway, and since this is a regular
> bitmap, 0 is a valid word and cannot be used as a sentinel. (Maybe 0 is
> a valid word in a compressed EWAH bitmap too...not sure about that.)

Yeah I misread this. Hopefully Peff can shed some light on this.

> I think this should be squashed with patch 3, adding to that commit
> message "since word_alloc might be 0, we need to change the growth
> function". (Or just make the minimum word_alloc be 1 or 32 or something
> positive, if that's possible.)

Yeah, thank you for the suggestion. I still wonder why 2 is added
instead of just 1 though.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-10-10 23:44   ` Jonathan Tan
@ 2019-10-11  7:50     ` Christian Couder
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Couder @ 2019-10-11  7:50 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones

On Fri, Oct 11, 2019 at 1:44 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > As we now allocate 2 more words than necessary for each
> > bitmap to serve as marks telling us that we can stop
> > iterating over the words, we don't need to rely on
> > bitmap_git->reuse_objects to stop iterating over the words.
>
> As Peff states [1], this justification is probably incorrect as well.
> The actual justification seems to be that we will no longer compute
> reuse_objects (in a future patch), so we cannot rely on it anymore to
> terminate the loop early; we have to iterate to the end.
>
> [1] https://public-inbox.org/git/20191002155721.GD6116@sigill.intra.peff.net/

Ok, thanks! I will change the description.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-10 23:59   ` Jonathan Tan
  2019-10-11  7:39     ` Christian Couder
@ 2019-10-11 18:01     ` Jeff King
  2019-10-11 21:04       ` Jonathan Tan
  2019-10-12  0:04       ` Junio C Hamano
  1 sibling, 2 replies; 31+ messages in thread
From: Jeff King @ 2019-10-11 18:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: christian.couder, git, gitster, chriscool, ramsay

On Thu, Oct 10, 2019 at 04:59:52PM -0700, Jonathan Tan wrote:

> > +/*
> > + * Record the offsets needed in our reused packfile chunks due to
> > + * "gaps" where we omitted some objects.
> > + */
> > +static struct reused_chunk {
> > +	off_t start;
> > +	off_t offset;
> > +} *reused_chunks;
> > +static int reused_chunks_nr;
> > +static int reused_chunks_alloc;
> 
> This makes sense - offsets may be different when we omit objects from
> the packfile. I think this can be computed by calculating the number of
> zero bits between the current object's index and the nth object prior
> (where n is the offset) in the bitmap resulting from
> reuse_partial_packfile_from_bitmap() above, thus eliminating the need
> for this array, but I haven't tested it.

You need to know not just the number of zero bits, but the accumulated
offset due to those missing objects. So you'd end up having to walk over
the revindex for that set of objects. This array is basically caching
those accumulated offsets (for the parts we _do_ include) so we don't
have to compute them repeatedly.

There's also a more subtle issue with entry sizes; see below.

> > +			if (0) {
> > +				off_t expected_size = cur - offset;
> > +
> > +				if (len + ofs_len < expected_size) {
> > +					unsigned max_pad = (len >= 4) ? 9 : 5;
> > +					header[len - 1] |= 0x80;
> > +					while (len < max_pad && len + ofs_len < expected_size)
> > +						header[len++] = 0x80;
> > +					header[len - 1] &= 0x7F;
> > +				}
> > +			}
> 
> This if(0) should be deleted.

Yeah, definitely. I had to scratch my head at what this code was doing.
IIRC, the issue is this:

  - imagine we have a sequence of objects in a pack, A B C D, but we
    want to generate an output pack that contains just A C D

  - imagine C is a delta against A. We adjust its delta base offset to
    account for the absence of B

  - because the offset is stored in a variable-length encoding, that may mean
    that the entry for C gets shorter!

  - now imagine D is a delta against A, as well. We have to account for
    the missing B, but also for the shrunken C.

The current code does so by creating a new entry in the reused_chunks
array. In the worst case that can grow to have the same number of
entries as we have objects. So this code was an attempt to pad the
header of a shrunken entry to keep it the same size. I don't remember
all the problems we ran into with that, but in the end we found that it
didn't actually help much, because in practice we don't end up with a
lot of chunks anyway.

For instance, packing torvalds/linux on our servers just now reused 6.5M
objects, but only needed ~50k chunks.

> > +	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
> > +}
> 
> I didn't look into detail, but this looks like it's writing a single
> object. In particular, it can convert OFS into REF, something that the
> existing code cannot do.

Right, that was an easy thing to do once we started actually walking
over the set of objects. The old code just tried to dump a whole segment
of the pack verbatim. That's faster than the traditional way of actually
adding objects to the packing list, but it didn't kick in very often.
This new code is really going for a middle ground: do _some_ per-object
work, but way less than we'd traditionally do.

By the way, that's another possible solution for the offset problem:
convert everything after the first gap to REF_DELTA. But the resulting
pack is larger and slightly less efficient for the client to use. And I
don't know that it's actually cheaper to generate than just adjusting
the offset (for each OFS_DELTA, you have to compute the offset of its
base and then do a revindex lookup to find the actual sha1 to write).

> > +static size_t write_reused_pack_verbatim(struct hashfile *out,
> > +					 struct pack_window **w_curs)
> > +{
> > +	size_t pos = 0;
> > +
> > +	while (pos < reuse_packfile_bitmap->word_alloc &&
> > +			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> > +		pos++;
> > +
> > +	if (pos) {
> > +		off_t to_write;
> > +
> > +		written = (pos * BITS_IN_EWORD);
> > +		to_write = reuse_packfile->revindex[written].offset
> > +			- sizeof(struct pack_header);
> > +
> > +		record_reused_object(sizeof(struct pack_header), 0);
> > +		hashflush(out);
> > +		copy_pack_data(out, reuse_packfile, w_curs,
> > +			sizeof(struct pack_header), to_write);
> >  
> >  		display_progress(progress_state, written);
> >  	}
> > +	return pos;
> > +}
> 
> I presume this optimization is needed for the case where we are using
> *all* objects of a packfile, as is typical during a clone.

It's not strictly needed, but yeah. If we know we're sending the first N
objects of the pack, then we don't even have to look at them. We can
just dump the bytes, and start our per-object traversal at N+1. That
should make it just as fast as the original pack-reuse code for this
case.

> > @@ -1002,6 +1132,10 @@ static int have_duplicate_entry(const struct object_id *oid,
> >  {
> >  	struct object_entry *entry;
> >  
> > +	if (reuse_packfile_bitmap &&
> > +	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> > +		return 1;
> 
> Hmm...why did we previously not need to check the reuse information, but
> we do now? I gave the code a cursory glance but couldn't find the
> answer.

I think the original code may simply have been buggy and nobody noticed.
Here's what I wrote when this line was added in our fork:

      pack-objects: check reused pack bitmap for duplicate objects
  
      If a client both asks for a tag by sha1 and specifies
      "include-tag", we may end up including the tag in the reuse
      bitmap (due to the first thing), and then later adding it to
      the packlist (due to the second). This results in duplicate
      objects in the pack, which git chokes on. We should notice
      that we are already including it when doing the include-tag
      portion, and avoid adding it to the packlist.
  
      The simplest place to fix this is right in add_ref_tag,
      where we could avoid peeling the tag at all if we know that
      we are already including it. However, I've pushed the check
      instead into have_duplicate_entry. This fixes not only this
      case, but also means that we cannot have any similar
      problems lurking in other code.
  
      No tests, because git does not actually exhibit this "ask
      for it and also include-tag" behavior. We do one or the
      other on clone, depending on whether --single-branch is set.
      However, libgit2 does both.

I'm not sure why we didn't notice it with the older reuse code. It may
simply have been that it hardly ever triggered on our servers.

> >  static int obj_is_packed(const struct object_id *oid)
> >  {
> > -	return !!packlist_find(&to_pack, oid, NULL);
> > +	return packlist_find(&to_pack, oid, NULL) ||
> > +		(reuse_packfile_bitmap &&
> > +		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
> 
> Same question here - why do we need to check the reuse information?

This was added earlier than the call above. It's meant to handle
include-tag as well. Again, I'm not sure why the earlier reuse code
didn't need that, and I'd suspect it may simply be buggy.

> > @@ -3061,7 +3199,6 @@ static void loosen_unused_packed_objects(void)
> >  static int pack_options_allow_reuse(void)
> >  {
> >  	return pack_to_stdout &&
> > -	       allow_ofs_delta &&
> >  	       !ignore_packed_keep_on_disk &&
> >  	       !ignore_packed_keep_in_core &&
> >  	       (!local || !have_non_local_packs) &&
> 
> Relaxing of the allow_ofs_delta condition - makes sense given (as above)
> we can convert OFS to REF.

Yep, exactly.

> Overall I think that this makes sense, except for my unanswered
> questions (and the presence of reused_chunk).

Thanks for looking at it. I still have to take a careful pass over the
whole split, but I've tried to at least answer your questions in the
meantime.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words
  2019-10-11  7:49     ` Christian Couder
@ 2019-10-11 18:05       ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2019-10-11 18:05 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jonathan Tan, git, Junio C Hamano, Christian Couder, Ramsay Jones

On Fri, Oct 11, 2019 at 09:49:53AM +0200, Christian Couder wrote:

> > I think this should be squashed with patch 3, adding to that commit
> > message "since word_alloc might be 0, we need to change the growth
> > function". (Or just make the minimum word_alloc be 1 or 32 or something
> > positive, if that's possible.)
> 
> Yeah, thank you for the suggestion. I still wonder why 2 is added
> instead of just 1 though.

Yeah, I think it should be squashed. I think it is not intentionally 2,
it is just that adding "1" to block makes sure we always make forward
progress. It could equally well be:

  self->word_alloc = block ? block * 2 : 1;

I think. Or probably this whole thing could be ALLOC_GROW(), as the
numbers aren't particularly important. I guess we need to make sure the
grown part is zero'd, so probably using alloc_nr() directly would make
more sense.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-11 18:01     ` Jeff King
@ 2019-10-11 21:04       ` Jonathan Tan
  2019-10-12  0:04       ` Junio C Hamano
  1 sibling, 0 replies; 31+ messages in thread
From: Jonathan Tan @ 2019-10-11 21:04 UTC (permalink / raw)
  To: peff; +Cc: jonathantanmy, christian.couder, git, gitster, chriscool, ramsay

> > This makes sense - offsets may be different when we omit objects from
> > the packfile. I think this can be computed by calculating the number of
> > zero bits between the current object's index and the nth object prior
> > (where n is the offset) in the bitmap resulting from
> > reuse_partial_packfile_from_bitmap() above, thus eliminating the need
> > for this array, but I haven't tested it.
> 
> You need to know not just the number of zero bits, but the accumulated
> offset due to those missing objects. So you'd end up having to walk over
> the revindex for that set of objects. This array is basically caching
> those accumulated offsets (for the parts we _do_ include) so we don't
> have to compute them repeatedly.

Ah...yes. For some reason I thought that the offset was a number of
objects, but it is actually a number of bytes. The patch makes sense
now.

> There's also a more subtle issue with entry sizes; see below.

Good point.

> > > @@ -1002,6 +1132,10 @@ static int have_duplicate_entry(const struct object_id *oid,
> > >  {
> > >  	struct object_entry *entry;
> > >  
> > > +	if (reuse_packfile_bitmap &&
> > > +	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> > > +		return 1;
> > 
> > Hmm...why did we previously not need to check the reuse information, but
> > we do now? I gave the code a cursory glance but couldn't find the
> > answer.
> 
> I think the original code may simply have been buggy and nobody noticed.
> Here's what I wrote when this line was added in our fork:

[snip explanation]

Thanks - I'll also take a look if I have time.

> Thanks for looking at it. I still have to take a careful pass over the
> whole split, but I've tried to at least answer your questions in the
> meantime.

Thanks for your responses. Also thanks to Christian for splitting it in
the first place, making it easier to review.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-11 18:01     ` Jeff King
  2019-10-11 21:04       ` Jonathan Tan
@ 2019-10-12  0:04       ` Junio C Hamano
  2019-10-13  7:38         ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2019-10-12  0:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, christian.couder, git, chriscool, ramsay

Jeff King <peff@peff.net> writes:

> The current code does so by creating a new entry in the reused_chunks
> array. In the worst case that can grow to have the same number of
> entries as we have objects. So this code was an attempt to pad the
> header of a shrunken entry to keep it the same size. I don't remember
> all the problems we ran into with that, but in the end we found that it
> didn't actually help much, because in practice we don't end up with a
> lot of chunks anyway.

Hmm, I am kind of surprised that the decoding side allowed such a
padding.

> For instance, packing torvalds/linux on our servers just now reused 6.5M
> objects, but only needed ~50k chunks.

OK, that's a good argument for not trying to optimize.

> I think the original code may simply have been buggy and nobody noticed.
> Here's what I wrote when this line was added in our fork:
>
>       pack-objects: check reused pack bitmap for duplicate objects
>   
>       If a client both asks for a tag by sha1 and specifies
>       "include-tag", we may end up including the tag in the reuse
>       bitmap (due to the first thing), and then later adding it to
>       the packlist (due to the second). This results in duplicate
>       objects in the pack, which git chokes on. We should notice
>       that we are already including it when doing the include-tag
>       portion, and avoid adding it to the packlist.
>   
>       The simplest place to fix this is right in add_ref_tag,
>       where we could avoid peeling the tag at all if we know that
>       we are already including it. However, I've pushed the check
>       instead into have_duplicate_entry. This fixes not only this
>       case, but also means that we cannot have any similar
>       problems lurking in other code.
>   
>       No tests, because git does not actually exhibit this "ask
>       for it and also include-tag" behavior. We do one or the
>       other on clone, depending on whether --single-branch is set.
>       However, libgit2 does both.
>
> I'm not sure why we didn't notice it with the older reuse code. It may
> simply have been that it hardly ever triggered on our servers.

Impressed by the careful thinking here.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-12  0:04       ` Junio C Hamano
@ 2019-10-13  7:38         ` Jeff King
  2019-10-17  7:03           ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2019-10-13  7:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, christian.couder, git, chriscool, ramsay

On Sat, Oct 12, 2019 at 09:04:58AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The current code does so by creating a new entry in the reused_chunks
> > array. In the worst case that can grow to have the same number of
> > entries as we have objects. So this code was an attempt to pad the
> > header of a shrunken entry to keep it the same size. I don't remember
> > all the problems we ran into with that, but in the end we found that it
> > didn't actually help much, because in practice we don't end up with a
> > lot of chunks anyway.
> 
> Hmm, I am kind of surprised that the decoding side allowed such a
> padding.

IIRC, the "padding" is just a sequence of 0-length-plus-continuation-bit
varint bytes. And for some reason it worked for the size but not for the
delta offset value. So the decoder wasn't aware of it, but simply hadn't
explicitly enforced that there weren't pointless bytes.

But yeah, it seems like a pretty hacky thing to rely on. I don't think
we ever even ran that code in production, and the if(0) was just
leftover experimental cruft.

> > I think the original code may simply have been buggy and nobody noticed.
> > Here's what I wrote when this line was added in our fork:
> [...]
> Impressed by the careful thinking here.

It's unfortunate that the reasoning there wasn't part of the earlier
submission. I'm not sure how to reconcile that. The patches as
originally written can't be applied now (they were munged over the years
during merges with upstream). And some of them are just "oops, fix this
dumb bug" trash that we wouldn't want to take anyway.

So I think at best they're something for somebody to manually look at
and try to incorporate into the commit messages of the revised patch
series. But I didn't give them to Christian to work with because it's
hard to even figure out which patches are still relevant. I wish I had a
better tooling there. I've been playing with something that looks at a
diff and then tries to blame each of the touched lines. Which is sort of
like the "-L" line-log, I guess, but for a very non-contiguous set of
lines.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-13  7:38         ` Jeff King
@ 2019-10-17  7:03           ` Junio C Hamano
  2019-10-17  7:23             ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2019-10-17  7:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, christian.couder, git, chriscool, ramsay

Jeff King <peff@peff.net> writes:

>> Hmm, I am kind of surprised that the decoding side allowed such a
>> padding.
>
> IIRC, the "padding" is just a sequence of 0-length-plus-continuation-bit
> varint bytes. And for some reason it worked for the size but not for the
> delta offset value.

I think the reason is because they use different varint definition.

The encoding used in builtin/pack-objects.c::write_no_reuse_object()
is for offsets, and it came much later and with an improvement over
the encoding used for delta size in diff-delta.c::create_delta().
The more recent encoding does not allow padding (when I compare the
decoders for these two encodings, I notice there is +1 for each
7-bit iteration; this essentially declares that a byte with "not the
final byte" bit set with all other bits clear does not mean 0 but it
means 1, which breaks the idea of padding to encode filler zero
bits).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 10/10] pack-objects: improve partial packfile reuse
  2019-10-17  7:03           ` Junio C Hamano
@ 2019-10-17  7:23             ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2019-10-17  7:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, christian.couder, git, chriscool, ramsay

On Thu, Oct 17, 2019 at 04:03:00PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> Hmm, I am kind of surprised that the decoding side allowed such a
> >> padding.
> >
> > IIRC, the "padding" is just a sequence of 0-length-plus-continuation-bit
> > varint bytes. And for some reason it worked for the size but not for the
> > delta offset value.
> 
> I think the reason is because they use different varint definition.
> 
> The encoding used in builtin/pack-objects.c::write_no_reuse_object()
> is for offsets, and it came much later and with an improvement over
> the encoding used for delta size in diff-delta.c::create_delta().
> The more recent encoding does not allow padding (when I compare the
> decoders for these two encodings, I notice there is +1 for each
> 7-bit iteration; this essentially declares that a byte with "not the
> final byte" bit set with all other bits clear does not mean 0 but it
> means 1, which breaks the idea of padding to encode filler zero
> bits).

Yeah, that sounds right. I think the "old" one is actually the pack size
header in unpack_object_header_buffer(), not the delta size header.

At any rate, it seems like a terrible idea, so I'll be glad to see the
code dropped. :)

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-10-17  7:25 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-13 13:02 [RFC PATCH 00/10] Rewrite packfile reuse code Christian Couder
2019-09-13 13:02 ` [RFC PATCH 01/10] builtin/pack-objects: report reused packfile objects Christian Couder
2019-09-13 13:02 ` [RFC PATCH 02/10] packfile: expose get_delta_base() Christian Couder
2019-09-13 13:02 ` [RFC PATCH 03/10] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-09-13 13:02 ` [RFC PATCH 04/10] ewah/bitmap: always allocate 2 more words Christian Couder
2019-10-10 23:40   ` Jonathan Tan
2019-10-11  7:49     ` Christian Couder
2019-10-11 18:05       ` Jeff King
2019-09-13 13:02 ` [RFC PATCH 05/10] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-10-10 23:44   ` Jonathan Tan
2019-10-11  7:50     ` Christian Couder
2019-09-13 13:02 ` [RFC PATCH 06/10] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-09-13 13:02 ` [RFC PATCH 07/10] csum-file: introduce hashfile_total() Christian Couder
2019-09-13 13:02 ` [RFC PATCH 08/10] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-09-13 21:37   ` Junio C Hamano
2019-09-13 13:02 ` [RFC PATCH 09/10] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-09-13 13:02 ` [RFC PATCH 10/10] pack-objects: improve partial packfile reuse Christian Couder
2019-09-13 22:29   ` Junio C Hamano
2019-09-14  2:02     ` Jeff King
2019-09-14  3:06       ` Junio C Hamano
2019-10-02 15:57         ` Jeff King
2019-10-03  2:06           ` Junio C Hamano
2019-10-03  6:55             ` Christian Couder
2019-10-10 23:59   ` Jonathan Tan
2019-10-11  7:39     ` Christian Couder
2019-10-11 18:01     ` Jeff King
2019-10-11 21:04       ` Jonathan Tan
2019-10-12  0:04       ` Junio C Hamano
2019-10-13  7:38         ` Jeff King
2019-10-17  7:03           ` Junio C Hamano
2019-10-17  7:23             ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).