git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v2 0/9] Rewrite packfile reuse code
@ 2019-10-19 10:35 Christian Couder
  2019-10-19 10:35 ` [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

This patch series is rewriting the code that tries to reuse existing
packfiles.

The code in this patch series was written by GitHub, and Peff nicely
provided it in the following discussion:

https://public-inbox.org/git/3E56B0FD-EBE8-4057-A93A-16EBB09FBCE0@jramsay.com.au/

The first version of this patch series was also discussed:

https://public-inbox.org/git/20190913130226.7449-1-chriscool@tuxfamily.org/

Thanks to the reviewers!

According to Peff this new code is a lot smarter than what it
replaces. It allows "holes" in the chunks of packfile to be reused,
and skips over them. It rewrites OFS_DELTA offsets as it goes to
account for the holes. So it's basically a linear walk over the
packfile, but with the important distinction that we don't add those
objects to the object_entry array, which makes them very lightweight
(especially in memory use, but they also aren't considered bases for
finding new deltas, etc). It seems like a good compromise between the
cost to serve a clone and the quality of the resulting packfile.

Changes since the previous patch series are the following:

  - Rebased onto current master (v2.24.0-rc0), which means that
    conflicts due to dropping the last argument of packlist_find() are
    resolved.

  - Replaced "in a following patch" with "in a following commit" in
    commit messages.

  - Squashed previous patchs 3/10 and 4/10 into new patch 3/9 as
    suggested by Peff and Jonathan.

  - Changed commit message of patch 3/9, so that the justification
    should be correct now.

  - Also in patch 3/9, `block ? block * 2 : 1` is now used to compute
    the number of words that should be allocated.

  - Changed commit message of patch 4/9 (previously 5/10), so that the
    justification should be correct now.

  - Document pack.allowPackReuse in patch 7/9.

  - Also in patch 7/9 move using the `allow_pack_reuse` variable into
    pack_options_allow_reuse() as suggested by Junio.

  - Improved commit message in patch 9/9 a lot by adding many comments
    made by Peff about it.

  - In patch 9/9 removed `if (0) { ... }` code.

  - Also in patch 9/9 changed parameter name from
    `struct bitmap **bitmap` to `struct bitmap **reuse_out` in
    pack-bitmap.h as suggested by Jonathan.

I tried to use `git range-diff` to see if I could send its output to
compare this patch series with the previous one, but it looks like it
unfortunately thinks that new patch 7/9 is completely different than
previous patch 8/10, and also it shows a lot more changes from patch
10/10 to patch 9/9 than necessary for some reason. So I decided not to
send it, but if you want it anyway please tell me.

I have put Peff as the author of all the commits.

Jeff King (9):
  builtin/pack-objects: report reused packfile objects
  packfile: expose get_delta_base()
  ewah/bitmap: introduce bitmap_word_alloc()
  pack-bitmap: don't rely on bitmap_git->reuse_objects
  pack-bitmap: introduce bitmap_walk_contains()
  csum-file: introduce hashfile_total()
  pack-objects: introduce pack.allowPackReuse
  builtin/pack-objects: introduce obj_is_packed()
  pack-objects: improve partial packfile reuse

 Documentation/config/pack.txt |   4 +
 builtin/pack-objects.c        | 235 +++++++++++++++++++++++++++-------
 csum-file.h                   |   9 ++
 ewah/bitmap.c                 |  13 +-
 ewah/ewok.h                   |   1 +
 pack-bitmap.c                 | 178 +++++++++++++++++--------
 pack-bitmap.h                 |   6 +-
 packfile.c                    |  10 +-
 packfile.h                    |   3 +
 9 files changed, 349 insertions(+), 110 deletions(-)

-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 2/9] packfile: expose get_delta_base() Christian Couder
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan, James Ramsay

From: Jeff King <peff@peff.net>

To see when packfile reuse kicks in or not, it is useful to
show reused packfile objects statistics in the output of
upload-pack.

Helped-by: James Ramsay <james@jramsay.com.au>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5876583220..f2c2703090 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3509,7 +3509,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (progress)
 		fprintf_ln(stderr,
 			   _("Total %"PRIu32" (delta %"PRIu32"),"
-			     " reused %"PRIu32" (delta %"PRIu32")"),
-			   written, written_delta, reused, reused_delta);
+			     " reused %"PRIu32" (delta %"PRIu32"),"
+			     " pack-reused %"PRIu32),
+			   written, written_delta, reused, reused_delta,
+			   reuse_packfile_objects);
 	return 0;
 }
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/9] packfile: expose get_delta_base()
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
  2019-10-19 10:35 ` [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

In a following commit get_delta_base() will be used outside
packfile.c, so let's make it non static and declare it in
packfile.h.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 packfile.c | 10 +++++-----
 packfile.h |  3 +++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 355066de17..81e66847bf 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1173,11 +1173,11 @@ const struct packed_git *has_packed_and_bad(struct repository *r,
 	return NULL;
 }
 
-static off_t get_delta_base(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t *curpos,
-				    enum object_type type,
-				    off_t delta_obj_offset)
+off_t get_delta_base(struct packed_git *p,
+		     struct pack_window **w_curs,
+		     off_t *curpos,
+		     enum object_type type,
+		     off_t delta_obj_offset)
 {
 	unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
 	off_t base_offset;
diff --git a/packfile.h b/packfile.h
index fc7904ec81..ec536a4ae5 100644
--- a/packfile.h
+++ b/packfile.h
@@ -151,6 +151,9 @@ void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum object
 unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
+off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
+		     off_t *curpos, enum object_type type,
+		     off_t delta_obj_offset);
 
 void release_pack_memory(size_t);
 
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
  2019-10-19 10:35 ` [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
  2019-10-19 10:35 ` [PATCH v2 2/9] packfile: expose get_delta_base() Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-22 17:46   ` Jonathan Tan
  2019-10-19 10:35 ` [PATCH v2 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

In a following commit we will need to allocate a variable
number of bitmap words, instead of always 32, so let's add
bitmap_word_alloc() for this purpose.

We will also always access at least one word for each bitmap,
so we want to make sure that at least one is always
allocated.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 ewah/bitmap.c | 13 +++++++++----
 ewah/ewok.h   |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/ewah/bitmap.c b/ewah/bitmap.c
index 52f1178db4..b5fed9621f 100644
--- a/ewah/bitmap.c
+++ b/ewah/bitmap.c
@@ -22,21 +22,26 @@
 #define EWAH_MASK(x) ((eword_t)1 << (x % BITS_IN_EWORD))
 #define EWAH_BLOCK(x) (x / BITS_IN_EWORD)
 
-struct bitmap *bitmap_new(void)
+struct bitmap *bitmap_word_alloc(size_t word_alloc)
 {
 	struct bitmap *bitmap = xmalloc(sizeof(struct bitmap));
-	bitmap->words = xcalloc(32, sizeof(eword_t));
-	bitmap->word_alloc = 32;
+	bitmap->words = xcalloc(word_alloc, sizeof(eword_t));
+	bitmap->word_alloc = word_alloc;
 	return bitmap;
 }
 
+struct bitmap *bitmap_new(void)
+{
+	return bitmap_word_alloc(32);
+}
+
 void bitmap_set(struct bitmap *self, size_t pos)
 {
 	size_t block = EWAH_BLOCK(pos);
 
 	if (block >= self->word_alloc) {
 		size_t old_size = self->word_alloc;
-		self->word_alloc = block * 2;
+		self->word_alloc = block ? block * 2 : 1;
 		REALLOC_ARRAY(self->words, self->word_alloc);
 		memset(self->words + old_size, 0x0,
 			(self->word_alloc - old_size) * sizeof(eword_t));
diff --git a/ewah/ewok.h b/ewah/ewok.h
index 84b2a29faa..1b98b57c8b 100644
--- a/ewah/ewok.h
+++ b/ewah/ewok.h
@@ -172,6 +172,7 @@ struct bitmap {
 };
 
 struct bitmap *bitmap_new(void);
+struct bitmap *bitmap_word_alloc(size_t word_alloc);
 void bitmap_set(struct bitmap *self, size_t pos);
 int bitmap_get(struct bitmap *self, size_t pos);
 void bitmap_reset(struct bitmap *self);
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (2 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will no longer compute bitmap_git->reuse_objects in a
following commit, so we cannot rely on it anymore to
terminate the loop early; we have to iterate to the end.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index e07c798879..016d0319fc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -622,7 +622,7 @@ static void show_objects_for_type(
 	enum object_type object_type,
 	show_reachable_fn show_reach)
 {
-	size_t pos = 0, i = 0;
+	size_t i = 0;
 	uint32_t offset;
 
 	struct ewah_iterator it;
@@ -630,13 +630,15 @@ static void show_objects_for_type(
 
 	struct bitmap *objects = bitmap_git->result;
 
-	if (bitmap_git->reuse_objects == bitmap_git->pack->num_objects)
-		return;
-
 	ewah_iterator_init(&it, type_filter);
 
-	while (i < objects->word_alloc && ewah_iterator_next(&filter, &it)) {
+	for (i = 0; i < objects->word_alloc &&
+			ewah_iterator_next(&filter, &it); i++) {
 		eword_t word = objects->words[i] & filter;
+		size_t pos = (i * BITS_IN_EWORD);
+
+		if (!word)
+			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			struct object_id oid;
@@ -648,9 +650,6 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			if (pos + offset < bitmap_git->reuse_objects)
-				continue;
-
 			entry = &bitmap_git->pack->revindex[pos + offset];
 			nth_packed_object_oid(&oid, bitmap_git->pack, entry->nr);
 
@@ -659,9 +658,6 @@ static void show_objects_for_type(
 
 			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
 		}
-
-		pos += BITS_IN_EWORD;
-		i++;
 	}
 }
 
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (3 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 15:25   ` Philip Oakley
  2019-10-19 10:35 ` [PATCH v2 6/9] csum-file: introduce hashfile_total() Christian Couder
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will use this helper function in a following commit to
tell us if an object is packed.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 pack-bitmap.c | 12 ++++++++++++
 pack-bitmap.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 016d0319fc..8a51302a1a 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -826,6 +826,18 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	return 0;
 }
 
+int bitmap_walk_contains(struct bitmap_index *bitmap_git,
+			 struct bitmap *bitmap, const struct object_id *oid)
+{
+	int idx;
+
+	if (!bitmap)
+		return 0;
+
+	idx = bitmap_position(bitmap_git, oid);
+	return idx >= 0 && bitmap_get(bitmap, idx);
+}
+
 void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
 				 show_reachable_fn show_reachable)
 {
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 466c5afa09..6ab6033dbe 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -3,6 +3,7 @@
 
 #include "ewah/ewok.h"
 #include "khash.h"
+#include "pack.h"
 #include "pack-objects.h"
 
 struct commit;
@@ -53,6 +54,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
+int bitmap_walk_contains(struct bitmap_index *,
+			 struct bitmap *bitmap, const struct object_id *oid);
 
 /*
  * After a traversal has been performed by prepare_bitmap_walk(), this can be
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 6/9] csum-file: introduce hashfile_total()
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (4 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

We will need this helper function in a following commit
to give us total number of bytes fed to the hashfile so far.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 csum-file.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/csum-file.h b/csum-file.h
index a98b1eee53..f9cbd317fb 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -42,6 +42,15 @@ void hashflush(struct hashfile *f);
 void crc32_begin(struct hashfile *);
 uint32_t crc32_end(struct hashfile *);
 
+/*
+ * Returns the total number of bytes fed to the hashfile so far (including ones
+ * that have not been written out to the descriptor yet).
+ */
+static inline off_t hashfile_total(struct hashfile *f)
+{
+	return f->total + f->offset;
+}
+
 static inline void hashwrite_u8(struct hashfile *f, uint8_t data)
 {
 	hashwrite(f, &data, sizeof(data));
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 7/9] pack-objects: introduce pack.allowPackReuse
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (5 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 6/9] csum-file: introduce hashfile_total() Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
  2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

Let's make it possible to configure if we want pack reuse or not.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/pack.txt | 4 ++++
 builtin/pack-objects.c        | 8 +++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 1d66f0c992..58323a351f 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -27,6 +27,10 @@ Note that changing the compression level will not automatically recompress
 all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
+pack.allowPackReuse::
+	When true, which is the default, Git will try to reuse parts
+	of existing packfiles when preparing new packfiles.
+
 pack.island::
 	An extended regular expression configuring a set of delta
 	islands. See "DELTA ISLANDS" in linkgit:git-pack-objects[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f2c2703090..4fcfcf6097 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -96,6 +96,7 @@ static off_t reuse_packfile_offset;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
+static int allow_pack_reuse = 1;
 static enum {
 	WRITE_BITMAP_FALSE = 0,
 	WRITE_BITMAP_QUIET,
@@ -2699,6 +2700,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.allowpackreuse")) {
+		allow_pack_reuse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
@@ -3030,7 +3035,8 @@ static void loosen_unused_packed_objects(void)
  */
 static int pack_options_allow_reuse(void)
 {
-	return pack_to_stdout &&
+	return allow_pack_reuse &&
+	       pack_to_stdout &&
 	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 8/9] builtin/pack-objects: introduce obj_is_packed()
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (6 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
  8 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

Let's refactor the way we check if an object is packed by
introducing obj_is_packed(). This function is now a simple
wrapper around packlist_find(), but it will evolve in a
following commit.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4fcfcf6097..08898331ef 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2553,6 +2553,11 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 	free(p);
 }
 
+static int obj_is_packed(const struct object_id *oid)
+{
+	return !!packlist_find(&to_pack, oid);
+}
+
 static void add_tag_chain(const struct object_id *oid)
 {
 	struct tag *tag;
@@ -2564,7 +2569,7 @@ static void add_tag_chain(const struct object_id *oid)
 	 * it was included via bitmaps, we would not have parsed it
 	 * previously).
 	 */
-	if (packlist_find(&to_pack, oid))
+	if (obj_is_packed(oid))
 		return;
 
 	tag = lookup_tag(the_repository, oid);
@@ -2588,7 +2593,7 @@ static int add_ref_tag(const char *path, const struct object_id *oid, int flag,
 
 	if (starts_with(path, "refs/tags/") && /* is a tag? */
 	    !peel_ref(path, &peeled)    && /* peelable? */
-	    packlist_find(&to_pack, &peeled))      /* object packed? */
+	    obj_is_packed(&peeled)) /* object packed? */
 		add_tag_chain(oid);
 	return 0;
 }
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
                   ` (7 preceding siblings ...)
  2019-10-19 10:35 ` [PATCH v2 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
@ 2019-10-19 10:35 ` Christian Couder
  2019-10-19 15:30   ` Philip Oakley
  2019-10-22 19:48   ` Jonathan Tan
  8 siblings, 2 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-19 10:35 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

From: Jeff King <peff@peff.net>

Let's store the chunks of the packfile that we reuse in
a dynamic array of `struct reused_chunk`, and let's use
a reuse_packfile_bitmap to speed up reusing parts of
packfiles.

The dynamic array of `struct reused_chunk` is useful
because we need to know not just the number of zero bits,
but the accumulated offset due to missing objects. So
without the array we'd end up having to walk over the
revindex for that set of objects. The array is
basically caching those accumulated offsets (for the
parts we _do_ include), so we don't have to compute them
repeatedly.

The old code just tried to dump a whole segment of the
pack verbatim. That's faster than the traditional way of
actually adding objects to the packing list, but it
didn't kick in very often. This new code is really going
for a middle ground: do _some_ per-object work, but way
less than we'd traditionally do.

For instance, packing torvalds/linux on GitHub servers
just now reused 6.5M objects, but only needed ~50k
chunks.

Additional checks are added in have_duplicate_entry()
and obj_is_packed() to avoid duplicate objects in the
reuse bitmap. It was probably buggy to not have such a
check before.

If a client both asks for a tag by sha1 and specifies
"include-tag", we may end up including the tag in the reuse
bitmap (due to the first thing), and then later adding it to
the packlist (due to the second). This results in duplicate
objects in the pack, which git chokes on. We should notice
that we are already including it when doing the include-tag
portion, and avoid adding it to the packlist.

The simplest place to fix this is right in add_ref_tag,
where we could avoid peeling the tag at all if we know that
we are already including it. However, I've pushed the check
instead into have_duplicate_entry(). This fixes not only
this case, but also means that we cannot have any similar
problems lurking in other code.

No tests, because git does not actually exhibit this "ask
for it and also include-tag" behavior. We do one or the
other on clone, depending on whether --single-branch is set.
However, libgit2 does both.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 214 ++++++++++++++++++++++++++++++++---------
 pack-bitmap.c          | 150 +++++++++++++++++++++--------
 pack-bitmap.h          |   3 +-
 3 files changed, 280 insertions(+), 87 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08898331ef..f710f8dde3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -92,7 +92,7 @@ static struct progress *progress_state;
 
 static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
-static off_t reuse_packfile_offset;
+static struct bitmap *reuse_packfile_bitmap;
 
 static int use_bitmap_index_default = 1;
 static int use_bitmap_index = -1;
@@ -785,57 +785,177 @@ static struct object_entry **compute_write_order(void)
 	return wo;
 }
 
-static off_t write_reused_pack(struct hashfile *f)
+/*
+ * Record the offsets needed in our reused packfile chunks due to
+ * "gaps" where we omitted some objects.
+ */
+static struct reused_chunk {
+	off_t start;
+	off_t offset;
+} *reused_chunks;
+static int reused_chunks_nr;
+static int reused_chunks_alloc;
+
+static void record_reused_object(off_t where, off_t offset)
 {
-	unsigned char buffer[8192];
-	off_t to_write, total;
-	int fd;
+	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].offset == offset)
+		return;
 
-	if (!is_pack_valid(reuse_packfile))
-		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
+	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
+		   reused_chunks_alloc);
+	reused_chunks[reused_chunks_nr].start = where;
+	reused_chunks[reused_chunks_nr].offset = offset;
+	reused_chunks_nr++;
+}
 
-	fd = git_open(reuse_packfile->pack_name);
-	if (fd < 0)
-		die_errno(_("unable to open packfile for reuse: %s"),
-			  reuse_packfile->pack_name);
+/*
+ * Binary search to find the chunk that "where" is in. Note
+ * that we're not looking for an exact match, just the first
+ * chunk that contains it (which implicitly ends at the start
+ * of the next chunk.
+ */
+static off_t find_reused_offset(off_t where)
+{
+	int lo = 0, hi = reused_chunks_nr;
+	while (lo < hi) {
+		int mi = lo + ((hi - lo) / 2);
+		if (where == reused_chunks[mi].start)
+			return reused_chunks[mi].offset;
+		if (where < reused_chunks[mi].start)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
 
-	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
-		die_errno(_("unable to seek in reused packfile"));
+	/*
+	 * The first chunk starts at zero, so we can't have gone below
+	 * there.
+	 */
+	assert(lo);
+	return reused_chunks[lo-1].offset;
+}
+
+static void write_reused_pack_one(size_t pos, struct hashfile *out,
+				  struct pack_window **w_curs)
+{
+	off_t offset, next, cur;
+	enum object_type type;
+	unsigned long size;
+
+	offset = reuse_packfile->revindex[pos].offset;
+	next = reuse_packfile->revindex[pos + 1].offset;
 
-	if (reuse_packfile_offset < 0)
-		reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
+	record_reused_object(offset, offset - hashfile_total(out));
 
-	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
+	cur = offset;
+	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
+	assert(type >= 0);
 
-	while (to_write) {
-		int read_pack = xread(fd, buffer, sizeof(buffer));
+	if (type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		off_t fixup;
+
+		unsigned char header[MAX_PACK_OBJECT_HEADER];
+		unsigned len;
+
+		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
+		assert(base_offset != 0);
+
+		/* Convert to REF_DELTA if we must... */
+		if (!allow_ofs_delta) {
+			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			const unsigned char *base_sha1 =
+				nth_packed_object_sha1(reuse_packfile,
+						       reuse_packfile->revindex[base_pos].nr);
+
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_REF_DELTA, size);
+			hashwrite(out, header, len);
+			hashwrite(out, base_sha1, 20);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
 
-		if (read_pack <= 0)
-			die_errno(_("unable to read from reused packfile"));
+		/* Otherwise see if we need to rewrite the offset... */
+		fixup = find_reused_offset(offset) -
+			find_reused_offset(base_offset);
+		if (fixup) {
+			unsigned char ofs_header[10];
+			unsigned i, ofs_len;
+			off_t ofs = offset - base_offset - fixup;
 
-		if (read_pack > to_write)
-			read_pack = to_write;
+			len = encode_in_pack_object_header(header, sizeof(header),
+							   OBJ_OFS_DELTA, size);
 
-		hashwrite(f, buffer, read_pack);
-		to_write -= read_pack;
+			i = sizeof(ofs_header) - 1;
+			ofs_header[i] = ofs & 127;
+			while (ofs >>= 7)
+				ofs_header[--i] = 128 | (--ofs & 127);
+
+			ofs_len = sizeof(ofs_header) - i;
+
+			hashwrite(out, header, len);
+			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
+			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
+			return;
+		}
+
+		/* ...otherwise we have no fixup, and can write it verbatim */
+	}
+
+	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
+}
+
+static size_t write_reused_pack_verbatim(struct hashfile *out,
+					 struct pack_window **w_curs)
+{
+	size_t pos = 0;
+
+	while (pos < reuse_packfile_bitmap->word_alloc &&
+			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
+		pos++;
+
+	if (pos) {
+		off_t to_write;
+
+		written = (pos * BITS_IN_EWORD);
+		to_write = reuse_packfile->revindex[written].offset
+			- sizeof(struct pack_header);
+
+		record_reused_object(sizeof(struct pack_header), 0);
+		hashflush(out);
+		copy_pack_data(out, reuse_packfile, w_curs,
+			sizeof(struct pack_header), to_write);
 
-		/*
-		 * We don't know the actual number of objects written,
-		 * only how many bytes written, how many bytes total, and
-		 * how many objects total. So we can fake it by pretending all
-		 * objects we are writing are the same size. This gives us a
-		 * smooth progress meter, and at the end it matches the true
-		 * answer.
-		 */
-		written = reuse_packfile_objects *
-				(((double)(total - to_write)) / total);
 		display_progress(progress_state, written);
 	}
+	return pos;
+}
+
+static void write_reused_pack(struct hashfile *f)
+{
+	size_t i = 0;
+	uint32_t offset;
+	struct pack_window *w_curs = NULL;
+
+	if (allow_ofs_delta)
+		i = write_reused_pack_verbatim(f, &w_curs);
+
+	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
+		eword_t word = reuse_packfile_bitmap->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
+
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			write_reused_pack_one(pos + offset, f, &w_curs);
+			display_progress(progress_state, ++written);
+		}
+	}
 
-	close(fd);
-	written = reuse_packfile_objects;
-	display_progress(progress_state, written);
-	return reuse_packfile_offset - sizeof(struct pack_header);
+	unuse_pack(&w_curs);
 }
 
 static const char no_split_warning[] = N_(
@@ -868,11 +988,9 @@ static void write_pack_file(void)
 		offset = write_pack_header(f, nr_remaining);
 
 		if (reuse_packfile) {
-			off_t packfile_size;
 			assert(pack_to_stdout);
-
-			packfile_size = write_reused_pack(f);
-			offset += packfile_size;
+			write_reused_pack(f);
+			offset = hashfile_total(f);
 		}
 
 		nr_written = 0;
@@ -1001,6 +1119,10 @@ static int have_duplicate_entry(const struct object_id *oid,
 {
 	struct object_entry *entry;
 
+	if (reuse_packfile_bitmap &&
+	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
+		return 1;
+
 	entry = packlist_find(&to_pack, oid);
 	if (!entry)
 		return 0;
@@ -2555,7 +2677,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
 
 static int obj_is_packed(const struct object_id *oid)
 {
-	return !!packlist_find(&to_pack, oid);
+	return packlist_find(&to_pack, oid) ||
+		(reuse_packfile_bitmap &&
+		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
 }
 
 static void add_tag_chain(const struct object_id *oid)
@@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth)
 
 	if (nr_deltas && n > 1) {
 		unsigned nr_done = 0;
+
 		if (progress)
 			progress_state = start_progress(_("Compressing objects"),
 							nr_deltas);
@@ -3042,7 +3167,6 @@ static int pack_options_allow_reuse(void)
 {
 	return allow_pack_reuse &&
 	       pack_to_stdout &&
-	       allow_ofs_delta &&
 	       !ignore_packed_keep_on_disk &&
 	       !ignore_packed_keep_in_core &&
 	       (!local || !have_non_local_packs) &&
@@ -3059,7 +3183,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 			bitmap_git,
 			&reuse_packfile,
 			&reuse_packfile_objects,
-			&reuse_packfile_offset)) {
+			&reuse_packfile_bitmap)) {
 		assert(reuse_packfile_objects);
 		nr_result += reuse_packfile_objects;
 		display_progress(progress_state, nr_result);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8a51302a1a..cbfc544411 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -326,6 +326,13 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 	munmap(bitmap_git->map, bitmap_git->map_size);
 	bitmap_git->map = NULL;
 	bitmap_git->map_size = 0;
+
+	kh_destroy_oid_map(bitmap_git->bitmaps);
+	bitmap_git->bitmaps = NULL;
+
+	kh_destroy_oid_pos(bitmap_git->ext_index.positions);
+	bitmap_git->ext_index.positions = NULL;
+
 	return -1;
 }
 
@@ -764,65 +771,126 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 	return NULL;
 }
 
-int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
-				       struct packed_git **packfile,
-				       uint32_t *entries,
-				       off_t *up_to)
+static void try_partial_reuse(struct bitmap_index *bitmap_git,
+			      size_t pos,
+			      struct bitmap *reuse,
+			      struct pack_window **w_curs)
 {
+	struct revindex_entry *revidx;
+	off_t offset;
+	enum object_type type;
+	unsigned long size;
+
+	if (pos >= bitmap_git->pack->num_objects)
+		return; /* not actually in the pack */
+
+	revidx = &bitmap_git->pack->revindex[pos];
+	offset = revidx->offset;
+	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (type < 0)
+		return; /* broken packfile, punt */
+
+	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
+		off_t base_offset;
+		int base_pos;
+
+		/*
+		 * Find the position of the base object so we can look it up
+		 * in our bitmaps. If we can't come up with an offset, or if
+		 * that offset is not in the revidx, the pack is corrupt.
+		 * There's nothing we can do, so just punt on this object,
+		 * and the normal slow path will complain about it in
+		 * more detail.
+		 */
+		base_offset = get_delta_base(bitmap_git->pack, w_curs,
+					     &offset, type, revidx->offset);
+		if (!base_offset)
+			return;
+		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
+		if (base_pos < 0)
+			return;
+
+		/*
+		 * We assume delta dependencies always point backwards. This
+		 * lets us do a single pass, and is basically always true
+		 * due to the way OFS_DELTAs work. You would not typically
+		 * find REF_DELTA in a bitmapped pack, since we only bitmap
+		 * packs we write fresh, and OFS_DELTA is the default). But
+		 * let's double check to make sure the pack wasn't written with
+		 * odd parameters.
+		 */
+		if (base_pos >= pos)
+			return;
+
+		/*
+		 * And finally, if we're not sending the base as part of our
+		 * reuse chunk, then don't send this object either. The base
+		 * would come after us, along with other objects not
+		 * necessarily in the pack, which means we'd need to convert
+		 * to REF_DELTA on the fly. Better to just let the normal
+		 * object_entry code path handle it.
+		 */
+		if (!bitmap_get(reuse, base_pos))
+			return;
+	}
+
 	/*
-	 * Reuse the packfile content if we need more than
-	 * 90% of its objects
+	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
-	static const double REUSE_PERCENT = 0.9;
+	bitmap_set(reuse, pos);
+}
 
+int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
+				       struct packed_git **packfile_out,
+				       uint32_t *entries,
+				       struct bitmap **reuse_out)
+{
 	struct bitmap *result = bitmap_git->result;
-	uint32_t reuse_threshold;
-	uint32_t i, reuse_objects = 0;
+	struct bitmap *reuse;
+	struct pack_window *w_curs = NULL;
+	size_t i = 0;
+	uint32_t offset;
 
 	assert(result);
 
-	for (i = 0; i < result->word_alloc; ++i) {
-		if (result->words[i] != (eword_t)~0) {
-			reuse_objects += ewah_bit_ctz64(~result->words[i]);
-			break;
-		}
-
-		reuse_objects += BITS_IN_EWORD;
-	}
+	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
+		i++;
 
-#ifdef GIT_BITMAP_DEBUG
-	{
-		const unsigned char *sha1;
-		struct revindex_entry *entry;
+	/* Don't mark objects not in the packfile */
+	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
+		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
 
-		entry = &bitmap_git->reverse_index->revindex[reuse_objects];
-		sha1 = nth_packed_object_sha1(bitmap_git->pack, entry->nr);
+	reuse = bitmap_word_alloc(i);
+	memset(reuse->words, 0xFF, i * sizeof(eword_t));
 
-		fprintf(stderr, "Failed to reuse at %d (%016llx)\n",
-			reuse_objects, result->words[i]);
-		fprintf(stderr, " %s\n", hash_to_hex(sha1));
-	}
-#endif
+	for (; i < result->word_alloc; ++i) {
+		eword_t word = result->words[i];
+		size_t pos = (i * BITS_IN_EWORD);
 
-	if (!reuse_objects)
-		return -1;
+		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			if ((word >> offset) == 0)
+				break;
 
-	if (reuse_objects >= bitmap_git->pack->num_objects) {
-		bitmap_git->reuse_objects = *entries = bitmap_git->pack->num_objects;
-		*up_to = -1; /* reuse the full pack */
-		*packfile = bitmap_git->pack;
-		return 0;
+			offset += ewah_bit_ctz64(word >> offset);
+			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+		}
 	}
 
-	reuse_threshold = bitmap_popcount(bitmap_git->result) * REUSE_PERCENT;
+	unuse_pack(&w_curs);
 
-	if (reuse_objects < reuse_threshold)
+	*entries = bitmap_popcount(reuse);
+	if (!*entries) {
+		bitmap_free(reuse);
 		return -1;
+	}
 
-	bitmap_git->reuse_objects = *entries = reuse_objects;
-	*up_to = bitmap_git->pack->revindex[reuse_objects].offset;
-	*packfile = bitmap_git->pack;
-
+	/*
+	 * Drop any reused objects from the result, since they will not
+	 * need to be handled separately.
+	 */
+	bitmap_and_not(result, reuse);
+	*packfile_out = bitmap_git->pack;
+	*reuse_out = reuse;
 	return 0;
 }
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 6ab6033dbe..bcd03b8993 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -50,7 +50,8 @@ void test_bitmap_walk(struct rev_info *revs);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
-				       uint32_t *entries, off_t *up_to);
+				       uint32_t *entries,
+				       struct bitmap **reuse_out);
 int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
 			     kh_oid_map_t *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
-- 
2.24.0.rc0.9.gef620577e2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-10-19 10:35 ` [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
@ 2019-10-19 15:25   ` Philip Oakley
  2019-10-19 18:55     ` Christian Couder
  2019-10-19 23:18     ` Jeff King
  0 siblings, 2 replies; 22+ messages in thread
From: Philip Oakley @ 2019-10-19 15:25 UTC (permalink / raw)
  To: Christian Couder, git
  Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

Hi Christian,
can I check one thing?

On 19/10/2019 11:35, Christian Couder wrote:
> From: Jeff King <peff@peff.net>
>
> We will use this helper function in a following commit to
> tell us if an object is packed.
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>   pack-bitmap.c | 12 ++++++++++++
>   pack-bitmap.h |  3 +++
>   2 files changed, 15 insertions(+)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 016d0319fc..8a51302a1a 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -826,6 +826,18 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
>   	return 0;
>   }
>   
> +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
> +			 struct bitmap *bitmap, const struct object_id *oid)
> +{
> +	int idx;
Excuse my ignorance here...

For the case on Windows (int/long 32 bit), is this return value 
guaranteed to be less than 2GiB, i.e. not a memory offset?

I'm just thinking ahead to the resolution of the 4GiB file limit issue 
on Git-for-Windows (https://github.com/git-for-windows/git/pull/2179)

> +
> +	if (!bitmap)
> +		return 0;
> +
> +	idx = bitmap_position(bitmap_git, oid);
> +	return idx >= 0 && bitmap_get(bitmap, idx);
> +}
> +
>   void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
>   				 show_reachable_fn show_reachable)
>   {
> diff --git a/pack-bitmap.h b/pack-bitmap.h
> index 466c5afa09..6ab6033dbe 100644
> --- a/pack-bitmap.h
> +++ b/pack-bitmap.h
> @@ -3,6 +3,7 @@
>   
>   #include "ewah/ewok.h"
>   #include "khash.h"
> +#include "pack.h"
>   #include "pack-objects.h"
>   
>   struct commit;
> @@ -53,6 +54,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
>   int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
>   			     kh_oid_map_t *reused_bitmaps, int show_progress);
>   void free_bitmap_index(struct bitmap_index *);
> +int bitmap_walk_contains(struct bitmap_index *,
> +			 struct bitmap *bitmap, const struct object_id *oid);
>   
>   /*
>    * After a traversal has been performed by prepare_bitmap_walk(), this can be
-- 
Philip

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
@ 2019-10-19 15:30   ` Philip Oakley
  2019-10-19 19:20     ` Christian Couder
  2019-10-22 19:48   ` Jonathan Tan
  1 sibling, 1 reply; 22+ messages in thread
From: Philip Oakley @ 2019-10-19 15:30 UTC (permalink / raw)
  To: Christian Couder, git
  Cc: Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

Hi Christian,

a couple of mem_size questions?

On 19/10/2019 11:35, Christian Couder wrote:
> From: Jeff King <peff@peff.net>
>
> Let's store the chunks of the packfile that we reuse in
> a dynamic array of `struct reused_chunk`, and let's use
> a reuse_packfile_bitmap to speed up reusing parts of
> packfiles.
>
> The dynamic array of `struct reused_chunk` is useful
> because we need to know not just the number of zero bits,
> but the accumulated offset due to missing objects. So
> without the array we'd end up having to walk over the
> revindex for that set of objects. The array is
> basically caching those accumulated offsets (for the
> parts we _do_ include), so we don't have to compute them
> repeatedly.
>
> The old code just tried to dump a whole segment of the
> pack verbatim. That's faster than the traditional way of
> actually adding objects to the packing list, but it
> didn't kick in very often. This new code is really going
> for a middle ground: do _some_ per-object work, but way
> less than we'd traditionally do.
>
> For instance, packing torvalds/linux on GitHub servers
> just now reused 6.5M objects, but only needed ~50k
> chunks.
>
> Additional checks are added in have_duplicate_entry()
> and obj_is_packed() to avoid duplicate objects in the
> reuse bitmap. It was probably buggy to not have such a
> check before.
>
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
>
> The simplest place to fix this is right in add_ref_tag,
> where we could avoid peeling the tag at all if we know that
> we are already including it. However, I've pushed the check
> instead into have_duplicate_entry(). This fixes not only
> this case, but also means that we cannot have any similar
> problems lurking in other code.
>
> No tests, because git does not actually exhibit this "ask
> for it and also include-tag" behavior. We do one or the
> other on clone, depending on whether --single-branch is set.
> However, libgit2 does both.
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>   builtin/pack-objects.c | 214 ++++++++++++++++++++++++++++++++---------
>   pack-bitmap.c          | 150 +++++++++++++++++++++--------
>   pack-bitmap.h          |   3 +-
>   3 files changed, 280 insertions(+), 87 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 08898331ef..f710f8dde3 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -92,7 +92,7 @@ static struct progress *progress_state;
>   
>   static struct packed_git *reuse_packfile;
>   static uint32_t reuse_packfile_objects;
> -static off_t reuse_packfile_offset;
> +static struct bitmap *reuse_packfile_bitmap;
>   
>   static int use_bitmap_index_default = 1;
>   static int use_bitmap_index = -1;
> @@ -785,57 +785,177 @@ static struct object_entry **compute_write_order(void)
>   	return wo;
>   }
>   
> -static off_t write_reused_pack(struct hashfile *f)
> +/*
> + * Record the offsets needed in our reused packfile chunks due to
> + * "gaps" where we omitted some objects.
> + */
> +static struct reused_chunk {
> +	off_t start;
> +	off_t offset;
> +} *reused_chunks;
> +static int reused_chunks_nr;
> +static int reused_chunks_alloc;
> +
> +static void record_reused_object(off_t where, off_t offset)
>   {
> -	unsigned char buffer[8192];
> -	off_t to_write, total;
> -	int fd;
> +	if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].offset == offset)
> +		return;
>   
> -	if (!is_pack_valid(reuse_packfile))
> -		die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
> +	ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
> +		   reused_chunks_alloc);
> +	reused_chunks[reused_chunks_nr].start = where;
> +	reused_chunks[reused_chunks_nr].offset = offset;
> +	reused_chunks_nr++;
> +}
>   
> -	fd = git_open(reuse_packfile->pack_name);
> -	if (fd < 0)
> -		die_errno(_("unable to open packfile for reuse: %s"),
> -			  reuse_packfile->pack_name);
> +/*
> + * Binary search to find the chunk that "where" is in. Note
> + * that we're not looking for an exact match, just the first
> + * chunk that contains it (which implicitly ends at the start
> + * of the next chunk.
> + */
> +static off_t find_reused_offset(off_t where)
> +{
> +	int lo = 0, hi = reused_chunks_nr;
> +	while (lo < hi) {
> +		int mi = lo + ((hi - lo) / 2);
> +		if (where == reused_chunks[mi].start)
> +			return reused_chunks[mi].offset;
> +		if (where < reused_chunks[mi].start)
> +			hi = mi;
> +		else
> +			lo = mi + 1;
> +	}
>   
> -	if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
> -		die_errno(_("unable to seek in reused packfile"));
> +	/*
> +	 * The first chunk starts at zero, so we can't have gone below
> +	 * there.
> +	 */
> +	assert(lo);
> +	return reused_chunks[lo-1].offset;
> +}
> +
> +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> +				  struct pack_window **w_curs)
> +{
> +	off_t offset, next, cur;
> +	enum object_type type;
> +	unsigned long size;

Is this a mem_sized size or a counter for less that 4GiB items?

> +
> +	offset = reuse_packfile->revindex[pos].offset;
> +	next = reuse_packfile->revindex[pos + 1].offset;
>   
> -	if (reuse_packfile_offset < 0)
> -		reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
> +	record_reused_object(offset, offset - hashfile_total(out));
>   
> -	total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
> +	cur = offset;
> +	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> +	assert(type >= 0);
>   
> -	while (to_write) {
> -		int read_pack = xread(fd, buffer, sizeof(buffer));
> +	if (type == OBJ_OFS_DELTA) {
> +		off_t base_offset;
> +		off_t fixup;
> +
> +		unsigned char header[MAX_PACK_OBJECT_HEADER];
> +		unsigned len;
> +
> +		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
> +		assert(base_offset != 0);
> +
> +		/* Convert to REF_DELTA if we must... */
> +		if (!allow_ofs_delta) {
> +			int base_pos = find_revindex_position(reuse_packfile, base_offset);
> +			const unsigned char *base_sha1 =
> +				nth_packed_object_sha1(reuse_packfile,
> +						       reuse_packfile->revindex[base_pos].nr);
> +
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_REF_DELTA, size);
> +			hashwrite(out, header, len);
> +			hashwrite(out, base_sha1, 20);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
>   
> -		if (read_pack <= 0)
> -			die_errno(_("unable to read from reused packfile"));
> +		/* Otherwise see if we need to rewrite the offset... */
> +		fixup = find_reused_offset(offset) -
> +			find_reused_offset(base_offset);
> +		if (fixup) {
> +			unsigned char ofs_header[10];
> +			unsigned i, ofs_len;
> +			off_t ofs = offset - base_offset - fixup;
>   
> -		if (read_pack > to_write)
> -			read_pack = to_write;
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_OFS_DELTA, size);
>   
> -		hashwrite(f, buffer, read_pack);
> -		to_write -= read_pack;
> +			i = sizeof(ofs_header) - 1;
> +			ofs_header[i] = ofs & 127;
> +			while (ofs >>= 7)
> +				ofs_header[--i] = 128 | (--ofs & 127);
> +
> +			ofs_len = sizeof(ofs_header) - i;
> +
> +			hashwrite(out, header, len);
> +			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
> +
> +		/* ...otherwise we have no fixup, and can write it verbatim */
> +	}
> +
> +	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
> +}
> +
> +static size_t write_reused_pack_verbatim(struct hashfile *out,
> +					 struct pack_window **w_curs)
> +{
> +	size_t pos = 0;
> +
> +	while (pos < reuse_packfile_bitmap->word_alloc &&
> +			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> +		pos++;
> +
> +	if (pos) {
> +		off_t to_write;
> +
> +		written = (pos * BITS_IN_EWORD);
> +		to_write = reuse_packfile->revindex[written].offset
> +			- sizeof(struct pack_header);
> +
> +		record_reused_object(sizeof(struct pack_header), 0);
> +		hashflush(out);
> +		copy_pack_data(out, reuse_packfile, w_curs,
> +			sizeof(struct pack_header), to_write);
>   
> -		/*
> -		 * We don't know the actual number of objects written,
> -		 * only how many bytes written, how many bytes total, and
> -		 * how many objects total. So we can fake it by pretending all
> -		 * objects we are writing are the same size. This gives us a
> -		 * smooth progress meter, and at the end it matches the true
> -		 * answer.
> -		 */
> -		written = reuse_packfile_objects *
> -				(((double)(total - to_write)) / total);
>   		display_progress(progress_state, written);
>   	}
> +	return pos;
> +}
> +
> +static void write_reused_pack(struct hashfile *f)
> +{
> +	size_t i = 0;
> +	uint32_t offset;
> +	struct pack_window *w_curs = NULL;
> +
> +	if (allow_ofs_delta)
> +		i = write_reused_pack_verbatim(f, &w_curs);
> +
> +	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
> +		eword_t word = reuse_packfile_bitmap->words[i];
> +		size_t pos = (i * BITS_IN_EWORD);
> +
> +		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
> +			if ((word >> offset) == 0)
> +				break;
> +
> +			offset += ewah_bit_ctz64(word >> offset);
> +			write_reused_pack_one(pos + offset, f, &w_curs);
> +			display_progress(progress_state, ++written);
> +		}
> +	}
>   
> -	close(fd);
> -	written = reuse_packfile_objects;
> -	display_progress(progress_state, written);
> -	return reuse_packfile_offset - sizeof(struct pack_header);
> +	unuse_pack(&w_curs);
>   }
>   
>   static const char no_split_warning[] = N_(
> @@ -868,11 +988,9 @@ static void write_pack_file(void)
>   		offset = write_pack_header(f, nr_remaining);
>   
>   		if (reuse_packfile) {
> -			off_t packfile_size;
>   			assert(pack_to_stdout);
> -
> -			packfile_size = write_reused_pack(f);
> -			offset += packfile_size;
> +			write_reused_pack(f);
> +			offset = hashfile_total(f);
>   		}
>   
>   		nr_written = 0;
> @@ -1001,6 +1119,10 @@ static int have_duplicate_entry(const struct object_id *oid,
>   {
>   	struct object_entry *entry;
>   
> +	if (reuse_packfile_bitmap &&
> +	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> +		return 1;
> +
>   	entry = packlist_find(&to_pack, oid);
>   	if (!entry)
>   		return 0;
> @@ -2555,7 +2677,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
>   
>   static int obj_is_packed(const struct object_id *oid)
>   {
> -	return !!packlist_find(&to_pack, oid);
> +	return packlist_find(&to_pack, oid) ||
> +		(reuse_packfile_bitmap &&
> +		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
>   }
>   
>   static void add_tag_chain(const struct object_id *oid)
> @@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth)
>   
>   	if (nr_deltas && n > 1) {
>   		unsigned nr_done = 0;
> +
>   		if (progress)
>   			progress_state = start_progress(_("Compressing objects"),
>   							nr_deltas);
> @@ -3042,7 +3167,6 @@ static int pack_options_allow_reuse(void)
>   {
>   	return allow_pack_reuse &&
>   	       pack_to_stdout &&
> -	       allow_ofs_delta &&
>   	       !ignore_packed_keep_on_disk &&
>   	       !ignore_packed_keep_in_core &&
>   	       (!local || !have_non_local_packs) &&
> @@ -3059,7 +3183,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
>   			bitmap_git,
>   			&reuse_packfile,
>   			&reuse_packfile_objects,
> -			&reuse_packfile_offset)) {
> +			&reuse_packfile_bitmap)) {
>   		assert(reuse_packfile_objects);
>   		nr_result += reuse_packfile_objects;
>   		display_progress(progress_state, nr_result);
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 8a51302a1a..cbfc544411 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -326,6 +326,13 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
>   	munmap(bitmap_git->map, bitmap_git->map_size);
>   	bitmap_git->map = NULL;
>   	bitmap_git->map_size = 0;
> +
> +	kh_destroy_oid_map(bitmap_git->bitmaps);
> +	bitmap_git->bitmaps = NULL;
> +
> +	kh_destroy_oid_pos(bitmap_git->ext_index.positions);
> +	bitmap_git->ext_index.positions = NULL;
> +
>   	return -1;
>   }
>   
> @@ -764,65 +771,126 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>   	return NULL;
>   }
>   
> -int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> -				       struct packed_git **packfile,
> -				       uint32_t *entries,
> -				       off_t *up_to)
> +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> +			      size_t pos,
> +			      struct bitmap *reuse,
> +			      struct pack_window **w_curs)
>   {
> +	struct revindex_entry *revidx;
> +	off_t offset;
> +	enum object_type type;
> +	unsigned long size;
Is this mem_sized or a <4GiB size?
> +
> +	if (pos >= bitmap_git->pack->num_objects)
> +		return; /* not actually in the pack */
> +
> +	revidx = &bitmap_git->pack->revindex[pos];
> +	offset = revidx->offset;
> +	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
> +	if (type < 0)
> +		return; /* broken packfile, punt */
> +
> +	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
> +		off_t base_offset;
> +		int base_pos;
> +
> +		/*
> +		 * Find the position of the base object so we can look it up
> +		 * in our bitmaps. If we can't come up with an offset, or if
> +		 * that offset is not in the revidx, the pack is corrupt.
> +		 * There's nothing we can do, so just punt on this object,
> +		 * and the normal slow path will complain about it in
> +		 * more detail.
> +		 */
> +		base_offset = get_delta_base(bitmap_git->pack, w_curs,
> +					     &offset, type, revidx->offset);
> +		if (!base_offset)
> +			return;
> +		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
> +		if (base_pos < 0)
> +			return;
> +
> +		/*
> +		 * We assume delta dependencies always point backwards. This
> +		 * lets us do a single pass, and is basically always true
> +		 * due to the way OFS_DELTAs work. You would not typically
> +		 * find REF_DELTA in a bitmapped pack, since we only bitmap
> +		 * packs we write fresh, and OFS_DELTA is the default). But
> +		 * let's double check to make sure the pack wasn't written with
> +		 * odd parameters.
> +		 */
> +		if (base_pos >= pos)
> +			return;
> +
> +		/*
> +		 * And finally, if we're not sending the base as part of our
> +		 * reuse chunk, then don't send this object either. The base
> +		 * would come after us, along with other objects not
> +		 * necessarily in the pack, which means we'd need to convert
> +		 * to REF_DELTA on the fly. Better to just let the normal
> +		 * object_entry code path handle it.
> +		 */
> +		if (!bitmap_get(reuse, base_pos))
> +			return;
> +	}
> +
>   	/*
> -	 * Reuse the packfile content if we need more than
> -	 * 90% of its objects
> +	 * If we got here, then the object is OK to reuse. Mark it.
>   	 */
> -	static const double REUSE_PERCENT = 0.9;
> +	bitmap_set(reuse, pos);
> +}
>   
> +int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> +				       struct packed_git **packfile_out,
> +				       uint32_t *entries,
> +				       struct bitmap **reuse_out)
> +{
>   	struct bitmap *result = bitmap_git->result;
> -	uint32_t reuse_threshold;
> -	uint32_t i, reuse_objects = 0;
> +	struct bitmap *reuse;
> +	struct pack_window *w_curs = NULL;
> +	size_t i = 0;
> +	uint32_t offset;
>   
>   	assert(result);
>   
> -	for (i = 0; i < result->word_alloc; ++i) {
> -		if (result->words[i] != (eword_t)~0) {
> -			reuse_objects += ewah_bit_ctz64(~result->words[i]);
> -			break;
> -		}
> -
> -		reuse_objects += BITS_IN_EWORD;
> -	}
> +	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
> +		i++;
>   
> -#ifdef GIT_BITMAP_DEBUG
> -	{
> -		const unsigned char *sha1;
> -		struct revindex_entry *entry;
> +	/* Don't mark objects not in the packfile */
> +	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> +		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>   
> -		entry = &bitmap_git->reverse_index->revindex[reuse_objects];
> -		sha1 = nth_packed_object_sha1(bitmap_git->pack, entry->nr);
> +	reuse = bitmap_word_alloc(i);
> +	memset(reuse->words, 0xFF, i * sizeof(eword_t));
>   
> -		fprintf(stderr, "Failed to reuse at %d (%016llx)\n",
> -			reuse_objects, result->words[i]);
> -		fprintf(stderr, " %s\n", hash_to_hex(sha1));
> -	}
> -#endif
> +	for (; i < result->word_alloc; ++i) {
> +		eword_t word = result->words[i];
> +		size_t pos = (i * BITS_IN_EWORD);
>   
> -	if (!reuse_objects)
> -		return -1;
> +		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
> +			if ((word >> offset) == 0)
> +				break;
>   
> -	if (reuse_objects >= bitmap_git->pack->num_objects) {
> -		bitmap_git->reuse_objects = *entries = bitmap_git->pack->num_objects;
> -		*up_to = -1; /* reuse the full pack */
> -		*packfile = bitmap_git->pack;
> -		return 0;
> +			offset += ewah_bit_ctz64(word >> offset);
> +			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
> +		}
>   	}
>   
> -	reuse_threshold = bitmap_popcount(bitmap_git->result) * REUSE_PERCENT;
> +	unuse_pack(&w_curs);
>   
> -	if (reuse_objects < reuse_threshold)
> +	*entries = bitmap_popcount(reuse);
> +	if (!*entries) {
> +		bitmap_free(reuse);
>   		return -1;
> +	}
>   
> -	bitmap_git->reuse_objects = *entries = reuse_objects;
> -	*up_to = bitmap_git->pack->revindex[reuse_objects].offset;
> -	*packfile = bitmap_git->pack;
> -
> +	/*
> +	 * Drop any reused objects from the result, since they will not
> +	 * need to be handled separately.
> +	 */
> +	bitmap_and_not(result, reuse);
> +	*packfile_out = bitmap_git->pack;
> +	*reuse_out = reuse;
>   	return 0;
>   }
>   
> diff --git a/pack-bitmap.h b/pack-bitmap.h
> index 6ab6033dbe..bcd03b8993 100644
> --- a/pack-bitmap.h
> +++ b/pack-bitmap.h
> @@ -50,7 +50,8 @@ void test_bitmap_walk(struct rev_info *revs);
>   struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs);
>   int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
>   				       struct packed_git **packfile,
> -				       uint32_t *entries, off_t *up_to);
> +				       uint32_t *entries,
> +				       struct bitmap **reuse_out);
>   int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping,
>   			     kh_oid_map_t *reused_bitmaps, int show_progress);
>   void free_bitmap_index(struct bitmap_index *);
Apologies if these are dumb queries..
Philip

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-10-19 15:25   ` Philip Oakley
@ 2019-10-19 18:55     ` Christian Couder
  2019-10-19 20:15       ` Philip Oakley
  2019-10-19 23:18     ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Couder @ 2019-10-19 18:55 UTC (permalink / raw)
  To: Philip Oakley
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

Hi Philip,

On Sat, Oct 19, 2019 at 5:25 PM Philip Oakley <philipoakley@iee.email> wrote:
>
> Hi Christian,
> can I check one thing?

Yeah, sure! Thanks for taking a look at my patches!

> On 19/10/2019 11:35, Christian Couder wrote:

> > +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
> > +                      struct bitmap *bitmap, const struct object_id *oid)
> > +{
> > +     int idx;
> Excuse my ignorance here...
>
> For the case on Windows (int/long 32 bit), is this return value
> guaranteed to be less than 2GiB, i.e. not a memory offset?
>
> I'm just thinking ahead to the resolution of the 4GiB file limit issue
> on Git-for-Windows (https://github.com/git-for-windows/git/pull/2179)

I understand your concern, unfortunately, below we have:

idx = bitmap_position(bitmap_git, oid);

and bitmap_position() returns an int at least since 3ae5fa0768
(pack-bitmap: remove bitmap_git global variable, 2018-06-07)

So I think the fix would be much more involved than just changing the
type of the idx variable. It would likely involve modifying
bitmap_position(), and thus would probably best be addressed in a
separate patch series.

> > +
> > +     if (!bitmap)
> > +             return 0;
> > +
> > +     idx = bitmap_position(bitmap_git, oid);
> > +     return idx >= 0 && bitmap_get(bitmap, idx);
> > +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 15:30   ` Philip Oakley
@ 2019-10-19 19:20     ` Christian Couder
  2019-10-19 23:23       ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Couder @ 2019-10-19 19:20 UTC (permalink / raw)
  To: Philip Oakley
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

Hi Philip,

On Sat, Oct 19, 2019 at 5:30 PM Philip Oakley <philipoakley@iee.email> wrote:

> On 19/10/2019 11:35, Christian Couder wrote:

> > +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> > +                               struct pack_window **w_curs)
> > +{
> > +     off_t offset, next, cur;
> > +     enum object_type type;
> > +     unsigned long size;
>
> Is this a mem_sized size or a counter for less that 4GiB items?

What I can see is that `&size` is passed as the last argument to
unpack_object_header() below. And unpack_object_header() is defined in
packfile.h like this:

int unpack_object_header(struct packed_git *, struct pack_window **,
off_t *, unsigned long *);

since at least 336226c259 (packfile.h: drop extern from function
declarations, 2019-04-05)

So fixing this, if it needs to be fixed, should probably be part of a
separate topic fixing unpack_object_header().

> > +
> > +     offset = reuse_packfile->revindex[pos].offset;
> > +     next = reuse_packfile->revindex[pos + 1].offset;
> >
> > -     if (reuse_packfile_offset < 0)
> > -             reuse_packfile_offset = reuse_packfile->pack_size - the_hash_algo->rawsz;
> > +     record_reused_object(offset, offset - hashfile_total(out));
> >
> > -     total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
> > +     cur = offset;
> > +     type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> > +     assert(type >= 0);

> > +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> > +                           size_t pos,
> > +                           struct bitmap *reuse,
> > +                           struct pack_window **w_curs)
> >   {
> > +     struct revindex_entry *revidx;
> > +     off_t offset;
> > +     enum object_type type;
> > +     unsigned long size;
>
> Is this mem_sized or a <4GiB size?

Again this `size` variable is passed as the last argument to
unpack_object_header() below.

...

> Apologies if these are dumb queries..

I think it's a valid concern, so it's certainly ok for you to ask
these questions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-10-19 18:55     ` Christian Couder
@ 2019-10-19 20:15       ` Philip Oakley
  0 siblings, 0 replies; 22+ messages in thread
From: Philip Oakley @ 2019-10-19 20:15 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Ramsay Jones,
	Jonathan Tan

Hi Christian,

On 19/10/2019 19:55, Christian Couder wrote:
> Hi Philip,
>
> On Sat, Oct 19, 2019 at 5:25 PM Philip Oakley <philipoakley@iee.email> wrote:
>> Hi Christian,
>> can I check one thing?
> Yeah, sure! Thanks for taking a look at my patches!
>
>> On 19/10/2019 11:35, Christian Couder wrote:
>>> +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
>>> +                      struct bitmap *bitmap, const struct object_id *oid)
>>> +{
>>> +     int idx;
>> Excuse my ignorance here...
>>
>> For the case on Windows (int/long 32 bit), is this return value
>> guaranteed to be less than 2GiB, i.e. not a memory offset?
>>
>> I'm just thinking ahead to the resolution of the 4GiB file limit issue
>> on Git-for-Windows (https://github.com/git-for-windows/git/pull/2179)
> I understand your concern, unfortunately, below we have:
>
> idx = bitmap_position(bitmap_git, oid);
>
> and bitmap_position() returns an int at least since 3ae5fa0768
> (pack-bitmap: remove bitmap_git global variable, 2018-06-07)
>
> So I think the fix would be much more involved than just changing the
> type of the idx variable. It would likely involve modifying
> bitmap_position(), and thus would probably best be addressed in a
> separate patch series.

So, IIUC it is mem-sized, so I should at least note it and pay attention 
to it for my >4G series, which like you say is "much more involved than 
just"...

The patch to flip over all the affected locations is a bit humongous 
(big), plus it's a bit of a moving target...
>>> +
>>> +     if (!bitmap)
>>> +             return 0;
>>> +
>>> +     idx = bitmap_position(bitmap_git, oid);
>>> +     return idx >= 0 && bitmap_get(bitmap, idx);
>>> +}
Philip

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains()
  2019-10-19 15:25   ` Philip Oakley
  2019-10-19 18:55     ` Christian Couder
@ 2019-10-19 23:18     ` Jeff King
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff King @ 2019-10-19 23:18 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Christian Couder, git, Junio C Hamano, Christian Couder,
	Ramsay Jones, Jonathan Tan

On Sat, Oct 19, 2019 at 04:25:19PM +0100, Philip Oakley wrote:

> > +int bitmap_walk_contains(struct bitmap_index *bitmap_git,
> > +			 struct bitmap *bitmap, const struct object_id *oid)
> > +{
> > +	int idx;
> Excuse my ignorance here...
> 
> For the case on Windows (int/long 32 bit), is this return value guaranteed
> to be less than 2GiB, i.e. not a memory offset?
> 
> I'm just thinking ahead to the resolution of the 4GiB file limit issue on
> Git-for-Windows (https://github.com/git-for-windows/git/pull/2179)

Yes, it's not a memory offset.

This "idx" here (and the return value of bitmap_position) represents a
position within an array of objects. This isn't strictly limited to the
objects in a single pack (because a traversal might extend to objects
outside the bitmapped pack), but we can use that as a general ballpark.
And it's limited to a 4-byte object count already.

So the "best" type here would be a uint32_t (which is used elsewhere
in the pack code), but we use signedness to indicate that the object
wasn't found.

That's probably OK. The biggest repos I've seen have on the order of
10-100M objects. That still gives us a factor of 20 before we hit 2^31.
If we imagine those repos took 10 years or so to accrue that many
objects, then we probably still have 200 years of growth left. Of course
growth accelerates over time, but I suspect repos with 2B objects will
run into other scaling problems first. So I don't think it's worth
worrying about too much for now.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 19:20     ` Christian Couder
@ 2019-10-19 23:23       ` Jeff King
  2019-10-20 11:26         ` Philip Oakley
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2019-10-19 23:23 UTC (permalink / raw)
  To: Christian Couder
  Cc: Philip Oakley, git, Junio C Hamano, Christian Couder,
	Ramsay Jones, Jonathan Tan

On Sat, Oct 19, 2019 at 09:20:11PM +0200, Christian Couder wrote:

> > > +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> > > +                               struct pack_window **w_curs)
> > > +{
> > > +     off_t offset, next, cur;
> > > +     enum object_type type;
> > > +     unsigned long size;
> >
> > Is this a mem_sized size or a counter for less that 4GiB items?
> 
> What I can see is that `&size` is passed as the last argument to
> unpack_object_header() below. And unpack_object_header() is defined in
> packfile.h like this:
> 
> int unpack_object_header(struct packed_git *, struct pack_window **,
> off_t *, unsigned long *);
> 
> since at least 336226c259 (packfile.h: drop extern from function
> declarations, 2019-04-05)
> 
> So fixing this, if it needs to be fixed, should probably be part of a
> separate topic fixing unpack_object_header().

Yeah, this one definitely should be moved to whatever we used to
represent object sizes in the future (size_t, or I guess off_t if we
really want to handle huge objects on 32-bit systems too). But
definitely it shouldn't happen in this series, and I don't think anybody
interested in the other topic (converting the integer type for object
sizes) needs to keep tabs on it. When they convert
unpack_object_header(), the compiler will complain because of passing
it as a pointer (the more insidious ones will be where we return an
unsigned long to represent an object type, and somebody will have to
look into every caller).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 23:23       ` Jeff King
@ 2019-10-20 11:26         ` Philip Oakley
  0 siblings, 0 replies; 22+ messages in thread
From: Philip Oakley @ 2019-10-20 11:26 UTC (permalink / raw)
  To: Jeff King, Christian Couder
  Cc: git, Junio C Hamano, Christian Couder, Ramsay Jones, Jonathan Tan

On 20/10/2019 00:23, Jeff King wrote:
> On Sat, Oct 19, 2019 at 09:20:11PM +0200, Christian Couder wrote:
>
>>>> +static void write_reused_pack_one(size_t pos, struct hashfile *out,
>>>> +                               struct pack_window **w_curs)
>>>> +{
>>>> +     off_t offset, next, cur;
>>>> +     enum object_type type;
>>>> +     unsigned long size;
>>> Is this a mem_sized size or a counter for less that 4GiB items?
>> What I can see is that `&size` is passed as the last argument to
>> unpack_object_header() below. And unpack_object_header() is defined in
>> packfile.h like this:
>>
>> int unpack_object_header(struct packed_git *, struct pack_window **,
>> off_t *, unsigned long *);
>>
>> since at least 336226c259 (packfile.h: drop extern from function
>> declarations, 2019-04-05)
>>
>> So fixing this, if it needs to be fixed, should probably be part of a
>> separate topic fixing unpack_object_header().
> Yeah, this one definitely should be moved to whatever we used to
> represent object sizes in the future (size_t, or I guess off_t if we
> really want to handle huge objects on 32-bit systems too). But
> definitely it shouldn't happen in this series, and I don't think anybody
> interested in the other topic (converting the integer type for object
> sizes) needs to keep tabs on it. When they convert
> unpack_object_header(), the compiler will complain because of passing
> it as a pointer (the more insidious ones will be where we return an
> unsigned long to represent an object type, and somebody will have to
> look into every caller).
>
Thanks, I'll add that one to my list of size values that I should check 
when rebasing my current series.

If I understand correctly gcc is no longer detecting those size 
conversions, but clang has -Wshorten-64-to-32 but I've still to check if 
it catches some of these conversion issues (esp when int (32) is 
extended to 64 bit size_t, rather than being 64 bit in the first place)

Philip

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-10-19 10:35 ` [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
@ 2019-10-22 17:46   ` Jonathan Tan
  2019-10-26  9:29     ` Christian Couder
  0 siblings, 1 reply; 22+ messages in thread
From: Jonathan Tan @ 2019-10-22 17:46 UTC (permalink / raw)
  To: christian.couder; +Cc: git, peff, chriscool, Jonathan Tan

> From: Jeff King <peff@peff.net>
> 
> In a following commit we will need to allocate a variable
> number of bitmap words, instead of always 32, so let's add
> bitmap_word_alloc() for this purpose.
> 
> We will also always access at least one word for each bitmap,
> so we want to make sure that at least one is always
> allocated.

The last paragraph is not true - we still can allocate 0. (We just ensure
that when we grow, we grow to at least 1.) I think we should just
delete the last paragraph - the first paragraph is sufficient.

Other than that, all patches up to but not including the last one look
good, except the ones that just add a new function because I'll have to
review the last patch to see how they are used.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
  2019-10-19 15:30   ` Philip Oakley
@ 2019-10-22 19:48   ` Jonathan Tan
  2019-10-26  9:29     ` Christian Couder
  1 sibling, 1 reply; 22+ messages in thread
From: Jonathan Tan @ 2019-10-22 19:48 UTC (permalink / raw)
  To: christian.couder; +Cc: git, gitster, peff, chriscool, Jonathan Tan

First, the commit message could probably be reordered to start with the
main point (reusing of packfiles at object granularity instead of
allowing reuse of only one contiguous region). To specific points:

> The dynamic array of `struct reused_chunk` is useful
> because we need to know not just the number of zero bits,
> but the accumulated offset due to missing objects.

The number of zero bits is irrelevant, I think. (I know I introduced
this idea, but at that time I was somehow confused that the offset was a
number of objects and not a number of bytes.)

> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.

I still don't understand this part. Going off what's written in the
commit message here, it seems to me that the issue is that (1) an object
can be added to the reuse bitmap without going through to_pack, and (2)
this is done when the client asks for a tag by SHA-1. But all objects
when specified by SHA-1 go through to_pack, don't they?

On to the review of the code. The ordering of - and + lines are
confusing, so I have just removed all - lines.

> +/*
> + * Record the offsets needed in our reused packfile chunks due to
> + * "gaps" where we omitted some objects.
> + */
> +static struct reused_chunk {
> +	off_t start;
> +	off_t offset;
> +} *reused_chunks;
> +static int reused_chunks_nr;
> +static int reused_chunks_alloc;

A chunk is a set of objects from the original packfile that we are
reusing; all objects in a chunk have the same relative position in the
original packfile and the generated packfile. (This does not mean that
the bytes are exactly the same or, even, that the last object in the
chunk has the same size in the original packfile and the generated
packfile.)

"start" is the offset of the first object of the chunk in the original
packfile.

"offset" is "start" minus the offset of the first object of the chunk in
the generated packfile. (I think it is easier to understand if this was
"new minus old" instead of "old minus new", that is, the negation of its
current value.)

So I would prefer:

  /*
   * A reused set of objects. All objects in a chunk have the same
   * relative position in the original packfile and the generated
   * packfile.
   */
  struct reused_chunk {
    /* The offset of the first object of this chunk in the original
     * packfile. */
    off_t original;
    /* The offset of the first object of this chunk in the generated
     * packfile minus "original". */
    off_t difference;
  }

> +static void record_reused_object(off_t where, off_t offset)

Straightforward, other than the renaming of parameters.

> +/*
> + * Binary search to find the chunk that "where" is in. Note
> + * that we're not looking for an exact match, just the first
> + * chunk that contains it (which implicitly ends at the start
> + * of the next chunk.
> + */
> +static off_t find_reused_offset(off_t where)

Also straightforward.

> +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> +				  struct pack_window **w_curs)
> +{
> +	off_t offset, next, cur;
> +	enum object_type type;
> +	unsigned long size;
> +
> +	offset = reuse_packfile->revindex[pos].offset;
> +	next = reuse_packfile->revindex[pos + 1].offset;
>  
> +	record_reused_object(offset, offset - hashfile_total(out));

This is where I understood what "offset" meant in struct reused_chunk.

>  
> +	cur = offset;
> +	type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> +	assert(type >= 0);
>  
> +	if (type == OBJ_OFS_DELTA) {
> +		off_t base_offset;
> +		off_t fixup;
> +
> +		unsigned char header[MAX_PACK_OBJECT_HEADER];
> +		unsigned len;
> +
> +		base_offset = get_delta_base(reuse_packfile, w_curs, &cur, type, offset);
> +		assert(base_offset != 0);
> +
> +		/* Convert to REF_DELTA if we must... */
> +		if (!allow_ofs_delta) {
> +			int base_pos = find_revindex_position(reuse_packfile, base_offset);
> +			const unsigned char *base_sha1 =
> +				nth_packed_object_sha1(reuse_packfile,
> +						       reuse_packfile->revindex[base_pos].nr);
> +
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_REF_DELTA, size);
> +			hashwrite(out, header, len);
> +			hashwrite(out, base_sha1, 20);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
>  
> +		/* Otherwise see if we need to rewrite the offset... */
> +		fixup = find_reused_offset(offset) -
> +			find_reused_offset(base_offset);

The "offset" in "reused offset" is actually the difference in offset
between the original packfile and the generated packfile, as I explained
earlier. If the base_original->base_generated and
delta_original->delta_generated distances are the same, then nothing
needs to be done, as the comment above correctly explains.

> +		if (fixup) {
> +			unsigned char ofs_header[10];
> +			unsigned i, ofs_len;
> +			off_t ofs = offset - base_offset - fixup;
>  
> +			len = encode_in_pack_object_header(header, sizeof(header),
> +							   OBJ_OFS_DELTA, size);
>  
> +			i = sizeof(ofs_header) - 1;
> +			ofs_header[i] = ofs & 127;
> +			while (ofs >>= 7)
> +				ofs_header[--i] = 128 | (--ofs & 127);
> +
> +			ofs_len = sizeof(ofs_header) - i;
> +
> +			hashwrite(out, header, len);
> +			hashwrite(out, ofs_header + sizeof(ofs_header) - ofs_len, ofs_len);
> +			copy_pack_data(out, reuse_packfile, w_curs, cur, next - cur);
> +			return;
> +		}
> +
> +		/* ...otherwise we have no fixup, and can write it verbatim */
> +	}
> +
> +	copy_pack_data(out, reuse_packfile, w_curs, offset, next - offset);
> +}

We can write verbatim if (OFS + allow_ofs_delta + no fixup necessary) or
if it is REF.

> +static size_t write_reused_pack_verbatim(struct hashfile *out,
> +					 struct pack_window **w_curs)
> +{
> +	size_t pos = 0;
> +
> +	while (pos < reuse_packfile_bitmap->word_alloc &&
> +			reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> +		pos++;
> +
> +	if (pos) {
> +		off_t to_write;
> +
> +		written = (pos * BITS_IN_EWORD);
> +		to_write = reuse_packfile->revindex[written].offset
> +			- sizeof(struct pack_header);
> +
> +		record_reused_object(sizeof(struct pack_header), 0);

Probably worth a comment, since we're recording one chunk, not one
object. The code is correct, though - one big chunk with original offset
as written, and no difference between original and generated offsets.

> +		hashflush(out);
> +		copy_pack_data(out, reuse_packfile, w_curs,
> +			sizeof(struct pack_header), to_write);
>  
>  		display_progress(progress_state, written);
>  	}
> +	return pos;
> +}

The number of objects written is always a multiple of <number of bits in
eword_t>, but that's fine.

> +static void write_reused_pack(struct hashfile *f)
> +{
> +	size_t i = 0;
> +	uint32_t offset;
> +	struct pack_window *w_curs = NULL;
> +
> +	if (allow_ofs_delta)
> +		i = write_reused_pack_verbatim(f, &w_curs);
> +
> +	for (; i < reuse_packfile_bitmap->word_alloc; ++i) {
> +		eword_t word = reuse_packfile_bitmap->words[i];
> +		size_t pos = (i * BITS_IN_EWORD);
> +
> +		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
> +			if ((word >> offset) == 0)
> +				break;
> +
> +			offset += ewah_bit_ctz64(word >> offset);
> +			write_reused_pack_one(pos + offset, f, &w_curs);
> +			display_progress(progress_state, ++written);
> +		}
> +	}
>  
> +	unuse_pack(&w_curs);
>  }

Looks straightforward to me. I found the inner "for" loop a bit
confusing, because the counter variable ("offset") is incremented both
in the 3rd clause and in the body.

>  		if (reuse_packfile) {
> -			off_t packfile_size;
>  			assert(pack_to_stdout);
> -
> -			packfile_size = write_reused_pack(f);
> -			offset += packfile_size;
> +			write_reused_pack(f);
> +			offset = hashfile_total(f);
>  		}

Straightforward.

> @@ -1001,6 +1119,10 @@ static int have_duplicate_entry(const struct object_id *oid,
>  {
>  	struct object_entry *entry;
>  
> +	if (reuse_packfile_bitmap &&
> +	    bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid))
> +		return 1;
> +
>  	entry = packlist_find(&to_pack, oid);
>  	if (!entry)
>  		return 0;
> @@ -2555,7 +2677,9 @@ static void ll_find_deltas(struct object_entry **list, unsigned list_size,
>  
>  static int obj_is_packed(const struct object_id *oid)
>  {
> +	return packlist_find(&to_pack, oid) ||
> +		(reuse_packfile_bitmap &&
> +		 bitmap_walk_contains(bitmap_git, reuse_packfile_bitmap, oid));
>  }

These are the parts I was confused about, as I said when I was
discussing the commit message.

> @@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth)
>  
>  	if (nr_deltas && n > 1) {
>  		unsigned nr_done = 0;
> +
>  		if (progress)
>  			progress_state = start_progress(_("Compressing objects"),
>  							nr_deltas);

Unnecessary added line - please remove.

On to pack-bitmap.c.

> +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> +			      size_t pos,
> +			      struct bitmap *reuse,
> +			      struct pack_window **w_curs)
>  {
> +	struct revindex_entry *revidx;
> +	off_t offset;
> +	enum object_type type;
> +	unsigned long size;
> +
> +	if (pos >= bitmap_git->pack->num_objects)
> +		return; /* not actually in the pack */

Is this case possible? try_partial_reuse() is only called when there is
a 1 at the bit location.

> +
> +	revidx = &bitmap_git->pack->revindex[pos];
> +	offset = revidx->offset;
> +	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
> +	if (type < 0)
> +		return; /* broken packfile, punt */
> +
> +	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
> +		off_t base_offset;
> +		int base_pos;
> +
> +		/*
> +		 * Find the position of the base object so we can look it up
> +		 * in our bitmaps. If we can't come up with an offset, or if
> +		 * that offset is not in the revidx, the pack is corrupt.
> +		 * There's nothing we can do, so just punt on this object,
> +		 * and the normal slow path will complain about it in
> +		 * more detail.
> +		 */
> +		base_offset = get_delta_base(bitmap_git->pack, w_curs,
> +					     &offset, type, revidx->offset);
> +		if (!base_offset)
> +			return;
> +		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
> +		if (base_pos < 0)
> +			return;
> +
> +		/*
> +		 * We assume delta dependencies always point backwards. This
> +		 * lets us do a single pass, and is basically always true
> +		 * due to the way OFS_DELTAs work. You would not typically
> +		 * find REF_DELTA in a bitmapped pack, since we only bitmap
> +		 * packs we write fresh, and OFS_DELTA is the default). But
> +		 * let's double check to make sure the pack wasn't written with
> +		 * odd parameters.
> +		 */
> +		if (base_pos >= pos)
> +			return;
> +
> +		/*
> +		 * And finally, if we're not sending the base as part of our
> +		 * reuse chunk, then don't send this object either. The base
> +		 * would come after us, along with other objects not
> +		 * necessarily in the pack, which means we'd need to convert
> +		 * to REF_DELTA on the fly. Better to just let the normal
> +		 * object_entry code path handle it.
> +		 */
> +		if (!bitmap_get(reuse, base_pos))
> +			return;
> +	}
> +
>  	/*
> +	 * If we got here, then the object is OK to reuse. Mark it.
>  	 */
> +	bitmap_set(reuse, pos);
> +}

OK. Thanks for the great comments.

> +int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> +				       struct packed_git **packfile_out,
> +				       uint32_t *entries,
> +				       struct bitmap **reuse_out)
> +{
>  	struct bitmap *result = bitmap_git->result;
> +	struct bitmap *reuse;
> +	struct pack_window *w_curs = NULL;
> +	size_t i = 0;
> +	uint32_t offset;
>  
>  	assert(result);
>  
> +	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
> +		i++;

Skip through fully-1 words so that we don't have to iterate through
every bit - sounds good.

>  
> +	/* Don't mark objects not in the packfile */
> +	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> +		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;

Why is this necessary? Let's say we have 42 fully-1 words, and therefore
i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

>  
> +	reuse = bitmap_word_alloc(i);
> +	memset(reuse->words, 0xFF, i * sizeof(eword_t));
>  
> +	for (; i < result->word_alloc; ++i) {
> +		eword_t word = result->words[i];
> +		size_t pos = (i * BITS_IN_EWORD);
>  
> +		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
> +			if ((word >> offset) == 0)
> +				break;
>  
> +			offset += ewah_bit_ctz64(word >> offset);
> +			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
> +		}
>  	}
>  
> +	unuse_pack(&w_curs);
>  
> +	*entries = bitmap_popcount(reuse);
> +	if (!*entries) {
> +		bitmap_free(reuse);
>  		return -1;
> +	}
>  
> +	/*
> +	 * Drop any reused objects from the result, since they will not
> +	 * need to be handled separately.
> +	 */
> +	bitmap_and_not(result, reuse);
> +	*packfile_out = bitmap_git->pack;
> +	*reuse_out = reuse;
>  	return 0;
>  }

Looks good.

And thanks for this patch - other than the difficult-to-understand parts
I described above, I found the overall flow easy to discern.

I presume tests would also need to be written. One way is to introduce
yet another test variable, but maybe that is overkill.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
  2019-10-22 19:48   ` Jonathan Tan
@ 2019-10-26  9:29     ` Christian Couder
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-26  9:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano, Jeff King, Christian Couder

On Tue, Oct 22, 2019 at 9:48 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> First, the commit message could probably be reordered to start with the
> main point (reusing of packfiles at object granularity instead of
> allowing reuse of only one contiguous region). To specific points:

Done in my current version.

> > The dynamic array of `struct reused_chunk` is useful
> > because we need to know not just the number of zero bits,
> > but the accumulated offset due to missing objects.
>
> The number of zero bits is irrelevant, I think. (I know I introduced
> this idea, but at that time I was somehow confused that the offset was a
> number of objects and not a number of bytes.)

Ok, I removed "not just the number of zero bits,".

> > If a client both asks for a tag by sha1 and specifies
> > "include-tag", we may end up including the tag in the reuse
> > bitmap (due to the first thing), and then later adding it to
> > the packlist (due to the second). This results in duplicate
> > objects in the pack, which git chokes on. We should notice
> > that we are already including it when doing the include-tag
> > portion, and avoid adding it to the packlist.
>
> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

That's how Peff explained it in response to one of your previous
messages. Peff, would you mind answering Jonathan's questions?

At the end of the commit message I added the following which might help though:

>> No tests, because git does not actually exhibit this "ask
>> for it and also include-tag" behavior. We do one or the
>> other on clone, depending on whether --single-branch is set.
>> However, libgit2 does both.

So my wild guess sould be that maybe the reason is that some of this
code was shared (or intended to be shared) with libgit2?

> On to the review of the code. The ordering of - and + lines are
> confusing, so I have just removed all - lines.
>
> > +/*
> > + * Record the offsets needed in our reused packfile chunks due to
> > + * "gaps" where we omitted some objects.
> > + */
> > +static struct reused_chunk {
> > +     off_t start;
> > +     off_t offset;
> > +} *reused_chunks;
> > +static int reused_chunks_nr;
> > +static int reused_chunks_alloc;
>
> A chunk is a set of objects from the original packfile that we are
> reusing; all objects in a chunk have the same relative position in the
> original packfile and the generated packfile. (This does not mean that
> the bytes are exactly the same or, even, that the last object in the
> chunk has the same size in the original packfile and the generated
> packfile.)
>
> "start" is the offset of the first object of the chunk in the original
> packfile.
>
> "offset" is "start" minus the offset of the first object of the chunk in
> the generated packfile. (I think it is easier to understand if this was
> "new minus old" instead of "old minus new", that is, the negation of its
> current value.)
>
> So I would prefer:
>
>   /*
>    * A reused set of objects. All objects in a chunk have the same
>    * relative position in the original packfile and the generated
>    * packfile.
>    */
>   struct reused_chunk {
>     /* The offset of the first object of this chunk in the original
>      * packfile. */
>     off_t original;
>     /* The offset of the first object of this chunk in the generated
>      * packfile minus "original". */
>     off_t difference;
>   }

Ok, I have used that though it introduces some discrepancies, as
record_reused_object() for example still has an "offset" argument.

> > +static size_t write_reused_pack_verbatim(struct hashfile *out,
> > +                                      struct pack_window **w_curs)
> > +{
> > +     size_t pos = 0;
> > +
> > +     while (pos < reuse_packfile_bitmap->word_alloc &&
> > +                     reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> > +             pos++;
> > +
> > +     if (pos) {
> > +             off_t to_write;
> > +
> > +             written = (pos * BITS_IN_EWORD);
> > +             to_write = reuse_packfile->revindex[written].offset
> > +                     - sizeof(struct pack_header);
> > +
> > +             record_reused_object(sizeof(struct pack_header), 0);
>
> Probably worth a comment, since we're recording one chunk, not one
> object.

Ok, added a comment.

> The code is correct, though - one big chunk with original offset
> as written, and no difference between original and generated offsets.


> > @@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth)
> >
> >       if (nr_deltas && n > 1) {
> >               unsigned nr_done = 0;
> > +
> >               if (progress)
> >                       progress_state = start_progress(_("Compressing objects"),
> >                                                       nr_deltas);
>
> Unnecessary added line - please remove.

We often separate variable declaration from the rest of function code
with a blank line, so I don't think the added line is completely
unnecessary. It helps make the code more standard and therefore more
readable, so I left it.

You might say that it should be in a preparatory patch, but a
preparatory patch for just such a small change is a bit wasteful. I
think it's ok to add this blank line "while at it" in this patch.

> On to pack-bitmap.c.
>
> > +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> > +                           size_t pos,
> > +                           struct bitmap *reuse,
> > +                           struct pack_window **w_curs)
> >  {
> > +     struct revindex_entry *revidx;
> > +     off_t offset;
> > +     enum object_type type;
> > +     unsigned long size;
> > +
> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

I'd like Peff to answer here too, as I don't think I fully understand
what's going on here.

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

I'd also prefer Peff to answer here.

[...]

> And thanks for this patch - other than the difficult-to-understand parts
> I described above, I found the overall flow easy to discern.
>
> I presume tests would also need to be written. One way is to introduce
> yet another test variable, but maybe that is overkill.

This part of the commit message might again help understand why there
is no test:

>> No tests, because git does not actually exhibit this "ask
>> for it and also include-tag" behavior. We do one or the
>> other on clone, depending on whether --single-branch is set.
>> However, libgit2 does both.

But I'd be again happy if Peff could confirm.

Thanks for another review,
Christian.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc()
  2019-10-22 17:46   ` Jonathan Tan
@ 2019-10-26  9:29     ` Christian Couder
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Couder @ 2019-10-26  9:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Jeff King, Christian Couder

On Tue, Oct 22, 2019 at 7:46 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > From: Jeff King <peff@peff.net>
> >
> > In a following commit we will need to allocate a variable
> > number of bitmap words, instead of always 32, so let's add
> > bitmap_word_alloc() for this purpose.
> >
> > We will also always access at least one word for each bitmap,
> > so we want to make sure that at least one is always
> > allocated.
>
> The last paragraph is not true - we still can allocate 0. (We just ensure
> that when we grow, we grow to at least 1.) I think we should just
> delete the last paragraph - the first paragraph is sufficient.

Ok, I removed the last paragraph in my current version.

> Other than that, all patches up to but not including the last one look
> good, except the ones that just add a new function because I'll have to
> review the last patch to see how they are used.

Thanks for your review,
Christian.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-10-26  9:34 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
2019-10-19 10:35 ` [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
2019-10-19 10:35 ` [PATCH v2 2/9] packfile: expose get_delta_base() Christian Couder
2019-10-19 10:35 ` [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-10-22 17:46   ` Jonathan Tan
2019-10-26  9:29     ` Christian Couder
2019-10-19 10:35 ` [PATCH v2 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-10-19 10:35 ` [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-10-19 15:25   ` Philip Oakley
2019-10-19 18:55     ` Christian Couder
2019-10-19 20:15       ` Philip Oakley
2019-10-19 23:18     ` Jeff King
2019-10-19 10:35 ` [PATCH v2 6/9] csum-file: introduce hashfile_total() Christian Couder
2019-10-19 10:35 ` [PATCH v2 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-10-19 10:35 ` [PATCH v2 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
2019-10-19 15:30   ` Philip Oakley
2019-10-19 19:20     ` Christian Couder
2019-10-19 23:23       ` Jeff King
2019-10-20 11:26         ` Philip Oakley
2019-10-22 19:48   ` Jonathan Tan
2019-10-26  9:29     ` Christian Couder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).