git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index
@ 2018-12-10 18:06 Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
                   ` (5 more replies)
  0 siblings, 6 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano

The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.
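
As a rough illustration of that tie-break (not the code in midx.c), the
choice among duplicate copies can be sketched as below. The struct and
function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for one pack-file that contains the object. */
struct pack_candidate {
	const char *name;
	time_t mtime;
};

/*
 * Return the index of the most-recently modified candidate.
 * When two candidates share an mtime, the earlier array entry wins.
 */
size_t pick_newest_pack(const struct pack_candidate *packs, size_t nr)
{
	size_t i, best = 0;

	assert(nr > 0);
	for (i = 1; i < nr; i++)
		if (packs[i].mtime > packs[best].mtime)
			best = i;
	return best;
}
```

A pack that never wins this comparison for any of its objects ends up
with no references in the multi-pack-index.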

Create new verbs for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.

 * 'git multi-pack-index repack --batch-size=<size>': Starting from the
   oldest pack-files covered by the multi-pack-index, select those whose
   on-disk size is below the batch size until the selected packs' sizes
   add up to at least the batch size. Create a new pack containing all
   objects that the multi-pack-index maps to those packs.

This allows us to create a new pattern for repacking objects: run 'repack',
wait until every Git command that started before that 'repack' has
finished, then run 'expire'. This approach has some advantages over the
existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
    of repacking all reachable objects. We can also limit ourselves to the
    objects that do not appear in newer pack-files.

 2. Highly Available. By adding a new pack-file (and not deleting the old
    pack-files) we do not interrupt concurrent Git commands, and do not
    suffer performance degradation. By expiring only pack-files that have no
    referenced objects, we know that Git commands that are doing normal
    object lookups* will not be interrupted.

 * Note: if someone concurrently runs a Git command that uses
   get_all_packs(), then that command could try to read the pack-files and
   pack-indexes that we are deleting during an expire command. Such
   commands are usually related to object maintenance (e.g. fsck, gc,
   pack-objects) or are related to less-often-used features (e.g.
   fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache", which is a Git alternate directory filled with
pack-files containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will remain with
fewer than 100 referenced objects each. The 'repack' command (with a batch
size of 1-3 GB, probably) can condense the remaining packs in commands that
run for 1-3 min at a time. Since the daily packs range from 100-250 MB, we
will also combine and condense those packs.

Thanks, -Stolee

Derrick Stolee (5):
  multi-pack-index: prepare for 'expire' verb
  midx: refactor permutation logic
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' verb
  midx: implement midx_repack()

 Documentation/git-multi-pack-index.txt |  20 +++
 builtin/multi-pack-index.c             |  12 +-
 midx.c                                 | 222 +++++++++++++++++++++++--
 midx.h                                 |   2 +
 t/t5319-multi-pack-index.sh            |  98 +++++++++++
 5 files changed, 343 insertions(+), 11 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/92
-- 
gitgitgadget


* [PATCH 1/5] multi-pack-index: prepare for 'expire' verb
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
@ 2018-12-10 18:06 ` Derrick Stolee via GitGitGadget
  2018-12-11  1:35   ` Stefan Beller
  2018-12-10 18:06 ` [PATCH 2/5] midx: refactor permutation logic Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' verb to the multi-pack-index builtin.
This verb will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' verb is correctly wired and
that will remain valid once the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  8 +++++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 47 ++++++++++++++++++++++++++
 5 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..822d83c845 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,14 @@ verify::
 	When given as the verb, verify the contents of the MIDX file
 	at `<dir>/packs/multi-pack-index`.
 
+expire::
+	When given as the verb, delete the pack-files that are tracked
+	by the MIDX file at `<dir>/packs/multi-pack-index` but have
+	no objects referenced by the MIDX. All objects in these pack-
+	files have another copy in a more-recently modified pack-file.
+	Rewrite the MIDX file afterward to remove all references to
+	these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 2/5] midx: refactor permutation logic
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
@ 2018-12-10 18:06 ` Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 3/5] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When writing a multi-pack-index, we keep track of an integer
permutation, tracking the list of pack-files that we know about
(both from the existing multi-pack-index and the new pack-files
being introduced) and converting them into a sorted order for
the new multi-pack-index.

In anticipation of dropping pack-files from the existing multi-
pack-index, refactor the logic around how we track this permutation.

First, insert the permutation into the pack_list structure. This
allows us to grow the permutation dynamically as we add packs.

Second, fill the permutation with values corresponding to their
position in the list of pack-files, sorted as follows:

  1. The pack-files in the existing multi-pack-index,
     sorted lexicographically.

  2. The pack-files not in the existing multi-pack-index,
     sorted as discovered from the filesystem.

There is a subtle thing in how we initialize this permutation,
specifically how we use 'i' for the initial value. This will
matter more when we implement the logic for dropping existing
packs, as we will create holes in the ordering.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
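
For readers following the permutation bookkeeping, here is a small
standalone sketch of the idea: each name carries the id it had before
sorting, and after sorting perm[] maps that old id to the new position.
This is an illustration only. The names below are hypothetical, and the
sketch assumes the incoming ids are exactly 0..nr-1 with no holes, which
is the case before any packs are dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct name_and_id {
	const char *pack_name;
	uint32_t pack_int_id;
};

static int compare_by_name(const void *a_, const void *b_)
{
	const struct name_and_id *a = a_, *b = b_;

	return strcmp(a->pack_name, b->pack_name);
}

/*
 * On entry, perm[i] is the id currently attached to names[i].
 * On exit, names[] is sorted and perm[old_id] holds the position of
 * that pack in the sorted order.
 */
void sort_names_with_perm(const char **names, uint32_t nr, uint32_t *perm)
{
	struct name_and_id *pairs;
	uint32_t i;

	assert(nr > 0);
	pairs = malloc(nr * sizeof(*pairs));
	if (!pairs)
		abort();

	for (i = 0; i < nr; i++) {
		pairs[i].pack_int_id = perm[i];
		pairs[i].pack_name = names[i];
	}
	qsort(pairs, nr, sizeof(*pairs), compare_by_name);
	for (i = 0; i < nr; i++) {
		names[i] = pairs[i].pack_name;
		perm[pairs[i].pack_int_id] = i;
	}
	free(pairs);
}
```

With names { "pack-c.idx", "pack-a.idx", "pack-b.idx" } and an identity
perm[], the sorted order is a, b, c and perm[] becomes { 2, 0, 1 }.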

diff --git a/midx.c b/midx.c
index bb825ef816..3bd7183a53 100644
--- a/midx.c
+++ b/midx.c
@@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f,
 struct pack_list {
 	struct packed_git **list;
 	char **names;
+	uint32_t *perm;
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
+	uint32_t alloc_perm;
 	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
@@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
 							full_path_len,
@@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}
 
+		packs->perm[packs->nr] = packs->nr;
 		packs->names[packs->nr] = xstrdup(file_name);
 		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
@@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	ALLOC_ARRAY(pairs, nr_packs);
 
 	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
+		pairs[i].pack_int_id = perm[i];
 		pairs[i].pack_name = pack_names[i];
 	}
 
@@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir)
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct pack_list packs;
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
@@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir)
 
 	packs.nr = 0;
 	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
+	packs.alloc_perm = packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
+	packs.perm = NULL;
 	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
 			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
 			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
 
+			packs.perm[packs.nr] = i;
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
 			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
@@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir)
 		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
 					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
+	sort_packs_by_name(packs.names, packs.nr, packs.perm);
 
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -923,8 +929,8 @@ cleanup:
 
 	free(packs.list);
 	free(packs.names);
+	free(packs.perm);
 	free(entries);
-	free(pack_perm);
 	free(midx_name);
 	return 0;
 }
-- 
gitgitgadget



* [PATCH 3/5] multi-pack-index: implement 'expire' verb
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 2/5] midx: refactor permutation logic Derrick Stolee via GitGitGadget
@ 2018-12-10 18:06 ` Derrick Stolee via GitGitGadget
  2018-12-10 18:06 ` [PATCH 4/5] multi-pack-index: prepare 'repack' verb Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 87 +++++++++++++++++++++++++++++++++++--
 t/t5319-multi-pack-index.sh | 15 +++++++
 2 files changed, 98 insertions(+), 4 deletions(-)
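
The counting step can be illustrated outside of Git with the sketch
below. It takes a plain array holding the pack id recorded for each
object and marks every pack that is never referenced. The function name
and its parameters are hypothetical, and the sketch leaves out the file
deletion and the multi-pack-index rewrite.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * object_pack_ids[i] is the pack id that the index records for object i.
 * Sets expirable[p] to 1 for every pack with no referenced objects
 * (0 otherwise) and returns the number of such packs.
 */
uint32_t find_unreferenced_packs(const uint32_t *object_pack_ids,
				 uint32_t num_objects, uint32_t num_packs,
				 unsigned char *expirable)
{
	uint32_t i, nr_expirable = 0;
	uint32_t *count;

	assert(num_packs > 0);
	count = calloc(num_packs, sizeof(*count));
	if (!count)
		abort();

	for (i = 0; i < num_objects; i++) {
		assert(object_pack_ids[i] < num_packs);
		count[object_pack_ids[i]]++;
	}
	for (i = 0; i < num_packs; i++) {
		expirable[i] = !count[i];
		nr_expirable += expirable[i];
	}
	free(count);
	return nr_expirable;
}
```

For example, with three packs and objects mapped to ids { 0, 2, 2, 0 },
only pack 1 is marked as expirable.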

diff --git a/midx.c b/midx.c
index 3bd7183a53..50e4cd7270 100644
--- a/midx.c
+++ b/midx.c
@@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
@@ -787,7 +792,24 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
 	if (packs.m) {
+		int drop_index = 0, missing_drops = 0;
 		for (i = 0; i < packs.m->num_packs; i++) {
+			if (packs_to_drop && drop_index < packs_to_drop->nr) {
+				int cmp = strcmp(packs.m->pack_names[i],
+						 packs_to_drop->items[drop_index].string);
+
+				if (!cmp) {
+					drop_index++;
+					continue;
+				} else if (cmp > 0) {
+					error(_("did not see pack-file %s to drop"),
+					      packs_to_drop->items[drop_index].string);
+					drop_index++;
+					i--;
+					missing_drops++;
+				}
+			}
+
 			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
 			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
 			ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
@@ -798,6 +820,12 @@ int write_midx_file(const char *object_dir)
 			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
+
+		if (packs_to_drop && (drop_index < packs_to_drop->nr || missing_drops)) {
+			error(_("did not see all pack-files to drop"));
+			result = 1;
+			goto cleanup;
+		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
@@ -932,7 +960,12 @@ cleanup:
 	free(packs.perm);
 	free(entries);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1034,5 +1067,51 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	size_t dirlen;
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	strbuf_addstr(&buf, object_dir);
+	strbuf_addstr(&buf, "/pack/");
+	dirlen = buf.len;
+
+	for (i = 0; i < m->num_packs; i++) {
+		if (count[i])
+			continue;
+
+		if (m->packs[i]) {
+			close_pack(m->packs[i]);
+			m->packs[i] = NULL;
+		}
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+
+		strbuf_setlen(&buf, dirlen);
+		strbuf_addstr(&buf, m->pack_names[i]);
+		unlink(buf.buf);
+
+		strip_suffix_mem(buf.buf, &buf.len, "idx");
+		strbuf_addstr(&buf, "pack");
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 948effc1ee..210279a3cf 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -395,4 +395,19 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 4/5] multi-pack-index: prepare 'repack' verb
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2018-12-10 18:06 ` [PATCH 3/5] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
@ 2018-12-10 18:06 ` Derrick Stolee via GitGitGadget
  2018-12-11  1:54   ` Stefan Beller
  2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Environments where the multi-pack-index is useful typically have
many pack-files and cannot repack the object store into a single
pack-file. However, it is likely that many of these pack-files are
rather small, and could be repacked into a slightly larger pack-file
without too much effort. It may also be important to ensure the
object store is highly available and the repack operation does not
interrupt concurrent git commands.

Introduce a 'repack' verb to 'git multi-pack-index' that takes a
'--batch-size' option. The verb will inspect the multi-pack-index
for referenced pack-files whose size is smaller than the batch
size, collecting them until their sizes sum to at least the batch
size. Then, a new pack-file will be created containing the objects
from those pack-files that are referenced by the multi-pack-index.
The resulting pack is likely to actually be smaller than the batch
size due to compression and the fact that there may be objects in
the pack-files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Because
the test specifies a small batch size, it will remain valid for
future implementations, which must not change the list of
pack-files in that case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 12 ++++++++++++
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 11 +++++++++++
 5 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 822d83c845..e43e7da71e 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -39,6 +39,18 @@ expire::
 	Rewrite the MIDX file afterward to remove all references to
 	these pack-files.
 
+repack::
+	When given as the verb, collect a batch of pack-files whose
+	sizes are all at most the size given by --batch-size, but
+	whose sizes sum to larger than --batch-size. The batch is
+	selected by greedily adding small pack-files starting with
+	the oldest pack-files that fit the size. Create a new pack-
+	file containing the objects the multi-pack-index indexes
+	into those pack-files, and rewrite the multi-pack-index to
+	contain that pack-file. A later run of 'git multi-pack-index
+	expire' will delete the pack-files that were part of this
+	batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' verb"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 50e4cd7270..4caf148464 100644
--- a/midx.c
+++ b/midx.c
@@ -1115,3 +1115,8 @@ int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 210279a3cf..c23e930a5d 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -410,4 +410,15 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack does not create any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 5/5] midx: implement midx_repack()
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2018-12-10 18:06 ` [PATCH 4/5] multi-pack-index: prepare 'repack' verb Derrick Stolee via GitGitGadget
@ 2018-12-10 18:06 ` Derrick Stolee via GitGitGadget
  2018-12-11  2:32   ` Stefan Beller
  2018-12-12  7:40   ` Junio C Hamano
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  5 siblings, 2 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-10 18:06 UTC
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given batch size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 109 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  25 +++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)
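
To make the batch selection easier to reason about in isolation, here is
a standalone sketch of the greedy walk: sort by mtime, skip packs at or
above the batch size, and stop once the running total reaches the batch
size. `struct pack_info` and `select_repack_batch()` are hypothetical
names for this illustration, and the sketch assumes every pack_int_id is
less than nr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct pack_info {
	uint32_t pack_int_id;
	time_t mtime;
	size_t pack_size;
};

static int compare_by_mtime(const void *a_, const void *b_)
{
	const struct pack_info *a = a_, *b = b_;

	if (a->mtime < b->mtime)
		return -1;
	if (a->mtime > b->mtime)
		return 1;
	return 0;
}

/*
 * Sorts packs[] oldest-first and sets include[pack_int_id] = 1 for each
 * pack in the chosen batch. Returns the number of packs chosen, or 0
 * (with include[] cleared) when the batch would hold fewer than two
 * packs or would not reach batch_size.
 */
uint32_t select_repack_batch(struct pack_info *packs, uint32_t nr,
			     size_t batch_size, unsigned char *include)
{
	uint32_t i, chosen = 0;
	size_t total = 0;

	assert(nr > 0 && batch_size > 0);
	memset(include, 0, nr);
	qsort(packs, nr, sizeof(*packs), compare_by_mtime);

	for (i = 0; total < batch_size && i < nr; i++) {
		if (packs[i].pack_size >= batch_size)
			continue;	/* too large to be part of a batch */
		include[packs[i].pack_int_id] = 1;
		total += packs[i].pack_size;
		chosen++;
	}

	if (total < batch_size || chosen < 2) {
		memset(include, 0, nr);
		return 0;
	}
	return chosen;
}
```

Because the walk stops as soon as the total reaches the batch size, newer
small packs are left alone for a later run.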

diff --git a/midx.c b/midx.c
index 4caf148464..3718e78132 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1116,7 +1117,113 @@ int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+	timestamp_t mtime;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct time_and_id *a, *b;
+
+	a = (const struct time_and_id *)a_;
+	b = (const struct time_and_id *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
 	return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct time_and_id *pack_ti;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_ti[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_ti[i].mtime = m->packs[i]->mtime;
+	}
+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_ti[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+
+		if (!p)
+			continue;
+		if (p->pack_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += p->pack_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	free(pack_ti);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c23e930a5d..3cc9c918d5 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -421,4 +421,29 @@ test_expect_success 'repack does not create any packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 5 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list
+	)
+'
+
 test_done
-- 
gitgitgadget


* Re: [PATCH 1/5] multi-pack-index: prepare for 'expire' verb
  2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
@ 2018-12-11  1:35   ` Stefan Beller
  2018-12-11  1:59     ` SZEDER Gábor
  0 siblings, 1 reply; 89+ messages in thread
From: Stefan Beller @ 2018-12-11  1:35 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The multi-pack-index tracks objects in a collection of pack-files.
> Only one copy of each object is indexed, using the modified time
> of the pack-files to determine tie-breakers. It is possible to
> have a pack-file with no referenced objects because all objects
> have a duplicate in a newer pack-file.
>
> Introduce a new 'expire' verb to the multi-pack-index builtin.
> This verb will delete these unused pack-files and rewrite the
> multi-pack-index to no longer refer to those files. More details
> about the specifics will follow as the method is implemented.
>
> Add a test that verifies the 'expire' verb is correctly wired,
> but will still be valid when the verb is implemented. Specifically,
> create a set of packs that should all have referenced objects and
> should not be removed during an 'expire' operation.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-multi-pack-index.txt |  8 +++++
>  builtin/multi-pack-index.c             |  4 ++-
>  midx.c                                 |  5 +++
>  midx.h                                 |  1 +
>  t/t5319-multi-pack-index.sh            | 47 ++++++++++++++++++++++++++
>  5 files changed, 64 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index f7778a2c85..822d83c845 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -31,6 +31,14 @@ verify::
>         When given as the verb, verify the contents of the MIDX file
>         at `<dir>/packs/multi-pack-index`.
>
> +expire::
> +       When given as the verb,

Can it be given in another way? Or rather "if the verb is expire",
then ...
(I just checked the current man page, and both write and verify use
this pattern as well. I find it strange as this first part of the sentence
conveys little information, but is repeated 3 times now (once for
each verb)).

Maybe we can restructure the man page to have it more like

    The following verbs are available:
    +write::
    +    create a new MIDX file, writing to <dir>/packs/multi-pack-index.
    +
    +verify::
    +    verify the contents ...

>               delete the pack-files that are tracked
> +       by the MIDX file at `<dir>/packs/multi-pack-index`

We're mentioning the location a lot. Could we keep a more detailed
note in --object-dir and not go into such detail in the verbs section?
(Then the paragraph here would be more concise. That makes it
easier to understand)

>              but have
> +       no objects referenced by the MIDX. All objects in these pack-
> +       files have another copy in a more-recently modified pack-file.

The second sentence reads like a reason why the first is a good
thing to have, so maybe use a subordinating conjunction
("because") to tell the reader that.

> +       Rewrite the MIDX file afterward to remove all references to
> +       these pack-files.

Makes sense.


>
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 70926b5bc0..948effc1ee 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
>                 "incorrect object offset"
>  '
>
> +test_expect_success 'setup expire tests' '
> +       mkdir dup &&
> +       (
> +               cd dup &&
> +               git init &&
> +               for i in $(test_seq 1 20)
> +               do
> +                       test_commit $i
> +               done &&
> +               git branch A HEAD &&
> +               git branch B HEAD~8 &&
> +               git branch C HEAD~13 &&
> +               git branch D HEAD~16 &&
> +               git branch E HEAD~18 &&
> +               git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
> +               refs/heads/E
> +               EOF
> +               git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
> +               refs/heads/D
> +               ^refs/heads/E
> +               EOF
> +               git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
> +               refs/heads/C
> +               ^refs/heads/D
> +               EOF
> +               git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
> +               refs/heads/B
> +               ^refs/heads/C
> +               EOF
> +               git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
> +               refs/heads/A
> +               ^refs/heads/B
> +               EOF
> +               git multi-pack-index write
> +       )
> +'
> +
> +test_expect_success 'expire does not remove any packs' '

With the clever setup, this test is already correctly testing
what the docs claim it should do, despite having
no implementation. Nice.
Although the core issue is that the packs are disjoint sets
of objects, so maybe s/any packs/required packs/ or such?


* Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb
  2018-12-10 18:06 ` [PATCH 4/5] multi-pack-index: prepare 'repack' verb Derrick Stolee via GitGitGadget
@ 2018-12-11  1:54   ` Stefan Beller
  2018-12-11 12:45     ` Derrick Stolee
  0 siblings, 1 reply; 89+ messages in thread
From: Stefan Beller @ 2018-12-11  1:54 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> In an environment where the multi-pack-index is useful, it is due
> to many pack-files and an inability to repack the object store
> into a single pack-file. However, it is likely that many of these
> pack-files are rather small, and could be repacked into a slightly
> larger pack-file without too much effort. It may also be important
> to ensure the object store is highly available and the repack
> operation does not interrupt concurrent git commands.
>
> Introduce a 'repack' verb to 'git multi-pack-index' that takes a
> '--batch-size' option. The verb will inspect the multi-pack-index
> for referenced pack-files whose size is smaller than the batch
> size, until collecting a list of pack-files whose sizes sum to
> larger than the batch size. Then, a new pack-file will be created
> containing the objects from those pack-files that are referenced
> by the multi-pack-index. The resulting pack is likely to actually
> be smaller than the batch size due to compression and the fact
> that there may be objects in the pack-files that have duplicate
> copies in other pack-files.

This highlights an optimization problem: How do we pick the
batches optimally?
Ideally we'd pick packs that have an overlap of many
same objects to dedup them completely, next best would
be to have objects that are very similar, such that they delta
very well.
(Assuming that the sum of the resulting pack sizes is a metric
we'd want to optimize for eventually)

For now it seems we just take a random cut of "small" packs.

>
> The current change introduces the command-line arguments, and we
> add a test that ensures we parse these options properly. Since
> we specify a small batch size, we will guarantee that future
> implementations do not change the list of pack-files.

This is another clever trick that makes the test correct
despite no implementation yet. :-)

> +repack::
> +       When given as the verb, collect a batch of pack-files whose
> +       size are all at most the size given by --batch-size,

okay.

>  but
> +       whose sizes sum to larger than --batch-size.

or less if there are not enough packs.

Now it would be interesting if we can specify an upper bound
(e.g. my laptop has 8GB of ram, can I use this incremental
repacking optimally by telling git to make batches of at most
7.5G), the answer seems to follow...

>   The batch is
> +       selected by greedily adding small pack-files starting with
> +       the oldest pack-files that fit the size. Create a new pack-
> +       file containing the objects the multi-pack-index indexes
> +       into thos pack-files, and rewrite the multi-pack-index to

those

> +       contain that pack-file. A later run of 'git multi-pack-index
> +       expire' will delete the pack-files that were part of this
> +       batch.

... but the optimization seems to be rather about getting rid
of the oldest packs first instead of getting as close to the batch
size as possible (e.g. another way to look at this is to "find the
subset of all packs, each smaller than the batch size, whose sum
is the smallest value above the batch size").

I guess that the strategy of picking the oldest is just easiest
to implement and should be sufficient for now, but memory
bounds might be interesting to keep in mind, just as the
optimal packing from above.


* Re: [PATCH 1/5] multi-pack-index: prepare for 'expire' verb
  2018-12-11  1:35   ` Stefan Beller
@ 2018-12-11  1:59     ` SZEDER Gábor
  2018-12-11 12:32       ` Derrick Stolee
  0 siblings, 1 reply; 89+ messages in thread
From: SZEDER Gábor @ 2018-12-11  1:59 UTC (permalink / raw)
  To: Stefan Beller
  Cc: gitgitgadget, git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On Mon, Dec 10, 2018 at 05:35:28PM -0800, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > The multi-pack-index tracks objects in a collection of pack-files.
> > Only one copy of each object is indexed, using the modified time
> > of the pack-files to determine tie-breakers. It is possible to
> > have a pack-file with no referenced objects because all objects
> > have a duplicate in a newer pack-file.
> >
> > Introduce a new 'expire' verb to the multi-pack-index builtin.
> > This verb will delete these unused pack-files and rewrite the
> > multi-pack-index to no longer refer to those files. More details
> > about the specifics will follow as the method is implemented.
> >
> > Add a test that verifies the 'expire' verb is correctly wired,
> > but will still be valid when the verb is implemented. Specifically,
> > create a set of packs that should all have referenced objects and
> > should not be removed during an 'expire' operation.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  Documentation/git-multi-pack-index.txt |  8 +++++
> >  builtin/multi-pack-index.c             |  4 ++-
> >  midx.c                                 |  5 +++
> >  midx.h                                 |  1 +
> >  t/t5319-multi-pack-index.sh            | 47 ++++++++++++++++++++++++++
> >  5 files changed, 64 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> > index f7778a2c85..822d83c845 100644
> > --- a/Documentation/git-multi-pack-index.txt
> > +++ b/Documentation/git-multi-pack-index.txt
> > @@ -31,6 +31,14 @@ verify::
> >         When given as the verb, verify the contents of the MIDX file
> >         at `<dir>/packs/multi-pack-index`.
> >
> > +expire::
> > +       When given as the verb,
> 
> Can it be given in another way? Or rather "if the verb is expire",
> then ...
> (I just checked the current man page, and both write and verify use
> this pattern as well. I find it strange as this first part of the sentence
> conveys little information, but is repeated 3 times now (once for
> each verb)).
> 
> Maybe we can restructure the man page to have it more like
> 
>     The following verbs are available:
>     +write::
>     +    create a new MIDX file, writing to <dir>/packs/multi-pack-index.
>     +
>     +verify::
>     +    verify the contents ...

I think a s/verb/subcommand/ would help a lot, too, because that's
what we call it everywhere else.



* Re: [PATCH 5/5] midx: implement midx_repack()
  2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2018-12-11  2:32   ` Stefan Beller
  2018-12-11 13:00     ` Derrick Stolee
  2018-12-12  7:40   ` Junio C Hamano
  1 sibling, 1 reply; 89+ messages in thread
From: Stefan Beller @ 2018-12-11  2:32 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> To repack using a multi-pack-index, first sort all pack-files by
> their modified time. Second, walk those pack-files from oldest
> to newest, adding the packs to a list if they are smaller than the
> given pack-size. Finally, collect the objects from the multi-pack-
> index that are in those packs and send them to 'git pack-objects'.

Makes sense.

With this operation we only coalesce some packfiles into a new
pack file. So to perform the "complete" repack this command
has to be run repeatedly until there is at most one packfile
left that is smaller than batch size.

Imagine the following scenario:

  There are 5 packfiles A, B, C, D, E,
  created last Monday thru Friday (A is oldest, E youngest).
  The sizes are [A=4, B=6, C=5, D=5, E=4]

  You'd issue a repack with batch size=10, such that
  A and B would be repacked into F, which is
  created today; its size is less than or equal to 10.

  You issue another repack tomorrow, which then would
  coalesce C and D into G, which is
  dated tomorrow; its size is less than or equal to 10 as well.

  You issue a third repack, which then takes E
  (as it is the oldest) and would probably find F as the
  next oldest (assuming it is less than 10), to repack
  into H.

  H is then composed of A, B and E, and G is C+D.

In a way these repacks, always picking up the oldest,
sound like you "roll forward" objects into new packs.
As the new packs are newest (we have no packs from
the future), we'd cycle through different packs to look at
for packing on each repacking.

It is however more likely that content is more similar
on a temporal basis. (e.g. I am boldly claiming that
[ABC, DE] would take less space than [ABE, CD]
as produced above).

(The obvious solution to this hypothetical would be
to backdate the resulting pack to the youngest pack
that is input to the new pack, but I dislike fudging with
the time a file is created/touched, so let's not go there)

Would the object count make sense as input instead of
the pack date?


> While first designing a 'git multi-pack-index repack' operation, I
> started by collecting the batches based on the size of the objects
> instead of the size of the pack-files. This allows repacking a
> large pack-file that has very few referencd objects. However, this

referenced

> came at a significant cost of parsing pack-files instead of simply
> reading the multi-pack-index and getting the file information for
> the pack-files. This object-size idea could be a direction for
> future expansion in this area.

Ah, that also explains why the above idea is toast.

Would it make sense to extend or annotate the midx file
to give hints at which packs are easy to combine?

I guess such an "annotation worker" could run in a separate
thread / pool with the lowest priority as this seems like a
decent fallback for the lack of any better information how
to pick the packfiles.

Thanks,
Stefan


* Re: [PATCH 1/5] multi-pack-index: prepare for 'expire' verb
  2018-12-11  1:59     ` SZEDER Gábor
@ 2018-12-11 12:32       ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2018-12-11 12:32 UTC (permalink / raw)
  To: SZEDER Gábor, Stefan Beller
  Cc: gitgitgadget, git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On 12/10/2018 8:59 PM, SZEDER Gábor wrote:
> On Mon, Dec 10, 2018 at 05:35:28PM -0800, Stefan Beller wrote:
>> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>>> +expire::
>>> +       When given as the verb,
>> Can it be given in another way? Or rather "if the verb is expire",
>> then ...
>> (I just checked the current man page, and both write and verify use
>> this pattern as well. I find it strange as this first part of the sentence
>> conveys little information, but is repeated 3 times now (once for
>> each verb)).
>>
>> Maybe we can restructure the man page to have it more like
>>
>>      The following verbs are available:
>>      +write::
>>      +    create a new MIDX file, writing to <dir>/packs/multi-pack-index.
>>      +
>>      +verify::
>>      +    verify the contents ...
> I think a s/verb/subcommand/ would help a lot, too, because that's
> what we call it everywhere else.

Thanks, both. V2 will include a new patch that reformats the doc to use 
these suggestions, then extend it for the new subcommand.

-Stolee



* Re: [PATCH 4/5] multi-pack-index: prepare 'repack' verb
  2018-12-11  1:54   ` Stefan Beller
@ 2018-12-11 12:45     ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2018-12-11 12:45 UTC (permalink / raw)
  To: Stefan Beller, gitgitgadget
  Cc: git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On 12/10/2018 8:54 PM, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> In an environment where the multi-pack-index is useful, it is due
>> to many pack-files and an inability to repack the object store
>> into a single pack-file. However, it is likely that many of these
>> pack-files are rather small, and could be repacked into a slightly
>> larger pack-file without too much effort. It may also be important
>> to ensure the object store is highly available and the repack
>> operation does not interrupt concurrent git commands.
>>
>> Introduce a 'repack' verb to 'git multi-pack-index' that takes a
>> '--batch-size' option. The verb will inspect the multi-pack-index
>> for referenced pack-files whose size is smaller than the batch
>> size, until collecting a list of pack-files whose sizes sum to
>> larger than the batch size. Then, a new pack-file will be created
>> containing the objects from those pack-files that are referenced
>> by the multi-pack-index. The resulting pack is likely to actually
>> be smaller than the batch size due to compression and the fact
>> that there may be objects in the pack-files that have duplicate
>> copies in other pack-files.
> This highlights an optimization problem: How do we pick the
> batches optimally?
> Ideally we'd pick packs that have an overlap of many
> same objects to dedup them completely, next best would
> be to have objects that are very similar, such that they delta
> very well.
> (Assuming that the sum of the resulting pack sizes is a metric
> we'd want to optimize for eventually)
>
> For now it seems we just take a random cut of "small" packs.
>
>> The current change introduces the command-line arguments, and we
>> add a test that ensures we parse these options properly. Since
>> we specify a small batch size, we will guarantee that future
>> implementations do not change the list of pack-files.
> This is another clever trick that makes the test correct
> despite no implementation yet. :-)
>
>> +repack::
>> +       When given as the verb, collect a batch of pack-files whose
>> +       size are all at most the size given by --batch-size,
> okay.
>
>>   but
>> +       whose sizes sum to larger than --batch-size.
> or less if there are not enough packs.

If there are not enough packs to reach the batch size, then we do nothing.

> Now it would be interesting if we can specify an upper bound
> (e.g. my laptop has 8GB of ram, can I use this incremental
> repacking optimally by telling git to make batches of at most
> 7.5G), the answer seems to follow...

Well, this gets us to the "unsigned long" problem and how it pervades 
the OPT_MAGNITUDE() code and things that use that kind of parameter. 
This means that Windows users can specify a maximum of (1 << 32) - 1. 
This size is intended to be large enough to make a reasonable change in 
the pack organization without rewriting the entire repo (e.g. 1GB). If 
there is another word than "repack" that means "collect objects from a 
subset of pack-files and place them into a new pack-file, doing a small 
amount of redeltification in the process" then I am open to suggestions.

I tried (briefly) to fix this dependence, but it got too large for me to 
handle at the same time as this change. I'll consider revisiting it 
again later.

>>    The batch is
>> +       selected by greedily adding small pack-files starting with
>> +       the oldest pack-files that fit the size. Create a new pack-
>> +       file containing the objects the multi-pack-index indexes
>> +       into thos pack-files, and rewrite the multi-pack-index to
> those
>
>> +       contain that pack-file. A later run of 'git multi-pack-index
>> +       expire' will delete the pack-files that were part of this
>> +       batch.
> ... but the optimization seems to be rather about getting rid
> of the oldest packs first instead of getting as close to the batch
> size as possible (e.g. another way to look at this is to "find the
> subset of all packs, each smaller than the batch size, whose sum
> is the smallest value above the batch size").

You are describing the subset-sum problem, with an additional 
optimization component. While there are dynamic programming approaches 
that are usually effective (if the sum is small), this problem is 
NP-complete, and hence could lead to complications.

> I guess that the strategy of picking the oldest is just easiest
> to implement and should be sufficient for now, but memory
> bounds might be interesting to keep in mind, just as the
> optimal packing from above.

It is easy to implement, and fast. Further, we do have a heuristic that 
the pack modified time correlates with the time the objects were 
introduced to the repository and hence may compress well when placed 
together.

Thanks,

-Stolee



* Re: [PATCH 5/5] midx: implement midx_repack()
  2018-12-11  2:32   ` Stefan Beller
@ 2018-12-11 13:00     ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2018-12-11 13:00 UTC (permalink / raw)
  To: Stefan Beller, gitgitgadget
  Cc: git, Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Derrick Stolee

On 12/10/2018 9:32 PM, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> To repack using a multi-pack-index, first sort all pack-files by
>> their modified time. Second, walk those pack-files from oldest
>> to newest, adding the packs to a list if they are smaller than the
>> given pack-size. Finally, collect the objects from the multi-pack-
>> index that are in those packs and send them to 'git pack-objects'.
> Makes sense.
>
> With this operation we only coalesce some packfiles into a new
> pack file. So to perform the "complete" repack this command
> has to be run repeatedly until there is at most one packfile
> left that is smaller than batch size.

Well, the batch size essentially means "If a pack-file is larger than 
<size>, then leave it be. I'm happy with packs that large." This assumes 
that the reason the pack is that large is because it was already 
combined with other packs or contains a lot of objects.

>
> Imagine the following scenario:
>
>    There are 5 packfiles A, B, C, D, E,
>    created last Monday thru Friday (A is oldest, E youngest).
>    The sizes are [A=4, B=6, C=5, D=5, E=4]
>
>    You'd issue a repack with batch size=10, such that
>    A and B would be repacked into F, which is
>    created today; its size is less than or equal to 10.
>
>    You issue another repack tomorrow, which then would
>    coalesce C and D into G, which is
>    dated tomorrow; its size is less than or equal to 10 as well.
>
>    You issue a third repack, which then takes E
>    (as it is the oldest) and would probably find F as the
>    next oldest (assuming it is less than 10), to repack
>    into H.
>
>    H is then composed of A, B and E, and G is C+D.
>
> In a way these repacks, always picking up the oldest,
> sound like you "roll forward" objects into new packs.
> As the new packs are newest (we have no packs from
> the future), we'd cycle through different packs to look at
> for packing on each repacking.
>
> It is however more likely that content is more similar
> on a temporal basis. (e.g. I am boldly claiming that
> [ABC, DE] would take less space than [ABE, CD]
> as produced above).
>
> (The obvious solution to this hypothetical would be
> to backdate the resulting pack to the youngest pack
> that is input to the new pack, but I dislike fudging with
> the time a file is created/touched, so let's not go there)

This raises a good point about what happens when we "roll over" into the 
"repacked" packs.

I'm not claiming that this is an optimal way to save space, but it is a way
to incrementally collect small packs into slightly larger packs, all
while not interrupting concurrent Git commands. Reducing pack count 
improves data locality, which is my goal here. In our environment, we do 
see reduced space as a benefit, even if it is not optimal.

>
> Would the object count make sense as input instead of
> the pack date?
>
>
>> While first designing a 'git multi-pack-index repack' operation, I
>> started by collecting the batches based on the size of the objects
>> instead of the size of the pack-files. This allows repacking a
>> large pack-file that has very few referencd objects. However, this
> referenced
>
>> came at a significant cost of parsing pack-files instead of simply
>> reading the multi-pack-index and getting the file information for
>> the pack-files. This object-size idea could be a direction for
>> future expansion in this area.
> Ah, that also explains why the above idea is toast.
>
> Would it make sense to extend or annotate the midx file
> to give hints at which packs are easy to combine?
>
> I guess such an "annotation worker" could run in a separate
> thread / pool with the lowest priority as this seems like a
> decent fallback for the lack of any better information how
> to pick the packfiles.

One idea I had earlier (and is in 
Documentation/technical/multi-pack-index.txt) is to have the midx track 
metadata about pack-files. We could avoid this "rollover" problem by 
tracking which packs were repacked using this mechanism. This could 
create a "pack generation" value, and we could collect a batch of packs 
that have the same generation. This does seem a bit overcomplicated for
the potential benefit, and could crowd out better uses of that metadata
concept. For instance, we could use the metadata to track the
information given by ".keep" and ".promisor" files.

Thanks,

-Stolee



* Re: [PATCH 5/5] midx: implement midx_repack()
  2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
  2018-12-11  2:32   ` Stefan Beller
@ 2018-12-12  7:40   ` Junio C Hamano
  2018-12-13  4:23     ` Junio C Hamano
  1 sibling, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2018-12-12  7:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&

awk is capable of remembering $5 from each line of input, sorting
them and picking the second smallest element from it, isn't it?

> +		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&

... or incrementing the number by one, before reporting, for that
matter.



* Re: [PATCH 5/5] midx: implement midx_repack()
  2018-12-12  7:40   ` Junio C Hamano
@ 2018-12-13  4:23     ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2018-12-13  4:23 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
>
> awk is capable of remembering $5 from each line of input, sorting
> them and picking the second smallest element from it, isn't it?
>
>> +		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
>
> ... or incrementing the number by one, before reporting, for that
> matter.


Oops, please disregard.  My awk is rusty.  Unless we are willing to
rely on gawk, which we are not, it is not practical to sort inside
awk.



* [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28 ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 1/7] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
                     ` (7 more replies)
  5 siblings, 8 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano

The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references to those packs.
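
   The batch selection described for 'repack' can be sketched in
   standalone C. This is only an illustration under stated assumptions:
   struct pack_info and select_batch() are hypothetical names that do not
   exist in Git, "below the batch size" is read as a strict comparison,
   and the loop stops as soon as the running total reaches the batch
   size. The implementation in the patches may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pack-file; not a Git structure. */
struct pack_info {
	const char *name;
	uint64_t size;	/* on-disk size in bytes */
	int64_t mtime;	/* modification time; smaller means older */
};

static int cmp_mtime(const void *a, const void *b)
{
	const struct pack_info *pa = a, *pb = b;

	return (pa->mtime > pb->mtime) - (pa->mtime < pb->mtime);
}

/*
 * Sort 'packs' oldest first, then set selected[i] = 1 for each pack
 * smaller than 'batch_size', stopping once the selected sizes add up
 * to at least 'batch_size'. Returns the number of selected packs.
 */
size_t select_batch(struct pack_info *packs, size_t nr,
		    uint64_t batch_size, int *selected)
{
	uint64_t total = 0;
	size_t i, count = 0;

	assert(packs && selected);
	qsort(packs, nr, sizeof(*packs), cmp_mtime);

	for (i = 0; i < nr; i++)
		selected[i] = 0;

	for (i = 0; i < nr && total < batch_size; i++) {
		if (packs[i].size >= batch_size)
			continue;	/* too large to be part of a batch */
		selected[i] = 1;
		total += packs[i].size;
		count++;
	}
	return count;
}
```

   With packs of 30, 500, 80 and 40 bytes (listed oldest to newest) and
   a batch size of 100, the sketch selects the 30- and 80-byte packs,
   skips the 500-byte pack, and never reaches the 40-byte one.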
   
   

This allows us to create a new pattern for repacking objects: run 'repack',
wait until all Git commands that started before that 'repack' have
finished, and then run 'expire'. This approach has some advantages over the
existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
    of repacking all reachable objects. We can also limit ourselves to the
    objects that do not appear in newer pack-files.
    
    
 2. Highly Available. By adding a new pack-file (and not deleting the old
    pack-files) we do not interrupt concurrent Git commands, and do not
    suffer performance degradation. By expiring only pack-files that have no
    referenced objects, we know that Git commands that are doing normal
    object lookups* will not be interrupted.
    
    
* Note: if someone concurrently runs a Git command that uses
  get_all_packs(), then that command could try to read the pack-files and
  pack-indexes that we are deleting during an expire command. Such
  commands are usually related to object maintenance (e.g. fsck, gc,
  pack-objects) or to less-often-used features (e.g. fast-import,
  http-backend, server-info).
    
    

We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will remain, each
with fewer than 100 referenced objects. The 'repack' command (with a batch
size of 1-3 GB, probably) can condense the remaining packs in commands that
run for 1-3 minutes at a time. Since the daily packs range from 100 to
250 MB, we will also combine and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path(), to remove packfiles, but with an
   additional check for a .keep file. This borrows logic from
   builtin/repack.c.
   
   
 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)
   
   

Thanks, -Stolee

Derrick Stolee (7):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: refactor permutation logic
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()

 Documentation/git-multi-pack-index.txt |  26 ++-
 builtin/multi-pack-index.c             |  12 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 217 +++++++++++++++++++++++--
 midx.h                                 |   2 +
 packfile.c                             |  28 ++++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            |  98 +++++++++++
 8 files changed, 376 insertions(+), 28 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/92

Range-diff vs v1:

 -:  ---------- > 1:  a697df120c repack: refactor pack deletion for future use
 -:  ---------- > 2:  55df6b20ff Docs: rearrange subcommands for multi-pack-index
 1:  1e34b48a20 ! 3:  2529afe89e multi-pack-index: prepare for 'expire' verb
     @@ -1,6 +1,6 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    multi-pack-index: prepare for 'expire' verb
     +    multi-pack-index: prepare for 'expire' subcommand
      
          The multi-pack-index tracks objects in a collection of pack-files.
          Only one copy of each object is indexed, using the modified time
     @@ -8,12 +8,12 @@
          have a pack-file with no referenced objects because all objects
          have a duplicate in a newer pack-file.
      
     -    Introduce a new 'expire' verb to the multi-pack-index builtin.
     -    This verb will delete these unused pack-files and rewrite the
     +    Introduce a new 'expire' subcommand to the multi-pack-index builtin.
     +    This subcommand will delete these unused pack-files and rewrite the
          multi-pack-index to no longer refer to those files. More details
          about the specifics will follow as the method is implemented.
      
     -    Add a test that verifies the 'expire' verb is correctly wired,
     +    Add a test that verifies the 'expire' subcommand is correctly wired,
          but will still be valid when the verb is implemented. Specifically,
          create a set of packs that should all have referenced objects and
          should not be removed during an 'expire' operation.
     @@ -24,16 +24,13 @@
      --- a/Documentation/git-multi-pack-index.txt
      +++ b/Documentation/git-multi-pack-index.txt
      @@
     - 	When given as the verb, verify the contents of the MIDX file
     - 	at `<dir>/packs/multi-pack-index`.
     + verify::
     + 	Verify the contents of the MIDX file.
       
      +expire::
     -+	When given as the verb, delete the pack-files that are tracked
     -+	by the MIDX file at `<dir>/packs/multi-pack-index` but have
     -+	no objects referenced by the MIDX. All objects in these pack-
     -+	files have another copy in a more-recently modified pack-file.
     -+	Rewrite the MIDX file afterward to remove all references to
     -+	these pack-files.
     ++	Delete the pack-files that are tracked 	by the MIDX file, but
     ++	have no objects referenced by the MIDX. Rewrite the MIDX file
     ++	afterward to remove all references to these pack-files.
      +
       
       EXAMPLES
 2:  8f496ccb46 = 4:  0c29a242fe midx: refactor permutation logic
 3:  244bdf2a6f ! 5:  1c4af93f5e multi-pack-index: implement 'expire' verb
     @@ -75,6 +75,7 @@
      +					drop_index++;
      +					i--;
      +					missing_drops++;
     ++					continue;
      +				}
      +			}
      +
     @@ -114,8 +115,6 @@
       {
      -	return 0;
      +	uint32_t i, *count, result = 0;
     -+	size_t dirlen;
     -+	struct strbuf buf = STRBUF_INIT;
      +	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
      +	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
      +
     @@ -128,31 +127,27 @@
      +		count[pack_int_id]++;
      +	}
      +
     -+	strbuf_addstr(&buf, object_dir);
     -+	strbuf_addstr(&buf, "/pack/");
     -+	dirlen = buf.len;
     -+
      +	for (i = 0; i < m->num_packs; i++) {
     ++		char *pack_name;
     ++
      +		if (count[i])
      +			continue;
      +
     -+		if (m->packs[i]) {
     -+			close_pack(m->packs[i]);
     -+			m->packs[i] = NULL;
     -+		}
     ++		if (prepare_midx_pack(m, i))
     ++			continue;
      +
     -+		string_list_insert(&packs_to_drop, m->pack_names[i]);
     ++		if (m->packs[i]->pack_keep)
     ++			continue;
      +
     -+		strbuf_setlen(&buf, dirlen);
     -+		strbuf_addstr(&buf, m->pack_names[i]);
     -+		unlink(buf.buf);
     ++		pack_name = xstrdup(m->packs[i]->pack_name);
     ++		close_pack(m->packs[i]);
     ++		FREE_AND_NULL(m->packs[i]);
      +
     -+		strip_suffix_mem(buf.buf, &buf.len, "idx");
     -+		strbuf_addstr(&buf, "pack");
     -+		unlink(buf.buf);
     ++		string_list_insert(&packs_to_drop, m->pack_names[i]);
     ++		unlink_pack_path(pack_name, 0);
     ++		free(pack_name);
      +	}
      +
     -+	strbuf_release(&buf);
      +	free(count);
      +
      +	if (packs_to_drop.nr)
 4:  72b2139591 ! 6:  af08e21c97 multi-pack-index: prepare 'repack' verb
     @@ -1,6 +1,6 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    multi-pack-index: prepare 'repack' verb
     +    multi-pack-index: prepare 'repack' subcommand
      
          In an environment where the multi-pack-index is useful, it is due
          to many pack-files and an inability to repack the object store
     @@ -10,16 +10,16 @@
          to ensure the object store is highly available and the repack
          operation does not interrupt concurrent git commands.
      
     -    Introduce a 'repack' verb to 'git multi-pack-index' that takes a
     -    '--batch-size' option. The verb will inspect the multi-pack-index
     -    for referenced pack-files whose size is smaller than the batch
     -    size, until collecting a list of pack-files whose sizes sum to
     -    larger than the batch size. Then, a new pack-file will be created
     -    containing the objects from those pack-files that are referenced
     -    by the multi-pack-index. The resulting pack is likely to actually
     -    be smaller than the batch size due to compression and the fact
     -    that there may be objects in the pack-files that have duplicate
     -    copies in other pack-files.
     +    Introduce a 'repack' subcommand to 'git multi-pack-index' that
     +    takes a '--batch-size' option. The verb will inspect the
     +    multi-pack-index for referenced pack-files whose size is smaller
     +    than the batch size, until collecting a list of pack-files whose
     +    sizes sum to larger than the batch size. Then, a new pack-file
     +    will be created containing the objects from those pack-files that
     +    are referenced by the multi-pack-index. The resulting pack is
     +    likely to actually be smaller than the batch size due to
     +    compression and the fact that there may be objects in the pack-
     +    files that have duplicate copies in other pack-files.
      
          The current change introduces the command-line arguments, and we
          add a test that ensures we parse these options properly. Since
     @@ -32,20 +32,19 @@
      --- a/Documentation/git-multi-pack-index.txt
      +++ b/Documentation/git-multi-pack-index.txt
      @@
     - 	Rewrite the MIDX file afterward to remove all references to
     - 	these pack-files.
     + 	have no objects referenced by the MIDX. Rewrite the MIDX file
     + 	afterward to remove all references to these pack-files.
       
      +repack::
     -+	When given as the verb, collect a batch of pack-files whose
     -+	size are all at most the size given by --batch-size, but
     -+	whose sizes sum to larger than --batch-size. The batch is
     -+	selected by greedily adding small pack-files starting with
     -+	the oldest pack-files that fit the size. Create a new pack-
     -+	file containing the objects the multi-pack-index indexes
     -+	into thos pack-files, and rewrite the multi-pack-index to
     -+	contain that pack-file. A later run of 'git multi-pack-index
     -+	expire' will delete the pack-files that were part of this
     -+	batch.
     ++	Collect a batch of pack-files whose size are all at most the
     ++	size given by --batch-size, but whose sizes sum to larger
     ++	than --batch-size. The batch is selected by greedily adding
     ++	small pack-files starting with the oldest pack-files that fit
     ++	the size. Create a new pack-file containing the objects the
     ++	multi-pack-index indexes into those pack-files, and rewrite
     ++	the multi-pack-index to contain that pack-file. A later run
     ++	of 'git multi-pack-index expire' will delete the pack-files
     ++	that were part of this batch.
      +
       
       EXAMPLES
     @@ -123,7 +122,7 @@
       	)
       '
       
     -+test_expect_success 'repack does not create any packs' '
     ++test_expect_success 'repack with minimum size does not alter existing packs' '
      +	(
      +		cd dup &&
      +		ls .git/objects/pack >expect &&
 5:  41ef671ec8 = 7:  bef7aa007c midx: implement midx_repack()

-- 
gitgitgadget


* [PATCH v2 1/7] repack: refactor pack deletion for future use
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'force_delete' parameter is enabled for the use in repack; the
disabled mode, which respects .keep files, will be important for a
future caller.
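
As a rough illustration of how the sibling file names are derived from
a ".pack" path, here is a minimal standalone sketch. It deliberately
avoids Git's strbuf API; pack_sibling_path() is a hypothetical helper
and not part of this patch. Checking for a ".keep" file then amounts
to calling access() on the path produced for that extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write into 'out' the path of the file with extension 'ext' that
 * belongs to the pack at 'pack_name'. A trailing ".pack" is stripped
 * from 'pack_name' first, if present.
 * Returns 0 on success, -1 if the result does not fit in 'out'.
 */
int pack_sibling_path(const char *pack_name, const char *ext,
		      char *out, size_t out_len)
{
	size_t base_len = strlen(pack_name);
	const size_t suffix_len = strlen(".pack");
	int n;

	assert(out && out_len > 0);

	if (base_len >= suffix_len &&
	    !strcmp(pack_name + base_len - suffix_len, ".pack"))
		base_len -= suffix_len;

	n = snprintf(out, out_len, "%.*s%s", (int)base_len, pack_name, ext);
	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Calling the helper once per extension (".pack", ".idx", ".keep",
".bitmap", ".promisor") yields the five paths that the loop in the
patch unlinks.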

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 14 ++------------
 packfile.c       | 28 ++++++++++++++++++++++++++++
 packfile.h       |  7 +++++++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-	int i;
 	struct strbuf buf = STRBUF_INIT;
-	size_t plen;
-
-	strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-	plen = buf.len;
-
-	for (i = 0; i < ARRAY_SIZE(exts); i++) {
-		strbuf_setlen(&buf, plen);
-		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
-	}
+	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
 	}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	int i;
+	struct strbuf buf = STRBUF_INIT;
+	size_t plen;
+
+	strbuf_addstr(&buf, pack_name);
+	strip_suffix_mem(buf.buf, &buf.len, ".pack");
+	plen = buf.len;
+
+	if (!force_delete) {
+		strbuf_addstr(&buf, ".keep");
+		if (!access(buf.buf, F_OK)) {
+			strbuf_release(&buf);
+			return;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		strbuf_setlen(&buf, plen);
+		strbuf_addstr(&buf, exts[i]);
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
gitgitgadget



* [PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 1/7] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -23,13 +23,13 @@ OPTIONS
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-	When given as the verb, write a new MIDX file to
-	`<dir>/packs/multi-pack-index`.
+	Write a new MIDX file.
 
 verify::
-	When given as the verb, verify the contents of the MIDX file
-	at `<dir>/packs/multi-pack-index`.
+	Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
gitgitgadget



* [PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 1/7] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 4/7] midx: refactor permutation logic Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.
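
To make the tie-breaking concrete, the following standalone sketch
picks, for each object id, the copy in the most recently modified
pack and counts how many objects each pack ends up being referenced
for. It is an illustration only: struct object_copy and
count_referenced() are hypothetical names, object ids are plain
strings, the quadratic loop is chosen for brevity, and the rule for
equal mtimes (the earlier entry wins) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: one copy of an object inside one pack. */
struct object_copy {
	const char *oid;	/* object id, as a string for readability */
	uint32_t pack_id;	/* index of the pack holding this copy */
};

/*
 * Fill counts[p] with the number of objects whose chosen copy lives
 * in pack p. pack_mtime[p] is the modification time of pack p.
 */
void count_referenced(const struct object_copy *copies, size_t nr_copies,
		      const int64_t *pack_mtime, uint32_t nr_packs,
		      uint32_t *counts)
{
	size_t i, j;

	for (i = 0; i < nr_copies; i++)
		assert(copies[i].pack_id < nr_packs);

	memset(counts, 0, nr_packs * sizeof(*counts));

	for (i = 0; i < nr_copies; i++) {
		int wins = 1;

		for (j = 0; j < nr_copies && wins; j++) {
			int64_t mi, mj;

			if (i == j || strcmp(copies[i].oid, copies[j].oid))
				continue;
			mi = pack_mtime[copies[i].pack_id];
			mj = pack_mtime[copies[j].pack_id];
			/* Copy j beats copy i if its pack is newer;
			 * on equal mtimes the earlier entry wins. */
			if (mj > mi || (mj == mi && j < i))
				wins = 0;
		}
		if (wins)
			counts[copies[i].pack_id]++;
	}
}
```

A pack whose count stays at zero is exactly the kind of pack the
'expire' subcommand is meant to delete.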

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
and that will still be valid once the subcommand is implemented.
Specifically, create a set of packs that should all have referenced
objects and should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 47 ++++++++++++++++++++++++++
 5 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
 	Verify the contents of the MIDX file.
 
+expire::
+	Delete the pack-files that are tracked 	by the MIDX file, but
+	have no objects referenced by the MIDX. Rewrite the MIDX file
+	afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 4/7] midx: refactor permutation logic
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2018-12-21 16:28   ` [PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 5/7] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When writing a multi-pack-index, we keep track of an integer
permutation, tracking the list of pack-files that we know about
(both from the existing multi-pack-index and the new pack-files
being introduced) and converting them into a sorted order for
the new multi-pack-index.

In anticipation of dropping pack-files from the existing multi-
pack-index, refactor the logic around how we track this permutation.

First, insert the permutation into the pack_list structure. This
allows us to grow the permutation dynamically as we add packs.

Second, fill the permutation with values corresponding to their
position in the list of pack-files, sorted as follows:

  1. The pack-files in the existing multi-pack-index,
     sorted lexicographically.

  2. The pack-files not in the existing multi-pack-index,
     sorted as discovered from the filesystem.

There is a subtle thing in how we initialize this permutation,
specifically how we use 'i' for the initial value. This will
matter more when we implement the logic for dropping existing
packs, as we will create holes in the ordering.
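
The role of the permutation can be shown with a small standalone
sketch. On entry perm[i] holds the id attached to names[i]; after
sorting the names, perm[id] gives the sorted position of the name
that carried that id. The names below (struct name_id_pair,
sort_names_with_perm()) are hypothetical, and the sketch assumes the
ids are exactly 0..nr-1 with no holes, which is the situation before
any packs are dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct name_id_pair {
	uint32_t id;
	const char *name;
};

static int cmp_pair(const void *a, const void *b)
{
	const struct name_id_pair *pa = a, *pb = b;

	return strcmp(pa->name, pb->name);
}

/*
 * Sort 'names' in place and rewrite 'perm' so that perm[old id] is
 * the new position. Returns 0 on success, -1 on allocation failure.
 */
int sort_names_with_perm(const char **names, uint32_t nr, uint32_t *perm)
{
	struct name_id_pair *pairs;
	uint32_t i;

	if (!nr)
		return 0;
	pairs = malloc(nr * sizeof(*pairs));
	if (!pairs)
		return -1;

	for (i = 0; i < nr; i++) {
		assert(perm[i] < nr);	/* no holes in this sketch */
		pairs[i].id = perm[i];
		pairs[i].name = names[i];
	}

	qsort(pairs, nr, sizeof(*pairs), cmp_pair);

	for (i = 0; i < nr; i++) {
		names[i] = pairs[i].name;
		perm[pairs[i].id] = i;
	}

	free(pairs);
	return 0;
}
```

For the names "pack-c.idx", "pack-a.idx", "pack-b.idx" with ids 0, 1, 2,
the sorted order is a, b, c and the resulting permutation is {2, 0, 1}.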

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..3bd7183a53 100644
--- a/midx.c
+++ b/midx.c
@@ -380,9 +380,11 @@ static size_t write_midx_header(struct hashfile *f,
 struct pack_list {
 	struct packed_git **list;
 	char **names;
+	uint32_t *perm;
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
+	uint32_t alloc_perm;
 	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
@@ -398,6 +400,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
 		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->perm, packs->nr + 1, packs->alloc_perm);
 
 		packs->list[packs->nr] = add_packed_git(full_path,
 							full_path_len,
@@ -417,6 +420,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			return;
 		}
 
+		packs->perm[packs->nr] = packs->nr;
 		packs->names[packs->nr] = xstrdup(file_name);
 		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
@@ -443,7 +447,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 	ALLOC_ARRAY(pairs, nr_packs);
 
 	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
+		pairs[i].pack_int_id = perm[i];
 		pairs[i].pack_name = pack_names[i];
 	}
 
@@ -755,7 +759,6 @@ int write_midx_file(const char *object_dir)
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct pack_list packs;
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
@@ -774,18 +777,22 @@ int write_midx_file(const char *object_dir)
 
 	packs.nr = 0;
 	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
+	packs.alloc_perm = packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
+	packs.perm = NULL;
 	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
 			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
 			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
 
+			packs.perm[packs.nr] = i;
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
 			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
@@ -802,10 +809,9 @@ int write_midx_file(const char *object_dir)
 		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
 					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
+	sort_packs_by_name(packs.names, packs.nr, packs.perm);
 
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.list, packs.perm, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -923,8 +929,8 @@ int write_midx_file(const char *object_dir)
 
 	free(packs.list);
 	free(packs.names);
+	free(packs.perm);
 	free(entries);
-	free(pack_perm);
 	free(midx_name);
 	return 0;
 }
-- 
gitgitgadget



* [PATCH v2 5/7] multi-pack-index: implement 'expire' verb
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2018-12-21 16:28   ` [PATCH v2 4/7] midx: refactor permutation logic Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
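Note (not part of the commit message): the counting step described
above can be sketched in isolation. This is a minimal illustration
only, assuming the multi-pack-index is modelled as a plain array that
maps each object position to the id of the pack selected for it. The
names find_unreferenced_packs() and object_pack_ids are hypothetical
and do not exist in midx.c, where the ids are read with
nth_midxed_pack_int_id().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: object_pack_ids[i] is the pack id that the
 * index selected for object i. Sets unreferenced[p] to 1 for every
 * pack that no object points to, and returns how many such packs
 * were found.
 */
static uint32_t find_unreferenced_packs(const uint32_t *object_pack_ids,
					uint32_t num_objects,
					uint32_t num_packs,
					unsigned char *unreferenced)
{
	uint32_t i, found = 0;
	uint32_t *count = calloc(num_packs ? num_packs : 1, sizeof(*count));

	assert(count);

	/* Count how many indexed objects point at each pack. */
	for (i = 0; i < num_objects; i++) {
		assert(object_pack_ids[i] < num_packs);
		count[object_pack_ids[i]]++;
	}

	/* A pack with a zero count is a candidate for deletion. */
	for (i = 0; i < num_packs; i++) {
		unreferenced[i] = !count[i];
		found += unreferenced[i];
	}

	free(count);
	return found;
}
```

In the patch below, a candidate found this way is still skipped when
the pack cannot be prepared or when it is marked with pack_keep.
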
 midx.c                      | 82 +++++++++++++++++++++++++++++++++++--
 t/t5319-multi-pack-index.sh | 15 +++++++
 2 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 3bd7183a53..043cd1fd97 100644
--- a/midx.c
+++ b/midx.c
@@ -751,7 +751,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -765,6 +766,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -773,7 +775,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
@@ -787,7 +792,25 @@ int write_midx_file(const char *object_dir)
 	ALLOC_ARRAY(packs.perm, packs.alloc_perm);
 
 	if (packs.m) {
+		int drop_index = 0, missing_drops = 0;
 		for (i = 0; i < packs.m->num_packs; i++) {
+			if (packs_to_drop && drop_index < packs_to_drop->nr) {
+				int cmp = strcmp(packs.m->pack_names[i],
+						 packs_to_drop->items[drop_index].string);
+
+				if (!cmp) {
+					drop_index++;
+					continue;
+				} else if (cmp > 0) {
+					error(_("did not see pack-file %s to drop"),
+					      packs_to_drop->items[drop_index].string);
+					drop_index++;
+					i--;
+					missing_drops++;
+					continue;
+				}
+			}
+
 			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
 			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
 			ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
@@ -798,6 +821,12 @@ int write_midx_file(const char *object_dir)
 			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
+
+		if (packs_to_drop && (drop_index < packs_to_drop->nr || missing_drops)) {
+			error(_("did not see all pack-files to drop"));
+			result = 1;
+			goto cleanup;
+		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
@@ -932,7 +961,12 @@ int write_midx_file(const char *object_dir)
 	free(packs.perm);
 	free(entries);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1034,5 +1068,45 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	for (i = 0; i < m->num_packs; i++) {
+		char *pack_name;
+
+		if (count[i])
+			continue;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		if (m->packs[i]->pack_keep)
+			continue;
+
+		pack_name = xstrdup(m->packs[i]->pack_name);
+		close_pack(m->packs[i]);
+		FREE_AND_NULL(m->packs[i]);
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+		unlink_pack_path(pack_name, 0);
+		free(pack_name);
+	}
+
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 948effc1ee..210279a3cf 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -395,4 +395,19 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2018-12-21 16:28   ` [PATCH v2 5/7] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2018-12-21 16:28   ` [PATCH v2 7/7] midx: implement midx_repack() Derrick Stolee via GitGitGadget
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index is most useful in environments that have
many pack-files and cannot repack the object store into a single
pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
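Note (not part of the commit message): the '--batch-size' value is
read with OPT_MAGNITUDE, which git's parse-options documentation
describes as a non-negative integer with an optional k, m or g
scaling suffix. The sketch below illustrates that kind of parsing
under those assumptions. parse_magnitude() is a hypothetical
stand-alone function, not the parser git uses.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical parser: accepts a non-negative decimal integer with an
 * optional k/m/g suffix (powers of 1024). Returns 0 on success and
 * stores the scaled value in *out, or returns -1 on malformed input
 * or overflow.
 */
static int parse_magnitude(const char *arg, unsigned long *out)
{
	char *end;
	unsigned long value, factor = 1;

	assert(arg && out);

	/* Reject empty strings, signs and leading whitespace. */
	if (!isdigit((unsigned char)*arg))
		return -1;

	errno = 0;
	value = strtoul(arg, &end, 10);
	if (errno)
		return -1;

	switch (tolower((unsigned char)*end)) {
	case 'k': factor = 1024UL; end++; break;
	case 'm': factor = 1024UL * 1024; end++; break;
	case 'g': factor = 1024UL * 1024 * 1024; end++; break;
	}

	/* Trailing garbage or a product that does not fit is an error. */
	if (*end || (factor > 1 && value > (unsigned long)-1 / factor))
		return -1;

	*out = value * factor;
	return 0;
}
```

With such a parser, a value like "2k" would be read as 2048 bytes.
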
 Documentation/git-multi-pack-index.txt | 11 +++++++++++
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 11 +++++++++++
 5 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..cc63531cc0 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@ expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Collect a batch of pack-files whose sizes are all at most the
+	size given by --batch-size, but whose sizes sum to larger
+	than --batch-size. The batch is selected by greedily adding
+	small pack-files starting with the oldest pack-files that fit
+	the size. Create a new pack-file containing the objects the
+	multi-pack-index indexes into those pack-files, and rewrite
+	the multi-pack-index to contain that pack-file. A later run
+	of 'git multi-pack-index expire' will delete the pack-files
+	that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' verb"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 043cd1fd97..127c43f7b0 100644
--- a/midx.c
+++ b/midx.c
@@ -1110,3 +1110,8 @@ int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 210279a3cf..f675621080 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -410,4 +410,15 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 7/7] midx: implement midx_repack()
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2018-12-21 16:28   ` [PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
@ 2018-12-21 16:28   ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  7 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-12-21 16:28 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given batch size, until the listed sizes sum to at least the batch
size. Finally, collect the objects from the multi-pack-index that
are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
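Note (not part of the commit message): the batch selection described
above can be sketched without any of git's data structures. This is
an illustration under stated assumptions: 'struct pack_stat' and
select_batch() are hypothetical names standing in for the mtime and
pack_size fields that the patch reads from 'struct packed_git'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-pack data used by the selection. */
struct pack_stat {
	int64_t mtime;
	size_t size;
	uint32_t id;
};

static int by_mtime(const void *a_, const void *b_)
{
	const struct pack_stat *a = a_, *b = b_;

	if (a->mtime < b->mtime)
		return -1;
	return a->mtime > b->mtime;
}

/*
 * Sort packs oldest-first, then mark packs smaller than batch_size
 * until their sizes sum to at least batch_size. Returns the number of
 * packs selected, or 0 (with include[] cleared) when the batch stays
 * below batch_size or would contain fewer than two packs.
 */
static uint32_t select_batch(struct pack_stat *packs, uint32_t nr,
			     size_t batch_size, unsigned char *include)
{
	uint32_t i, selected = 0;
	size_t total = 0;

	qsort(packs, nr, sizeof(*packs), by_mtime);
	for (i = 0; i < nr; i++) {
		assert(packs[i].id < nr);
		include[packs[i].id] = 0;
	}

	for (i = 0; total < batch_size && i < nr; i++) {
		if (packs[i].size >= batch_size)
			continue; /* too large to be worth repacking */
		include[packs[i].id] = 1;
		total += packs[i].size;
		selected++;
	}

	if (total < batch_size || selected < 2) {
		for (i = 0; i < nr; i++)
			include[packs[i].id] = 0;
		return 0;
	}
	return selected;
}
```

A batch size equal to the smallest pack size selects nothing, which
is the situation the 'repack with minimum size' test relies on.
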
 midx.c                      | 109 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  25 +++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 127c43f7b0..f6bc111438 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1111,7 +1112,113 @@ int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+	timestamp_t mtime;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct time_and_id *a, *b;
+
+	a = (const struct time_and_id *)a_;
+	b = (const struct time_and_id *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
 	return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct time_and_id *pack_ti;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_ti[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_ti[i].mtime = m->packs[i]->mtime;
+	}
+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_ti[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+
+		if (!p)
+			continue;
+		if (p->pack_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += p->pack_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	free(pack_ti);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index f675621080..3f5e9ea653 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -421,4 +421,29 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 5 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2018-12-21 16:28   ` [PATCH v2 7/7] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21   ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 1/9] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
                       ` (11 more replies)
  7 siblings, 12 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano

The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references to those packs.
   
   

This allows us to create a new pattern for repacking objects: run 'repack',
then, after enough time has passed that all Git commands that started
before that 'repack' have finished, run 'expire'. This approach has some
advantages over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
    of repacking all reachable objects. We can also limit ourselves to the
    objects that do not appear in newer pack-files.
    
    
 2. Highly Available. By adding a new pack-file (and not deleting the old
    pack-files) we do not interrupt concurrent Git commands, and do not
    suffer performance degradation. By expiring only pack-files that have no
    referenced objects, we know that Git commands that are doing normal
    object lookups* will not be interrupted.
    
    
* Note: if someone concurrently runs a Git command that uses
  get_all_packs(), then that command could try to read the pack-files and
  pack-indexes that we are deleting during an expire command. Such
  commands are usually related to object maintenance (e.g. fsck, gc,
  pack-objects) or to less-often-used features (e.g. fast-import,
  http-backend, server-info).
    
    

We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will remain, each
with fewer than 100 objects left. The 'repack' command (with a batch size of
1-3 GB, probably) can condense the remaining packs in commands that run for
1-3 min at a time. Since the daily packs range from 100-250 MB, we will also
combine and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path() to remove packfiles, but with the
   additional check for a .keep file. This borrows logic from 
   builtin/repack.c.
   
   
 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)
   
   

Updates in V3:

 * There was a bug in the expire logic when simultaneously removing packs
   and adding uncovered packs, specifically around the pack permutation.
   This was hard to see during review because I was using the 'pack_perm'
   array for multiple purposes. First, I was reducing its length, and then I
   was adding to it and re-sorting. In V3, I significantly overhauled the
   logic here, which required some extra commits before implementing
   'expire'. The final commit includes a test that would cover this case.
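
The permutation that this update is about can be sketched on its own.
What follows is a simplified illustration of the idea visible in the
range-diff, not the code from midx.c: 'struct pack_entry' and
build_pack_perm() are hypothetical names, and the sentinel is defined
here as UINT32_MAX, whereas the patch defines PACK_EXPIRED as UINT_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel for a dropped pack (assumption: UINT32_MAX in this sketch). */
#define PACK_EXPIRED UINT32_MAX

/* Hypothetical, trimmed-down version of the per-pack record. */
struct pack_entry {
	uint32_t orig_pack_int_id;
	unsigned expired : 1;
};

/*
 * 'sorted' holds nr packs in their final, name-sorted order.
 * perm[old_id] receives the new id of each surviving pack, or
 * PACK_EXPIRED for a pack that is being dropped. Returns the number
 * of packs that survive.
 */
static uint32_t build_pack_perm(const struct pack_entry *sorted, uint32_t nr,
				uint32_t *perm)
{
	uint32_t i, dropped = 0;

	for (i = 0; i < nr; i++) {
		assert(sorted[i].orig_pack_int_id < nr);
		if (sorted[i].expired) {
			dropped++;
			perm[sorted[i].orig_pack_int_id] = PACK_EXPIRED;
		} else {
			/* Shift down by the number of packs dropped so far. */
			perm[sorted[i].orig_pack_int_id] = i - dropped;
		}
	}
	return nr - dropped;
}
```

Computing the permutation once, after sorting and after marking the
expired packs, keeps new ids contiguous even when packs are added and
removed in the same write.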

Thanks, -Stolee

Derrick Stolee (9):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' verb
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs

 Documentation/git-multi-pack-index.txt |  26 +-
 builtin/multi-pack-index.c             |  12 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 393 ++++++++++++++++++-------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 133 +++++++++
 8 files changed, 497 insertions(+), 118 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/92

Range-diff vs v2:

  1:  a697df120c =  1:  62b393b816 repack: refactor pack deletion for future use
  2:  55df6b20ff =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
  3:  2529afe89e =  3:  f06382b4ae multi-pack-index: prepare for 'expire' subcommand
  4:  0c29a242fe <  -:  ---------- midx: refactor permutation logic
  -:  ---------- >  4:  2a763990ae midx: simplify computation of pack name lengths
  -:  ---------- >  5:  a0d4cc6cb3 midx: refactor permutation logic and pack sorting
  5:  1c4af93f5e !  6:  4dbff40e7a multi-pack-index: implement 'expire' verb
     @@ -26,6 +26,62 @@
       diff --git a/midx.c b/midx.c
       --- a/midx.c
       +++ b/midx.c
     +@@
     + #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
     + #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
     + 
     ++#define PACK_EXPIRED UINT_MAX
     ++
     + static char *get_midx_filename(const char *object_dir)
     + {
     + 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
     +@@
     + 	uint32_t orig_pack_int_id;
     + 	char *pack_name;
     + 	struct packed_git *p;
     ++	unsigned expired : 1;
     + };
     + 
     + static int pack_info_compare(const void *_a, const void *_b)
     +@@
     + 
     + 		packs->info[packs->nr].pack_name = xstrdup(file_name);
     + 		packs->info[packs->nr].orig_pack_int_id = packs->nr;
     ++		packs->info[packs->nr].expired = 0;
     + 		packs->nr++;
     + 	}
     + }
     +@@
     + 	size_t written = 0;
     + 
     + 	for (i = 0; i < num_packs; i++) {
     +-		size_t writelen = strlen(info[i].pack_name) + 1;
     ++		size_t writelen;
     ++
     ++		if (info[i].expired)
     ++			continue;
     + 
     + 		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
     + 			BUG("incorrect pack-file order: %s before %s",
     + 			    info[i - 1].pack_name,
     + 			    info[i].pack_name);
     + 
     ++		writelen = strlen(info[i].pack_name) + 1;
     + 		hashwrite(f, info[i].pack_name, writelen);
     + 		written += writelen;
     + 	}
     +@@
     + 	for (i = 0; i < nr_objects; i++) {
     + 		struct pack_midx_entry *obj = list++;
     + 
     ++		if (perm[obj->pack_int_id] == PACK_EXPIRED)
     ++			BUG("object %s is in an expired pack with int-id %d",
     ++			    oid_to_hex(&obj->oid),
     ++			    obj->pack_int_id);
     ++
     + 		hashwrite_be32(f, perm[obj->pack_int_id]);
     + 
     + 		if (large_offset_needed && obj->offset >> 31)
      @@
       	return written;
       }
     @@ -37,9 +93,10 @@
       	unsigned char cur_chunk, num_chunks = 0;
       	char *midx_name;
      @@
     - 	uint32_t nr_entries, num_large_offsets = 0;
       	struct pack_midx_entry *entries = NULL;
       	int large_offsets_needed = 0;
     + 	int pack_name_concat_len = 0;
     ++	int dropped_packs = 0;
      +	int result = 0;
       
       	midx_name = get_midx_filename(object_dir);
     @@ -55,49 +112,87 @@
      +		packs.m = load_multi_pack_index(object_dir, 1);
       
       	packs.nr = 0;
     - 	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
     + 	packs.alloc = packs.m ? packs.m->num_packs : 16;
      @@
     - 	ALLOC_ARRAY(packs.perm, packs.alloc_perm);
     - 
     - 	if (packs.m) {
     -+		int drop_index = 0, missing_drops = 0;
     - 		for (i = 0; i < packs.m->num_packs; i++) {
     -+			if (packs_to_drop && drop_index < packs_to_drop->nr) {
     -+				int cmp = strcmp(packs.m->pack_names[i],
     -+						 packs_to_drop->items[drop_index].string);
     -+
     -+				if (!cmp) {
     -+					drop_index++;
     -+					continue;
     -+				} else if (cmp > 0) {
     -+					error(_("did not see pack-file %s to drop"),
     -+					      packs_to_drop->items[drop_index].string);
     -+					drop_index++;
     -+					i--;
     -+					missing_drops++;
     -+					continue;
     -+				}
     -+			}
     -+
     - 			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
     - 			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
     - 			ALLOC_GROW(packs.perm, packs.nr + 1, packs.alloc_perm);
     -@@
     - 			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
     + 			packs.info[packs.nr].orig_pack_int_id = i;
     + 			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
     + 			packs.info[packs.nr].p = NULL;
     ++			packs.info[packs.nr].expired = 0;
       			packs.nr++;
       		}
     + 	}
     + 
     + 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
     + 
     +-	if (packs.m && packs.nr == packs.m->num_packs)
     ++	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
     + 		goto cleanup;
     + 
     + 	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
     +@@
     + 
     + 	QSORT(packs.info, packs.nr, pack_info_compare);
     + 
     ++	if (packs_to_drop && packs_to_drop->nr) {
     ++		int drop_index = 0;
     ++		int missing_drops = 0;
     ++
     ++		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
     ++			int cmp = strcmp(packs.info[i].pack_name,
     ++					 packs_to_drop->items[drop_index].string);
     ++
     ++			if (!cmp) {
     ++				drop_index++;
     ++				packs.info[i].expired = 1;
     ++			} else if (cmp > 0) {
     ++				error(_("did not see pack-file %s to drop"),
     ++				      packs_to_drop->items[drop_index].string);
     ++				drop_index++;
     ++				missing_drops++;
     ++				i--;
     ++			} else {
     ++				packs.info[i].expired = 0;
     ++			}
     ++		}
      +
     -+		if (packs_to_drop && (drop_index < packs_to_drop->nr || missing_drops)) {
     -+			error(_("did not see all pack-files to drop"));
     ++		if (missing_drops) {
      +			result = 1;
      +			goto cleanup;
     ++		}
     ++	}
     ++
     + 	ALLOC_ARRAY(pack_perm, packs.nr);
     + 	for (i = 0; i < packs.nr; i++) {
     +-		pack_perm[packs.info[i].orig_pack_int_id] = i;
     ++		if (packs.info[i].expired) {
     ++			dropped_packs++;
     ++			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
     ++		} else {
     ++			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
      +		}
       	}
       
     - 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
     +-	for (i = 0; i < packs.nr; i++)
     +-		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
     ++	for (i = 0; i < packs.nr; i++) {
     ++		if (!packs.info[i].expired)
     ++			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
     ++	}
     + 
     + 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
     + 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
     +@@
     + 	cur_chunk = 0;
     + 	num_chunks = large_offsets_needed ? 5 : 4;
     + 
     +-	written = write_midx_header(f, num_chunks, packs.nr);
     ++	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
     + 
     + 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
     + 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
      @@
     - 	free(packs.perm);
       	free(entries);
     + 	free(pack_perm);
       	free(midx_name);
      -	return 0;
      +	return result;
     @@ -175,7 +270,10 @@
      +		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
      +		git multi-pack-index expire &&
      +		ls .git/objects/pack >actual &&
     -+		test_cmp expect actual
     ++		test_cmp expect actual &&
     ++		ls .git/objects/pack/ | grep idx >expect-idx &&
     ++		test-tool read-midx .git/objects | grep idx >actual-midx &&
     ++		test_cmp expect-idx actual-midx
      +	)
      +'
      +
  6:  af08e21c97 =  7:  b39f90ad09 multi-pack-index: prepare 'repack' subcommand
  7:  bef7aa007c =  8:  a4c2d5a8e1 midx: implement midx_repack()
  -:  ---------- >  9:  b97fb35ba9 multi-pack-index: test expire while adding packs

-- 
gitgitgadget


* [PATCH v3 1/9] repack: refactor pack deletion for future use
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

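The 'force_delete' parameter is enabled for the caller in repack, so
.keep files are removed as before. Leaving it disabled, which skips
packs marked with a .keep file, will be important for a future caller.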
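
As an illustration only, the behavior described above can be sketched
outside of Git. This is a minimal sketch, assuming a fixed-size buffer
in place of Git's strbuf API; the name sketch_unlink_pack() and its
return values are hypothetical and are not part of this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-alone analogue of unlink_pack_path().
 * Returns 1 if the files were unlinked, 0 if a .keep file protected
 * the pack, and -1 if the path does not fit the buffer.
 */
static int sketch_unlink_pack(const char *pack_name, int force_delete)
{
	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
	char buf[4096];
	size_t plen, i;

	assert(pack_name != NULL);
	plen = strlen(pack_name);

	/* Strip a trailing ".pack" so every extension shares one base name. */
	if (plen >= 5 && !strcmp(pack_name + plen - 5, ".pack"))
		plen -= 5;
	/* ".promisor" is the longest extension; leave room for it and the NUL. */
	if (plen + sizeof(".promisor") > sizeof(buf))
		return -1;
	memcpy(buf, pack_name, plen);

	if (!force_delete) {
		strcpy(buf + plen, ".keep");
		if (!access(buf, F_OK))
			return 0; /* the pack is marked as kept */
	}

	for (i = 0; i < sizeof(exts) / sizeof(exts[0]); i++) {
		strcpy(buf + plen, exts[i]);
		unlink(buf); /* a missing file is not an error here */
	}
	return 1;
}
```

With a .keep file next to the pack, calling the sketch with
force_delete == 0 leaves every file in place, while force_delete == 1
removes the pack together with its .keep file.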

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 14 ++------------
 packfile.c       | 28 ++++++++++++++++++++++++++++
 packfile.h       |  7 +++++++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-	int i;
 	struct strbuf buf = STRBUF_INIT;
-	size_t plen;
-
-	strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-	plen = buf.len;
-
-	for (i = 0; i < ARRAY_SIZE(exts); i++) {
-		strbuf_setlen(&buf, plen);
-		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
-	}
+	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
 	}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	int i;
+	struct strbuf buf = STRBUF_INIT;
+	size_t plen;
+
+	strbuf_addstr(&buf, pack_name);
+	strip_suffix_mem(buf.buf, &buf.len, ".pack");
+	plen = buf.len;
+
+	if (!force_delete) {
+		strbuf_addstr(&buf, ".keep");
+		if (!access(buf.buf, F_OK)) {
+			strbuf_release(&buf);
+			return;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		strbuf_setlen(&buf, plen);
+		strbuf_addstr(&buf, exts[i]);
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
gitgitgadget



* [PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 1/9] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -23,13 +23,13 @@ OPTIONS
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-	When given as the verb, write a new MIDX file to
-	`<dir>/packs/multi-pack-index`.
+	Write a new MIDX file.
 
 verify::
-	When given as the verb, verify the contents of the MIDX file
-	at `<dir>/packs/multi-pack-index`.
+	Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
gitgitgadget



* [PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 1/9] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 4/9] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to break ties. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the subcommand is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 47 ++++++++++++++++++++++++++
 5 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
 	Verify the contents of the MIDX file.
 
+expire::
+	Delete the pack-files that are tracked 	by the MIDX file, but
+	have no objects referenced by the MIDX. Rewrite the MIDX file
+	afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..948effc1ee 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,51 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v3 4/9] midx: simplify computation of pack name lengths
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:21     ` [PATCH v3 5/9] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.
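
The length computation can be sketched in isolation as follows. This
is only an illustration: padded_names_len() is a hypothetical helper,
and SKETCH_ALIGNMENT is an assumed stand-in for MIDX_CHUNK_ALIGNMENT,
whose actual value is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ALIGNMENT 4 /* assumed value, for illustration only */

/*
 * Sum the length of each name plus its NUL terminator, then round the
 * total up to the next multiple of SKETCH_ALIGNMENT.
 */
static size_t padded_names_len(const char **names, size_t nr)
{
	size_t len = 0, i;

	assert(names != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	if (len % SKETCH_ALIGNMENT)
		len += SKETCH_ALIGNMENT - (len % SKETCH_ALIGNMENT);
	return len;
}
```

Under that assumed alignment, the two names "pack-A.idx" and
"pack-B.idx" take 11 bytes each, 22 in total, which is padded to 24.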

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
command, simplify the computation by centralizing it to a single
loop before writing the file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..f087bbbe82 100644
--- a/midx.c
+++ b/midx.c
@@ -383,7 +383,6 @@ struct pack_list {
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
-	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
 
@@ -418,7 +417,6 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		}
 
 		packs->names[packs->nr] = xstrdup(file_name);
-		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
 	}
 }
@@ -762,6 +760,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int pack_name_concat_len = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -777,7 +776,6 @@ int write_midx_file(const char *object_dir)
 	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
-	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
@@ -788,7 +786,6 @@ int write_midx_file(const char *object_dir)
 
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
-			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
 	}
@@ -798,10 +795,6 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
-		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
-					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
-
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
@@ -814,6 +807,13 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	for (i = 0; i < packs.nr; i++)
+		pack_name_concat_len += strlen(packs.names[i]) + 1;
+
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -831,7 +831,7 @@ int write_midx_file(const char *object_dir)
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-- 
gitgitgadget



* [PATCH v3 5/9] midx: refactor permutation logic and pack sorting
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 4/9] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-23 21:00       ` Jonathan Tan
  2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  11 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().
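
The relationship between the sorted array and the permutation can be
shown with a small stand-alone sketch. The struct and function names
below are hypothetical, and qsort() stands in for Git's QSORT macro.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced version of 'struct pack_info'. */
struct sketch_pack_info {
	uint32_t orig_pack_int_id;
	const char *pack_name;
};

static int sketch_compare(const void *a_, const void *b_)
{
	const struct sketch_pack_info *a = a_;
	const struct sketch_pack_info *b = b_;
	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort the packs by name, then record for every original pack-int-id
 * the position that pack now holds in the sorted array.
 */
static void sketch_sort_and_permute(struct sketch_pack_info *info,
				    uint32_t nr, uint32_t *perm)
{
	uint32_t i;

	assert(info != NULL && perm != NULL);
	qsort(info, nr, sizeof(*info), sketch_compare);
	for (i = 0; i < nr; i++)
		perm[info[i].orig_pack_int_id] = i;
}
```

Object entries keep their original pack-int-id while they are
collected, and perm[] is applied only at the moment the id is written.
For packs named C, A, B with original ids 0, 1, 2, the sketch yields
perm = {2, 0, 1}.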

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 150 ++++++++++++++++++++++++---------------------------------
 1 file changed, 63 insertions(+), 87 deletions(-)

diff --git a/midx.c b/midx.c
index f087bbbe82..7ae5275c25 100644
--- a/midx.c
+++ b/midx.c
@@ -377,12 +377,23 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_info {
+	uint32_t orig_pack_int_id;
+	char *pack_name;
+	struct packed_git *p;
+};
+
+static int pack_info_compare(const void *_a, const void *_b)
+{
+	struct pack_info *a = (struct pack_info *)_a;
+	struct pack_info *b = (struct pack_info *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
 struct pack_list {
-	struct packed_git **list;
-	char **names;
+	struct pack_info *info;
 	uint32_t nr;
-	uint32_t alloc_list;
-	uint32_t alloc_names;
+	uint32_t alloc;
 	struct multi_pack_index *m;
 };
 
@@ -395,66 +406,32 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		if (packs->m && midx_contains_pack(packs->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
-		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
 
-		packs->list[packs->nr] = add_packed_git(full_path,
-							full_path_len,
-							0);
+		packs->info[packs->nr].p = add_packed_git(full_path,
+							  full_path_len,
+							  0);
 
-		if (!packs->list[packs->nr]) {
+		if (!packs->info[packs->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->list[packs->nr])) {
+		if (open_pack_index(packs->info[packs->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->list[packs->nr]);
-			FREE_AND_NULL(packs->list[packs->nr]);
+			close_pack(packs->info[packs->nr].p);
+			FREE_AND_NULL(packs->info[packs->nr].p);
 			return;
 		}
 
-		packs->names[packs->nr] = xstrdup(file_name);
+		packs->info[packs->nr].pack_name = xstrdup(file_name);
+		packs->info[packs->nr].orig_pack_int_id = packs->nr;
 		packs->nr++;
 	}
 }
 
-struct pack_pair {
-	uint32_t pack_int_id;
-	char *pack_name;
-};
-
-static int pack_pair_compare(const void *_a, const void *_b)
-{
-	struct pack_pair *a = (struct pack_pair *)_a;
-	struct pack_pair *b = (struct pack_pair *)_b;
-	return strcmp(a->pack_name, b->pack_name);
-}
-
-static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
-{
-	uint32_t i;
-	struct pack_pair *pairs;
-
-	ALLOC_ARRAY(pairs, nr_packs);
-
-	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
-		pairs[i].pack_name = pack_names[i];
-	}
-
-	QSORT(pairs, nr_packs, pack_pair_compare);
-
-	for (i = 0; i < nr_packs; i++) {
-		pack_names[i] = pairs[i].pack_name;
-		perm[pairs[i].pack_int_id] = i;
-	}
-
-	free(pairs);
-}
-
 struct pack_midx_entry {
 	struct object_id oid;
 	uint32_t pack_int_id;
@@ -480,7 +457,6 @@ static int midx_oid_compare(const void *_a, const void *_b)
 }
 
 static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
-				      uint32_t *pack_perm,
 				      struct pack_midx_entry *e,
 				      uint32_t pos)
 {
@@ -488,7 +464,7 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
 		return 1;
 
 	nth_midxed_object_oid(&e->oid, m, pos);
-	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->pack_int_id = nth_midxed_pack_int_id(m, pos);
 	e->offset = nth_midxed_offset(m, pos);
 
 	/* consider objects in midx to be from "old" packs */
@@ -522,8 +498,7 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * of a packfile containing the object).
  */
 static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
-						  struct packed_git **p,
-						  uint32_t *perm,
+						  struct pack_info *info,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
 {
@@ -534,7 +509,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	uint32_t start_pack = m ? m->num_packs : 0;
 
 	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++)
-		total_objects += p[cur_pack]->num_objects;
+		total_objects += info[cur_pack].p->num_objects;
 
 	/*
 	 * As we de-duplicate by fanout value, we expect the fanout
@@ -559,7 +534,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				nth_midxed_pack_midx_entry(m, perm,
+				nth_midxed_pack_midx_entry(m,
 							   &entries_by_fanout[nr_fanout],
 							   cur_object);
 				nr_fanout++;
@@ -570,12 +545,12 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
-				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
-			end = get_pack_fanout(p[cur_pack], cur_fanout);
+				start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1);
+			end = get_pack_fanout(info[cur_pack].p, cur_fanout);
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, &entries_by_fanout[nr_fanout]);
 				nr_fanout++;
 			}
 		}
@@ -604,7 +579,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 }
 
 static size_t write_midx_pack_names(struct hashfile *f,
-				    char **pack_names,
+				    struct pack_info *info,
 				    uint32_t num_packs)
 {
 	uint32_t i;
@@ -612,14 +587,14 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(pack_names[i]) + 1;
+		size_t writelen = strlen(info[i].pack_name) + 1;
 
-		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    pack_names[i - 1],
-			    pack_names[i]);
+			    info[i - 1].pack_name,
+			    info[i].pack_name);
 
-		hashwrite(f, pack_names[i], writelen);
+		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -690,6 +665,7 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 }
 
 static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					uint32_t *perm,
 					struct pack_midx_entry *objects, uint32_t nr_objects)
 {
 	struct pack_midx_entry *list = objects;
@@ -699,7 +675,7 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		hashwrite_be32(f, obj->pack_int_id);
+		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
@@ -772,20 +748,17 @@ int write_midx_file(const char *object_dir)
 	packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
-	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
-	packs.list = NULL;
-	packs.names = NULL;
-	ALLOC_ARRAY(packs.list, packs.alloc_list);
-	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	packs.alloc = packs.m ? packs.m->num_packs : 16;
+	packs.info = NULL;
+	ALLOC_ARRAY(packs.info, packs.alloc);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
-			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
 
-			packs.list[packs.nr] = NULL;
-			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].orig_pack_int_id = i;
+			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].p = NULL;
 			packs.nr++;
 		}
 	}
@@ -795,10 +768,7 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
-
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -807,8 +777,15 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	QSORT(packs.info, packs.nr, pack_info_compare);
+
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	for (i = 0; i < packs.nr; i++) {
+		pack_perm[packs.info[i].orig_pack_int_id] = i;
+	}
+
 	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.names[i]) + 1;
+		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -879,7 +856,7 @@ int write_midx_file(const char *object_dir)
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.names, packs.nr);
+				written += write_midx_pack_names(f, packs.info, packs.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -891,7 +868,7 @@ int write_midx_file(const char *object_dir)
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -914,15 +891,14 @@ int write_midx_file(const char *object_dir)
 
 cleanup:
 	for (i = 0; i < packs.nr; i++) {
-		if (packs.list[i]) {
-			close_pack(packs.list[i]);
-			free(packs.list[i]);
+		if (packs.info[i].p) {
+			close_pack(packs.info[i].p);
+			free(packs.info[i].p);
 		}
-		free(packs.names[i]);
+		free(packs.info[i].pack_name);
 	}
 
-	free(packs.list);
-	free(packs.names);
+	free(packs.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH v3 6/9] multi-pack-index: implement 'expire' verb
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 5/9] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:54       ` SZEDER Gábor
  2019-01-23 22:13       ` Jonathan Tan
  2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  11 siblings, 2 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git multi-pack-index expire' command looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.
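
The dropping step can be sketched as a walk over two name-sorted
lists followed by a permutation that skips expired packs. This is a
simplified variant and not the loop used in the patch: the names are
hypothetical, and unlike the patch it also reports drop names that
sort after every known pack.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PACK_EXPIRED UINT32_MAX /* stand-in for PACK_EXPIRED */

struct sketch_pack {
	uint32_t orig_id;
	const char *name;
	unsigned expired : 1;
};

/*
 * 'packs' and 'drop' must both be sorted by name. Marks the dropped
 * packs, fills perm[orig_id] with the new position of each surviving
 * pack (or SKETCH_PACK_EXPIRED), and returns the number of surviving
 * packs, or -1 if a name in 'drop' is not a known pack.
 */
static int sketch_drop_packs(struct sketch_pack *packs, uint32_t nr,
			     const char **drop, uint32_t drop_nr,
			     uint32_t *perm)
{
	uint32_t i = 0, d = 0, dropped = 0;

	assert(perm != NULL);
	while (i < nr && d < drop_nr) {
		int cmp = strcmp(packs[i].name, drop[d]);

		if (!cmp) {
			packs[i++].expired = 1;
			d++;
		} else if (cmp > 0) {
			return -1; /* drop[d] was skipped over: unknown pack */
		} else {
			packs[i++].expired = 0;
		}
	}
	if (d < drop_nr)
		return -1;
	for (; i < nr; i++)
		packs[i].expired = 0;

	for (i = 0; i < nr; i++) {
		if (packs[i].expired) {
			dropped++;
			perm[packs[i].orig_id] = SKETCH_PACK_EXPIRED;
		} else {
			perm[packs[i].orig_id] = i - dropped;
		}
	}
	return (int)(nr - dropped);
}
```

Subtracting the running count of dropped packs keeps the surviving
positions contiguous, which is why the header can record
packs.nr - dropped_packs as the pack count.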

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.
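
The counting step that selects packs for removal can be sketched as
below. This assumes the pack-int-id of every indexed object is already
available as a plain array; the helper name and signature are
hypothetical and do not reflect the actual expire_midx_packs() code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Count how many indexed objects point at each pack and collect the
 * ids of packs that no object refers to. 'unreferenced' must have
 * room for num_packs entries. Returns the number of ids written.
 */
static uint32_t sketch_find_unreferenced(const uint32_t *object_pack_ids,
					 uint32_t num_objects,
					 uint32_t num_packs,
					 uint32_t *unreferenced)
{
	uint32_t *count = calloc(num_packs ? num_packs : 1, sizeof(*count));
	uint32_t i, n = 0;

	assert(count != NULL);
	for (i = 0; i < num_objects; i++) {
		assert(object_pack_ids[i] < num_packs);
		count[object_pack_ids[i]]++;
	}

	for (i = 0; i < num_packs; i++)
		if (!count[i])
			unreferenced[n++] = i;

	free(count);
	return n;
}
```

For four packs and objects referencing only packs 0 and 2, the sketch
reports packs 1 and 3 as candidates for deletion.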

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 120 +++++++++++++++++++++++++++++++++---
 t/t5319-multi-pack-index.sh |  18 ++++++
 2 files changed, 128 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 7ae5275c25..6ccbec3f19 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,8 @@
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+#define PACK_EXPIRED UINT_MAX
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -381,6 +383,7 @@ struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
+	unsigned expired : 1;
 };
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -428,6 +431,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		packs->info[packs->nr].pack_name = xstrdup(file_name);
 		packs->info[packs->nr].orig_pack_int_id = packs->nr;
+		packs->info[packs->nr].expired = 0;
 		packs->nr++;
 	}
 }
@@ -587,13 +591,17 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(info[i].pack_name) + 1;
+		size_t writelen;
+
+		if (info[i].expired)
+			continue;
 
 		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
 			    info[i - 1].pack_name,
 			    info[i].pack_name);
 
+		writelen = strlen(info[i].pack_name) + 1;
 		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
@@ -675,6 +683,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
+		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+			BUG("object %s is in an expired pack with int-id %d",
+			    oid_to_hex(&obj->oid),
+			    obj->pack_int_id);
+
 		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
@@ -721,7 +734,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -737,6 +751,8 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
+	int dropped_packs = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc = packs.m ? packs.m->num_packs : 16;
@@ -759,13 +778,14 @@ int write_midx_file(const char *object_dir)
 			packs.info[packs.nr].orig_pack_int_id = i;
 			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
 			packs.info[packs.nr].p = NULL;
+			packs.info[packs.nr].expired = 0;
 			packs.nr++;
 		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
-	if (packs.m && packs.nr == packs.m->num_packs)
+	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
 	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
@@ -779,13 +799,48 @@ int write_midx_file(const char *object_dir)
 
 	QSORT(packs.info, packs.nr, pack_info_compare);
 
+	if (packs_to_drop && packs_to_drop->nr) {
+		int drop_index = 0;
+		int missing_drops = 0;
+
+		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(packs.info[i].pack_name,
+					 packs_to_drop->items[drop_index].string);
+
+			if (!cmp) {
+				drop_index++;
+				packs.info[i].expired = 1;
+			} else if (cmp > 0) {
+				error(_("did not see pack-file %s to drop"),
+				      packs_to_drop->items[drop_index].string);
+				drop_index++;
+				missing_drops++;
+				i--;
+			} else {
+				packs.info[i].expired = 0;
+			}
+		}
+
+		if (missing_drops) {
+			result = 1;
+			goto cleanup;
+		}
+	}
+
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	for (i = 0; i < packs.nr; i++) {
-		pack_perm[packs.info[i].orig_pack_int_id] = i;
+		if (packs.info[i].expired) {
+			dropped_packs++;
+			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+		} else {
+			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+		}
 	}
 
-	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < packs.nr; i++) {
+		if (!packs.info[i].expired)
+			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -801,7 +856,7 @@ int write_midx_file(const char *object_dir)
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	written = write_midx_header(f, num_chunks, packs.nr);
+	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -902,7 +957,12 @@ int write_midx_file(const char *object_dir)
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1004,5 +1064,45 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	for (i = 0; i < m->num_packs; i++) {
+		char *pack_name;
+
+		if (count[i])
+			continue;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		if (m->packs[i]->pack_keep)
+			continue;
+
+		pack_name = xstrdup(m->packs[i]->pack_name);
+		close_pack(m->packs[i]);
+		FREE_AND_NULL(m->packs[i]);
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+		unlink_pack_path(pack_name, 0);
+		free(pack_name);
+	}
+
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 948effc1ee..f55a60a89c 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -395,4 +395,22 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual &&
+		ls .git/objects/pack/ | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		test_cmp expect-idx actual-midx
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-09 15:56       ` SZEDER Gábor
  2019-01-23 22:38       ` Jonathan Tan
  2019-01-09 15:21     ` [PATCH v3 8/9] midx: implement midx_repack() Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  11 siblings, 2 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index is most useful in environments that have many
pack-files and cannot repack the object store into a single
pack-file. However, it is likely that many of these pack-files are
rather small and could be repacked into a slightly larger pack-file
without too much effort. It may also be important to ensure the
object store is highly available and that the repack operation does
not interrupt concurrent Git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 11 +++++++++++
 builtin/multi-pack-index.c             | 10 +++++++++-
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 11 +++++++++++
 5 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..cc63531cc0 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@ expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Collect a batch of pack-files whose size are all at most the
+	size given by --batch-size, but whose sizes sum to larger
+	than --batch-size. The batch is selected by greedily adding
+	small pack-files starting with the oldest pack-files that fit
+	the size. Create a new pack-file containing the objects the
+	multi-pack-index indexes into those pack-files, and rewrite
+	the multi-pack-index to contain that pack-file. A later run
+	of 'git multi-pack-index expire' will delete the pack-files
+	that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..d87a2235e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' verb"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 6ccbec3f19..30ff4430ab 100644
--- a/midx.c
+++ b/midx.c
@@ -1106,3 +1106,8 @@ int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index f55a60a89c..d64767600a 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -413,4 +413,15 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH v3 8/9] midx: implement midx_repack()
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-23 22:33       ` Jonathan Tan
  2019-01-09 15:21     ` [PATCH v3 9/9] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  11 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given batch size, until the sizes of the listed packs sum to at
least the batch size. Finally, collect the objects from the
multi-pack-index that are in those packs and send them to
'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.
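
The selection step described above can be sketched in isolation. This is
a minimal illustration, not the code from the patch: 'struct pack_stat',
'by_mtime()' and 'select_batch()' are hypothetical names, and the sketch
assumes pack ids run from 0 to nr - 1 so they can index 'include'.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fields read from struct packed_git. */
struct pack_stat {
	uint32_t id;	/* position of the pack in the multi-pack-index */
	long mtime;	/* modified time of the pack-file */
	size_t size;	/* on-disk size of the pack-file */
};

static int by_mtime(const void *a_, const void *b_)
{
	const struct pack_stat *a = a_, *b = b_;

	if (a->mtime < b->mtime)
		return -1;
	if (a->mtime > b->mtime)
		return 1;
	return 0;
}

/*
 * Sort packs oldest first, then mark packs smaller than batch_size
 * (include[id] = 1) until their sizes sum to at least batch_size.
 * Returns the number of packs selected, or 0 (with include[] cleared)
 * if the batch cannot be filled or would hold fewer than two packs.
 */
static uint32_t select_batch(struct pack_stat *packs, uint32_t nr,
			     size_t batch_size, unsigned char *include)
{
	size_t total = 0;
	uint32_t i, selected = 0;

	qsort(packs, nr, sizeof(*packs), by_mtime);
	for (i = 0; i < nr; i++)
		include[packs[i].id] = 0;

	for (i = 0; total < batch_size && i < nr; i++) {
		if (packs[i].size >= batch_size)
			continue; /* too large to be part of a batch */
		include[packs[i].id] = 1;
		total += packs[i].size;
		selected++;
	}

	if (total < batch_size || selected < 2) {
		for (i = 0; i < nr; i++)
			include[packs[i].id] = 0;
		return 0;
	}
	return selected;
}
```

With five packs of sizes 40, 10, 500, 30 and 25 (ids 0 to 4) and modified
times 300, 100, 200, 400 and 150, a batch size of 50 selects packs 1, 4
and 0. A batch size equal to the smallest pack size selects nothing,
because every pack is then at least as large as the batch size.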

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 109 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  25 +++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 30ff4430ab..f0f16690e6 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1107,7 +1108,113 @@ int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+	timestamp_t mtime;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct time_and_id *a, *b;
+
+	a = (const struct time_and_id *)a_;
+	b = (const struct time_and_id *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
 	return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct time_and_id *pack_ti;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_ti[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_ti[i].mtime = m->packs[i]->mtime;
+	}
+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_ti[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+
+		if (!p)
+			continue;
+		if (p->pack_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += p->pack_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	free(pack_ti);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index d64767600a..99afb5ec51 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -424,4 +424,29 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 5 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH v3 9/9] multi-pack-index: test expire while adding packs
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 8/9] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2019-01-09 15:21     ` Derrick Stolee via GitGitGadget
  2019-01-17 15:27     ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee
                       ` (2 subsequent siblings)
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-09 15:21 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 99afb5ec51..1213549d25 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -449,4 +449,36 @@ test_expect_success 'expire removes repacked packs' '
 	)
 '
 
+test_expect_success 'expire works when adding new packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index expire &&
+		ls .git/objects/pack/ | grep idx >expect &&
+		test-tool read-midx .git/objects | grep idx >actual &&
+		test_cmp expect actual &&
+		git multi-pack-index verify
+	)
+'
+
 test_done
-- 
gitgitgadget

* Re: [PATCH v3 6/9] multi-pack-index: implement 'expire' verb
  2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
@ 2019-01-09 15:54       ` SZEDER Gábor
  2019-01-10 18:05         ` Junio C Hamano
  2019-01-23 22:13       ` Jonathan Tan
  1 sibling, 1 reply; 89+ messages in thread
From: SZEDER Gábor @ 2019-01-09 15:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

On Wed, Jan 09, 2019 at 07:21:16AM -0800, Derrick Stolee via GitGitGadget wrote:
> The 'git multi-pack-index expire' command ...

The subject line could use a s/verb/subcommand/.


* Re: [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
  2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-09 15:56       ` SZEDER Gábor
  2019-01-23 22:38       ` Jonathan Tan
  1 sibling, 0 replies; 89+ messages in thread
From: SZEDER Gábor @ 2019-01-09 15:56 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, Junio C Hamano, Derrick Stolee

On Wed, Jan 09, 2019 at 07:21:17AM -0800, Derrick Stolee via GitGitGadget wrote:
> Introduce a 'repack' subcommand to 'git multi-pack-index' that
> takes a '--batch-size' option. The verb will inspect the

s/verb/subcommand/

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 145de3a46c..d87a2235e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c

> @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
>  		return 1;
>  	}
>  
> +	if (!strcmp(argv[0], "repack"))
> +		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
> +	if (opts.batch_size)
> +		die(_("--batch-size option is only for 'repack' verb"));

Likewise.


* Re: [PATCH v3 6/9] multi-pack-index: implement 'expire' verb
  2019-01-09 15:54       ` SZEDER Gábor
@ 2019-01-10 18:05         ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-01-10 18:05 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, sbeller, peff, jrnieder,
	avarab, Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Wed, Jan 09, 2019 at 07:21:16AM -0800, Derrick Stolee via GitGitGadget wrote:
>> The 'git multi-pack-index expire' command ...
>
> The subject line could use a s/verb/subcommand/.

Yeah, that probably is more in line with the existing terminology
for other Git commands.

* Re: [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2019-01-09 15:21     ` [PATCH v3 9/9] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
@ 2019-01-17 15:27     ` Derrick Stolee
  2019-01-23 22:44     ` Jonathan Tan
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-01-17 15:27 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: sbeller, peff, jrnieder, avarab, Junio C Hamano, SZEDER Gábor

I've seen comments about some leftover uses of the word 'verb' instead 
of 'subcommand'. I can definitely clean those up, but I'm hoping that 
more comments could be made on the technical aspects of this series.

Thanks,
-Stolee

* Re: [PATCH v3 5/9] midx: refactor permutation logic and pack sorting
  2019-01-09 15:21     ` [PATCH v3 5/9] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
@ 2019-01-23 21:00       ` Jonathan Tan
  2019-01-24 17:34         ` Derrick Stolee
  0 siblings, 1 reply; 89+ messages in thread
From: Jonathan Tan @ 2019-01-23 21:00 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee, Jonathan Tan

Following Stolee's wishes [1], I'll stick to the technical aspects here.
Patches 1-4 look correct technically to me, so let me start here.

[1] https://public-inbox.org/git/3aa0a7ea-6c30-2c61-0815-2b9ab8304564@gmail.com/

> +struct pack_info {
> +	uint32_t orig_pack_int_id;
> +	char *pack_name;
> +	struct packed_git *p;
> +};
> +
> +static int pack_info_compare(const void *_a, const void *_b)
> +{
> +	struct pack_info *a = (struct pack_info *)_a;
> +	struct pack_info *b = (struct pack_info *)_b;
> +	return strcmp(a->pack_name, b->pack_name);
> +}
> +
>  struct pack_list {
> -	struct packed_git **list;
> -	char **names;
> +	struct pack_info *info;
>  	uint32_t nr;
> -	uint32_t alloc_list;
> -	uint32_t alloc_names;
> +	uint32_t alloc;
>  	struct multi_pack_index *m;
>  };

"list" and "names" look like parallel arrays, but as far as I can tell,
they are actually not, because "names" is eventually sorted but "list"
is not. So combining these two arrays is great, and removes a lot of
confusion.

Now the packed_git list will be sorted too when the names are sorted; I
was surprised that code did not need to be changed to take this into
account, but I see now that it's because the struct packed_gits don't
need to be referenced anymore after get_sorted_entries().

> -struct pack_pair {
> -	uint32_t pack_int_id;
> -	char *pack_name;
> -};
> -
> -static int pack_pair_compare(const void *_a, const void *_b)
> -{
> -	struct pack_pair *a = (struct pack_pair *)_a;
> -	struct pack_pair *b = (struct pack_pair *)_b;
> -	return strcmp(a->pack_name, b->pack_name);
> -}
> -
> -static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
> -{
> -	uint32_t i;
> -	struct pack_pair *pairs;
> -
> -	ALLOC_ARRAY(pairs, nr_packs);
> -
> -	for (i = 0; i < nr_packs; i++) {
> -		pairs[i].pack_int_id = i;
> -		pairs[i].pack_name = pack_names[i];
> -	}
> -
> -	QSORT(pairs, nr_packs, pack_pair_compare);
> -
> -	for (i = 0; i < nr_packs; i++) {
> -		pack_names[i] = pairs[i].pack_name;
> -		perm[pairs[i].pack_int_id] = i;
> -	}
> -
> -	free(pairs);
> -}

It's nice that this gets removed.

>  static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
> -				      uint32_t *pack_perm,
>  				      struct pack_midx_entry *e,
>  				      uint32_t pos)
>  {
> @@ -488,7 +464,7 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
>  		return 1;
>  
>  	nth_midxed_object_oid(&e->oid, m, pos);
> -	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
> +	e->pack_int_id = nth_midxed_pack_int_id(m, pos);
>  	e->offset = nth_midxed_offset(m, pos);

nth_midxed_pack_midx_entry() is only called from get_sorted_entries(),
which is now called *before* any sorting of pack_info, so there is no
longer any need for considering pack_perm. As to why
get_sorted_entries() needs to be called before sorting of pack_info and
not after, that question will be answered later.

> @@ -522,8 +498,7 @@ static void fill_pack_entry(uint32_t pack_int_id,
>   * of a packfile containing the object).
>   */
>  static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
> -						  struct packed_git **p,
> -						  uint32_t *perm,
> +						  struct pack_info *info,
>  						  uint32_t nr_packs,
>  						  uint32_t *nr_objects)

[snip rest of get_sorted_entries]

As stated above, get_sorted_entries() is now called before any sorting
of pack_info, so there is no need for perm.

>  static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
> +					uint32_t *perm,
>  					struct pack_midx_entry *objects, uint32_t nr_objects)
>  {
>  	struct pack_midx_entry *list = objects;
> @@ -699,7 +675,7 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
>  	for (i = 0; i < nr_objects; i++) {
>  		struct pack_midx_entry *obj = list++;
>  
> -		hashwrite_be32(f, obj->pack_int_id);
> +		hashwrite_be32(f, perm[obj->pack_int_id]);
>  
>  		if (large_offset_needed && obj->offset >> 31)
>  			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);

This was unexpected to me - why is perm introduced here when it wasn't
needed before? This is because get_sorted_entries() is now called before
any sorting of pack_info. get_sorted_entries() generates entries using
the old pack_int_id, but the resulting multi-pack-index is written using
new pack numbers, so when writing the entries to the file, they must be
written using the new numbers.

The mapping of perm is: perm[old number] = new number.
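
A minimal sketch of that mapping, assuming a simplified pack_info with
only the two fields that matter here. 'struct pack_info_sketch',
'by_name()' and 'build_perm()' are hypothetical names used for
illustration, not identifiers from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced version of struct pack_info. */
struct pack_info_sketch {
	uint32_t orig_pack_int_id;	/* id used while collecting entries */
	const char *pack_name;
};

static int by_name(const void *a_, const void *b_)
{
	const struct pack_info_sketch *a = a_, *b = b_;

	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort info[] by pack name and record, for every old id, the position
 * the pack ends up at: perm[old number] = new number.
 * perm must have room for nr entries.
 */
static void build_perm(struct pack_info_sketch *info, uint32_t nr,
		       uint32_t *perm)
{
	uint32_t i;

	qsort(info, nr, sizeof(*info), by_name);
	for (i = 0; i < nr; i++)
		perm[info[i].orig_pack_int_id] = i;
}
```

For packs collected in the order pack-c.idx, pack-a.idx, pack-b.idx (old
ids 0, 1, 2), sorting by name gives perm = {2, 0, 1}, so an entry that
was recorded against old id 0 is written with pack number 2.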

> @@ -795,10 +768,7 @@ int write_midx_file(const char *object_dir)
>  	if (packs.m && packs.nr == packs.m->num_packs)
>  		goto cleanup;
>  
> -	ALLOC_ARRAY(pack_perm, packs.nr);
> -	sort_packs_by_name(packs.names, packs.nr, pack_perm);
> -
> -	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
> +	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
>  
>  	for (i = 0; i < nr_entries; i++) {
>  		if (entries[i].offset > 0x7fffffff)
> @@ -807,8 +777,15 @@ int write_midx_file(const char *object_dir)
>  			large_offsets_needed = 1;
>  	}
>  
> +	QSORT(packs.info, packs.nr, pack_info_compare);
> +
> +	ALLOC_ARRAY(pack_perm, packs.nr);
> +	for (i = 0; i < packs.nr; i++) {
> +		pack_perm[packs.info[i].orig_pack_int_id] = i;
> +	}
> +
>  	for (i = 0; i < packs.nr; i++)
> -		pack_name_concat_len += strlen(packs.names[i]) + 1;
> +		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
>  
>  	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
>  		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -

Indeed, the sorting of pack_info is moved to after get_sorted_entries().
Also, pack_perm[old number] = new number, as expected.

I think a comment explaining why the perm is needed would be helpful -
something explaining that the entries were generated using the old pack
numbers, so we need this mapping to be able to write them using the new
numbers.

As part of this review, I did attempt to make everything use the sorted
pack_info order, but failed. I got as far as making get_sorted_entries()
always use start_pack=0 and skip the pack_infos that didn't have the p
pointer (to match the existing behavior of get_sorted_entries() that
only operates on new pack_infos being added by add_pack_to_midx), but it
still didn't work, and I didn't investigate further.

* Re: [PATCH v3 6/9] multi-pack-index: implement 'expire' verb
  2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
  2019-01-09 15:54       ` SZEDER Gábor
@ 2019-01-23 22:13       ` Jonathan Tan
  2019-01-24 17:36         ` Derrick Stolee
  1 sibling, 1 reply; 89+ messages in thread
From: Jonathan Tan @ 2019-01-23 22:13 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee, Jonathan Tan

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git multi-pack-index expire' command looks at the existing
> multi-pack-index, counts the number of objects referenced in each
> pack-file, deletes the pack-files with no referenced objects, and
> rewrites the multi-pack-index to no longer reference those packs.

Thanks - this was quite straightforwardly written.
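
The counting step in the quoted description can be sketched on its own.
This is an illustration under stated assumptions, not the patch itself:
'object_pack' is a hypothetical flat array standing in for repeated
calls to nth_midxed_pack_int_id(), and 'count_objects_per_pack()' is a
made-up name.

```c
#include <stdint.h>
#include <string.h>

/*
 * object_pack[i] is the id of the pack that the index selects for
 * object i. Fill count[p] with the number of objects referenced in
 * pack p and return how many packs have no referenced objects.
 * count must have room for nr_packs entries.
 */
static uint32_t count_objects_per_pack(const uint32_t *object_pack,
				       uint32_t nr_objects,
				       uint32_t nr_packs,
				       uint32_t *count)
{
	uint32_t i, unreferenced = 0;

	memset(count, 0, nr_packs * sizeof(*count));
	for (i = 0; i < nr_objects; i++)
		count[object_pack[i]]++;

	for (i = 0; i < nr_packs; i++)
		if (!count[i])
			unreferenced++;
	return unreferenced;
}
```

A pack whose count stays at zero is only a candidate for deletion. The
patch additionally skips packs that cannot be prepared and packs marked
with pack_keep.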

> @@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir)
>  			  midx_name);
>  	}
>  
> -	packs.m = load_multi_pack_index(object_dir, 1);
> +	if (m)
> +		packs.m = m;
> +	else
> +		packs.m = load_multi_pack_index(object_dir, 1);

If we already loaded the m, we can just pass it in - OK.

> +	if (packs_to_drop && packs_to_drop->nr) {
> +		int drop_index = 0;
> +		int missing_drops = 0;
> +
> +		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
> +			int cmp = strcmp(packs.info[i].pack_name,
> +					 packs_to_drop->items[drop_index].string);
> +
> +			if (!cmp) {
> +				drop_index++;
> +				packs.info[i].expired = 1;
> +			} else if (cmp > 0) {
> +				error(_("did not see pack-file %s to drop"),
> +				      packs_to_drop->items[drop_index].string);
> +				drop_index++;
> +				missing_drops++;
> +				i--;
> +			} else {
> +				packs.info[i].expired = 0;
> +			}
> +		}
> +
> +		if (missing_drops) {
> +			result = 1;
> +			goto cleanup;
> +		}
> +	}

This takes into account that packfiles can shift while we run this
command, I see. Other than that, this is the common pattern for
iterating through two sorted arrays, one of which is a subsequence of
the other.

And indeed packs_to_drop is a sorted list, because we use
string_list_insert() below.
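
To make the pattern concrete, here is a minimal standalone sketch of the
lockstep walk over two name-sorted arrays. The names named_pack and
mark_expired are hypothetical and not part of the patch. Unlike the
quoted loop, the sketch also counts drop names left over once the pack
list runs out; that is a choice made for the illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's pack_info; only the fields used here. */
struct named_pack {
	const char *name;
	int expired;
};

/*
 * Walk two name-sorted arrays in lockstep. Every name in 'drops' is
 * expected to appear in 'packs'. Marks matching packs as expired and
 * returns the number of drop names that were not found.
 */
static size_t mark_expired(struct named_pack *packs, size_t nr_packs,
			   const char **drops, size_t nr_drops)
{
	size_t i = 0, d = 0, missing = 0;

	while (i < nr_packs && d < nr_drops) {
		int cmp = strcmp(packs[i].name, drops[d]);

		if (!cmp) {
			packs[i].expired = 1;
			i++;
			d++;
		} else if (cmp > 0) {
			/* drops[d] sorts before packs[i], so it is not in 'packs'. */
			missing++;
			d++;
		} else {
			packs[i].expired = 0;
			i++;
		}
	}

	/* Packs after the last drop name are kept. */
	for (; i < nr_packs; i++)
		packs[i].expired = 0;

	/* Drop names remaining after 'packs' ran out were never seen. */
	return missing + (nr_drops - d);
}
```

Because both inputs are sorted, each element is visited once, so the
walk is linear in the combined length of the two lists.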

>  	ALLOC_ARRAY(pack_perm, packs.nr);
>  	for (i = 0; i < packs.nr; i++) {
> -		pack_perm[packs.info[i].orig_pack_int_id] = i;
> +		if (packs.info[i].expired) {
> +			dropped_packs++;
> +			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
> +		} else {
> +			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
> +		}

Here...

>  	}
>  
> -	for (i = 0; i < packs.nr; i++)
> -		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
> +	for (i = 0; i < packs.nr; i++) {
> +		if (!packs.info[i].expired)
> +			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
> +	}

...and here and elsewhere, we have to contend with the fact that
packs.info has pack_info that we don't want to write. I think it would
be slightly better to filter out the expired ones from packs.info, and
then when generating pack_perm, first memset it to 0xff. This way, we
wouldn't have to check expiry everywhere. But I don't feel too strongly
about this.
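
A rough sketch of what I mean, written as a standalone function rather
than against midx.c. The struct perm_pack, the function name and the
PACK_EXPIRED value below are assumptions made for the illustration;
the sketch only relies on memset() with 0xff producing an all-ones
uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down pack_info. */
struct perm_pack {
	uint32_t orig_pack_int_id;
	int expired;
};

/* Assumed sentinel: all bits set, which is what memset(..., 0xff, ...) yields. */
#define PACK_EXPIRED UINT32_MAX

/*
 * Compact 'packs' in place so that only non-expired entries remain,
 * then fill 'perm' (indexed by old id, 'nr_orig' slots) with new ids.
 * Returns the number of packs kept.
 */
static uint32_t filter_and_permute(struct perm_pack *packs, uint32_t nr,
				   uint32_t *perm, uint32_t nr_orig)
{
	uint32_t i, kept = 0;

	/* Keep the relative order of the surviving packs. */
	for (i = 0; i < nr; i++) {
		if (!packs[i].expired)
			packs[kept++] = packs[i];
	}

	/* Every old id starts out as "expired"... */
	memset(perm, 0xff, nr_orig * sizeof(*perm));

	/* ...and only the survivors get a real new id. */
	for (i = 0; i < kept; i++)
		perm[packs[i].orig_pack_int_id] = i;

	return kept;
}
```

After this, later loops could iterate over the first 'kept' entries
without looking at the expired flag at all.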

>  int expire_midx_packs(const char *object_dir)
>  {
> -	return 0;
> +	uint32_t i, *count, result = 0;
> +	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
> +	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
> +
> +	if (!m)
> +		return 0;
> +
> +	count = xcalloc(m->num_packs, sizeof(uint32_t));
> +	for (i = 0; i < m->num_objects; i++) {
> +		int pack_int_id = nth_midxed_pack_int_id(m, i);
> +		count[pack_int_id]++;
> +	}
> +
> +	for (i = 0; i < m->num_packs; i++) {
> +		char *pack_name;
> +
> +		if (count[i])
> +			continue;
> +
> +		if (prepare_midx_pack(m, i))
> +			continue;
> +
> +		if (m->packs[i]->pack_keep)
> +			continue;
> +
> +		pack_name = xstrdup(m->packs[i]->pack_name);
> +		close_pack(m->packs[i]);
> +		FREE_AND_NULL(m->packs[i]);
> +
> +		string_list_insert(&packs_to_drop, m->pack_names[i]);
> +		unlink_pack_path(pack_name, 0);
> +		free(pack_name);
> +	}
> +
> +	free(count);
> +
> +	if (packs_to_drop.nr)
> +		result = write_midx_internal(object_dir, m, &packs_to_drop);
> +
> +	string_list_clear(&packs_to_drop, 0);
> +	return result;
>  }

This is as I expected - unlink all the files we don't want, and even
though much of the midx hasn't changed, we still need to write it
because it has a new list of packfiles.

> +test_expect_success 'expire removes unreferenced packs' '
> +	(
> +		cd dup &&
> +		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
> +		refs/heads/A
> +		^refs/heads/C
> +		EOF
> +		git multi-pack-index write &&
> +		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
> +		git multi-pack-index expire &&
> +		ls .git/objects/pack >actual &&
> +		test_cmp expect actual &&
> +		ls .git/objects/pack/ | grep idx >expect-idx &&
> +		test-tool read-midx .git/objects | grep idx >actual-midx &&
> +		test_cmp expect-idx actual-midx
> +	)
> +'

Maybe add a fsck at the end for sanity's sake. Also, I think that
preservation of .keep packfiles is an important feature, and maybe worth
a test.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 8/9] midx: implement midx_repack()
  2019-01-09 15:21     ` [PATCH v3 8/9] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2019-01-23 22:33       ` Jonathan Tan
  0 siblings, 0 replies; 89+ messages in thread
From: Jonathan Tan @ 2019-01-23 22:33 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee, Jonathan Tan

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> To repack using a multi-pack-index, first sort all pack-files by
> their modified time. Second, walk those pack-files from oldest
> to newest, adding the packs to a list if they are smaller than the
> given pack-size. Finally, collect the objects from the multi-pack-
> index that are in those packs and send them to 'git pack-objects'.

Also mention that we stop once the total is at least the batch size.

> +test_expect_success 'repack creates a new pack' '
> +	(
> +		cd dup &&
> +		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
> +		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
> +		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
> +		ls .git/objects/pack/*idx >idx-list &&
> +		test_line_count = 5 idx-list &&
> +		test-tool read-midx .git/objects | grep idx >midx-list &&
> +		test_line_count = 5 midx-list
> +	)
> +'

Can there be a test_line_count of the output of ls at the beginning, so
that we do not have to look at the previous test to confirm that 5 is
indeed greater than before?

Also add a test to check what happens when we have 3 packs below the
batch size, but taking 2 of them together is sufficient to exceed the
batch size.

> +test_expect_success 'expire removes repacked packs' '
> +	(
> +		cd dup &&
> +		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
> +		git multi-pack-index expire &&
> +		ls -S .git/objects/pack/*pack >actual &&
> +		test_cmp expect actual &&
> +		test-tool read-midx .git/objects | grep idx >midx-list &&
> +		test_line_count = 3 midx-list
> +	)
> +'

Same comment as above.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
  2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
  2019-01-09 15:56       ` SZEDER Gábor
@ 2019-01-23 22:38       ` Jonathan Tan
  2019-01-24 19:36         ` Derrick Stolee
  1 sibling, 1 reply; 89+ messages in thread
From: Jonathan Tan @ 2019-01-23 22:38 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee, Jonathan Tan

> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 6186c4c936..cc63531cc0 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -36,6 +36,17 @@ expire::
>  	have no objects referenced by the MIDX. Rewrite the MIDX file
>  	afterward to remove all references to these pack-files.
>  
> +repack::
> +	Collect a batch of pack-files whose size are all at most the
> +	size given by --batch-size, but whose sizes sum to larger
> +	than --batch-size. The batch is selected by greedily adding
> +	small pack-files starting with the oldest pack-files that fit
> +	the size. Create a new pack-file containing the objects the
> +	multi-pack-index indexes into those pack-files, and rewrite
> +	the multi-pack-index to contain that pack-file. A later run
> +	of 'git multi-pack-index expire' will delete the pack-files
> +	that were part of this batch.

I see in the subsequent patch that you stop once the batch size is
matched or exceeded - I see that you mention "whose sizes sum to larger
than --batch-size", but this leads me to think that if the total so
happens to not exceed the batch size, don't do anything, but otherwise
repack *all* the small packs together.

I would write this as:

  Create a new packfile containing the objects in the N least-sized
  packfiles referenced by the multi-pack-index, where N is the smallest
  number such that the total size of the packfiles equals or exceeds the
  given batch size. Rewrite the multi-pack-index to reference the new
  packfile instead of the N packfiles. A later run of ...

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2019-01-17 15:27     ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee
@ 2019-01-23 22:44     ` Jonathan Tan
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 89+ messages in thread
From: Jonathan Tan @ 2019-01-23 22:44 UTC (permalink / raw)
  To: gitgitgadget; +Cc: git, sbeller, peff, jrnieder, avarab, gitster, Jonathan Tan

> The multi-pack-index provides a fast way to find an object among a large
> list of pack-files. It stores a single pack-reference for each object id, so
> duplicate objects are ignored. Among a list of pack-files storing the same
> object, the most-recently modified one is used.
> 
> Create new subcommands for the multi-pack-index builtin.
> 
>  * 'git multi-pack-index expire': If we have a pack-file indexed by the
>    multi-pack-index, but all objects in that pack are duplicated in
>    more-recently modified packs, then delete that pack (and any others like
>    it). Delete the reference to that pack in the multi-pack-index.
>    
>    
>  * 'git multi-pack-index repack --batch-size=': Starting from the oldest
>    pack-files covered by the multi-pack-index, find those whose on-disk size
>    is below the batch size until we have a collection of packs whose sizes
>    add up to the batch size. Create a new pack containing all objects that
>    the multi-pack-index references to those packs.

[snip]

Thanks - as you further explain in the snipped part, this is very useful
for users of repositories that use MIDX.

I only have minor comments (that I have written in individual replies)
and the series overall looks good to me.

I personally would have squashed patches 3 and 7 (the "prepare for"
patches) into the patches that implement the respective commands,
because I'd rather not have points where the commands don't work. Having
said that, rebase is probably not going to be affected, so I don't feel
strongly about this.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 5/9] midx: refactor permutation logic and pack sorting
  2019-01-23 21:00       ` Jonathan Tan
@ 2019-01-24 17:34         ` Derrick Stolee
  2019-01-24 19:17           ` Derrick Stolee
  0 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee @ 2019-01-24 17:34 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee

On 1/23/2019 4:00 PM, Jonathan Tan wrote:
> Indeed, the sorting of pack_info is moved to after get_sorted_entries().
> Also, pack_perm[old number] = new number, as expected.

Thanks for chiming in with all the detail on the use of 'perm'. This is 
the most confusing part of this code path.

> I think a comment explaining why the perm is needed would be helpful -
> something explaining that the entries were generated using the old pack
> numbers, so we need this mapping to be able to write them using the new
> numbers.

I can put this comment in the struct definition. Is that the right place 
for it?


> As part of this review, I did attempt to make everything use the sorted
> pack_info order, but failed. I got as far as making get_sorted_entries()
> always use start_pack=0 and skip the pack_infos that didn't have the p
> pointer (to match the existing behavior of get_sorted_entries() that
> only operates on new pack_infos being added by add_pack_to_midx), but it
> still didn't work, and I didn't investigate further.

In the previous version, I tried to use 'perm' and the arrays as they 
existed previously, but that led to the bug in that version. Hopefully 
the new code is easier to read and understand. It may actually be hard
to see from the delta that it does the same work, but the code reads
more clearly (to me) after the patch is applied.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 6/9] multi-pack-index: implement 'expire' verb
  2019-01-23 22:13       ` Jonathan Tan
@ 2019-01-24 17:36         ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-01-24 17:36 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee

On 1/23/2019 5:13 PM, Jonathan Tan wrote:
> Maybe add a fsck at the end for sanity's sake. Also, I think that
> preservation of .keep packfiles is an important feature, and maybe worth
> a test.
Good points! I forgot to test the .keep stuff directly here because I 
have an equivalent test in VFS for Git, so my wires got crossed.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 5/9] midx: refactor permutation logic and pack sorting
  2019-01-24 17:34         ` Derrick Stolee
@ 2019-01-24 19:17           ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-01-24 19:17 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee

On 1/24/2019 12:34 PM, Derrick Stolee wrote:
> On 1/23/2019 4:00 PM, Jonathan Tan wrote:
>> Indeed, the sorting of pack_info is moved to after get_sorted_entries().
>> Also, pack_perm[old number] = new number, as expected.
>
> Thanks for chiming in with all the detail on the use of 'perm'. This 
> is the most confusing part of this code path.
>
>> I think a comment explaining why the perm is needed would be helpful -
>> something explaining that the entries were generated using the old pack
>> numbers, so we need this mapping to be able to write them using the new
>> numbers.
>
> I can put this comment in the struct definition. Is that the right 
> place for it?

I mistakenly thought the pack_perm array was placed into the pack_list 
struct. I'll put the comment right before we populate the contents of 
the array.
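
For anyone following along, the role of the permutation can be shown
with a small standalone sketch. It is not the code in midx.c: the
struct sorted_pack and both function names are made up for the example,
and it assumes the old ids are exactly 0..nr-1.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down pack_info. */
struct sorted_pack {
	const char *name;
	uint32_t orig_pack_int_id;
};

static int sorted_pack_cmp(const void *a, const void *b)
{
	const struct sorted_pack *pa = a, *pb = b;
	return strcmp(pa->name, pb->name);
}

/*
 * Sort packs by name and record where each old id ended up:
 * perm[old_id] = new_id. 'perm' must have room for 'nr' entries.
 */
static void build_pack_perm(struct sorted_pack *packs, uint32_t nr,
			    uint32_t *perm)
{
	uint32_t i;

	qsort(packs, nr, sizeof(*packs), sorted_pack_cmp);
	for (i = 0; i < nr; i++)
		perm[packs[i].orig_pack_int_id] = i;
}

/*
 * Object entries were generated with old pack ids; translate them to
 * the new numbering before they are written out.
 */
static void remap_entries(uint32_t *entry_pack_ids, size_t nr_entries,
			  const uint32_t *perm)
{
	size_t i;

	for (i = 0; i < nr_entries; i++)
		entry_pack_ids[i] = perm[entry_pack_ids[i]];
}
```

The point is that the sort happens after the entries are collected, so
the entries still carry the old numbers and need the mapping.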

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
  2019-01-23 22:38       ` Jonathan Tan
@ 2019-01-24 19:36         ` Derrick Stolee
  2019-01-24 21:38           ` Jonathan Tan
  0 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee @ 2019-01-24 19:36 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, sbeller, peff, jrnieder, avarab, gitster, dstolee

On 1/23/2019 5:38 PM, Jonathan Tan wrote:
>> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
>> index 6186c4c936..cc63531cc0 100644
>> --- a/Documentation/git-multi-pack-index.txt
>> +++ b/Documentation/git-multi-pack-index.txt
>> @@ -36,6 +36,17 @@ expire::
>>   	have no objects referenced by the MIDX. Rewrite the MIDX file
>>   	afterward to remove all references to these pack-files.
>>   
>> +repack::
>> +	Collect a batch of pack-files whose size are all at most the
>> +	size given by --batch-size, but whose sizes sum to larger
>> +	than --batch-size. The batch is selected by greedily adding
>> +	small pack-files starting with the oldest pack-files that fit
>> +	the size. Create a new pack-file containing the objects the
>> +	multi-pack-index indexes into those pack-files, and rewrite
>> +	the multi-pack-index to contain that pack-file. A later run
>> +	of 'git multi-pack-index expire' will delete the pack-files
>> +	that were part of this batch.
> I see in the subsequent patch that you stop once the batch size is
> matched or exceeded - I see that you mention "whose sizes sum to larger
> than --batch-size", but this leads me to think that if the total so
> happens to not exceed the batch size, don't do anything, but otherwise
> repack *all* the small packs together.
>
> I would write this as:
>
>    Create a new packfile containing the objects in the N least-sized
>    packfiles referenced by the multi-pack-index, where N is the smallest
>    number such that the total size of the packfiles equals or exceeds the
>    given batch size. Rewrite the multi-pack-index to reference the new
>    packfile instead of the N packfiles. A later run of ...

Thanks for the suggestion.

It is slightly wrong, in that we don't sort by size. Instead we sort by 
modified time. That makes it a little complicated, but I'll give it 
another shot using your framing:

         Create a new pack-file containing objects in small pack-files
         referenced by the multi-pack-index. Select the pack-files by
         examining packs from oldest-to-newest, adding a pack if its
         size is below the batch size. Stop adding packs when the sum
         of sizes of the added packs is above the batch size. If the
         total size does not reach the batch size, then do nothing.
         Rewrite the multi-pack-index to reference the new pack-file.
         A later run of 'git multi-pack-index expire' will delete the
         pack-files that were part of this batch.
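
The selection rule described above can be sketched as a standalone
function. This is only an illustration, not the code in the series: the
struct batch_pack and select_batch() are hypothetical, the input is
assumed to be ordered oldest-to-newest already, and "below the batch
size" is read as strictly smaller while the walk stops once the running
total is at least the batch size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one pack; the array is ordered oldest first. */
struct batch_pack {
	uint64_t size;
	int selected;
};

/*
 * Mark the packs that would go into the batch. Returns the number of
 * packs selected, or 0 (with nothing marked) if the small packs never
 * add up to 'batch_size'.
 */
static size_t select_batch(struct batch_pack *packs, size_t nr,
			   uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i, selected = 0;

	for (i = 0; i < nr; i++)
		packs[i].selected = 0;

	/* Stop as soon as the running total reaches the batch size. */
	for (i = 0; i < nr && total < batch_size; i++) {
		if (packs[i].size >= batch_size)
			continue;	/* not a "small" pack, skip it */
		packs[i].selected = 1;
		total += packs[i].size;
		selected++;
	}

	/* Not enough small packs: do nothing. */
	if (total < batch_size) {
		for (i = 0; i < nr; i++)
			packs[i].selected = 0;
		return 0;
	}

	return selected;
}
```

With three small packs where the first two already reach the batch
size, only those two are marked and the third is left alone.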

-Stolee

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand
  2019-01-24 19:36         ` Derrick Stolee
@ 2019-01-24 21:38           ` Jonathan Tan
  0 siblings, 0 replies; 89+ messages in thread
From: Jonathan Tan @ 2019-01-24 21:38 UTC (permalink / raw)
  To: stolee
  Cc: jonathantanmy, gitgitgadget, git, sbeller, peff, jrnieder,
	avarab, gitster, dstolee

> Thanks for the suggestion.
> 
> It is slightly wrong, in that we don't sort by size. Instead we sort by 
> modified time. That makes it a little complicated, but I'll give it 
> another shot using your framing:
> 
>          Create a new pack-file containing objects in small pack-files
>          referenced by the multi-pack-index. Select the pack-files by
>          examining packs from oldest-to-newest, adding a pack if its
>          size is below the batch size. Stop adding packs when the sum
>          of sizes of the added packs is above the batch size. If the
>          total size does not reach the batch size, then do nothing.
>          Rewrite the multi-pack-index to reference the new pack-file.
>          A later run of 'git multi-pack-index expire' will delete the
>          pack-files that were part of this batch.

Thanks, this looks good.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 01/10] repack: refactor pack deletion for future use
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 02/10] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
                         ` (11 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'force_delete' parameter is enabled for the use in repack, but
will be important for a future caller.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 14 ++------------
 packfile.c       | 28 ++++++++++++++++++++++++++++
 packfile.h       |  7 +++++++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-	int i;
 	struct strbuf buf = STRBUF_INIT;
-	size_t plen;
-
-	strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-	plen = buf.len;
-
-	for (i = 0; i < ARRAY_SIZE(exts); i++) {
-		strbuf_setlen(&buf, plen);
-		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
-	}
+	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
 	}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	int i;
+	struct strbuf buf = STRBUF_INIT;
+	size_t plen;
+
+	strbuf_addstr(&buf, pack_name);
+	strip_suffix_mem(buf.buf, &buf.len, ".pack");
+	plen = buf.len;
+
+	if (!force_delete) {
+		strbuf_addstr(&buf, ".keep");
+		if (!access(buf.buf, F_OK)) {
+			strbuf_release(&buf);
+			return;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		strbuf_setlen(&buf, plen);
+		strbuf_addstr(&buf, exts[i]);
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
                       ` (10 preceding siblings ...)
  2019-01-23 22:44     ` Jonathan Tan
@ 2019-01-24 21:51     ` " Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 01/10] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
                         ` (12 more replies)
  11 siblings, 13 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git; +Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano

The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.
   
   
 * 'git multi-pack-index repack --batch-size=': Starting from the oldest
   pack-files covered by the multi-pack-index, find those whose on-disk size
   is below the batch size until we have a collection of packs whose sizes
   add up to the batch size. Create a new pack containing all objects that
   the multi-pack-index references to those packs.
   
   

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the
last 'repack' are finished, run 'expire' again. This approach has some
advantages over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead
    of repacking all reachable objects. We can also limit ourselves to the
    objects that do not appear in newer pack-files.
    
    
 2. Highly Available. By adding a new pack-file (and not deleting the old
    pack-files) we do not interrupt concurrent Git commands, and do not
    suffer performance degradation. By expiring only pack-files that have no
    referenced objects, we know that Git commands that are doing normal
    object lookups* will not be interrupted.
    
    
    * Note: if someone concurrently runs a Git command that uses
    get_all_packs(), then that command could try to read the pack-files and
    pack-indexes that we are deleting during an expire command. Such
    commands are usually related to object maintenance (e.g. fsck, gc,
    pack-objects) or are related to less-often-used features (e.g.
    fast-import, http-backend, server-info).
    
    

We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache
servers supply packs on an hourly and daily basis, so most of the hourly
packs become useless after a new daily pack is downloaded. The 'expire'
command would clear out most of those packs, but many will still remain with
fewer than 100 objects remaining. The 'repack' command (with a batch size of
1-3 GB, probably) can condense the remaining packs in commands that run for
1-3 min at a time. Since the daily packs range from 100-250 MB, we will also
combine and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path() to remove packfiles, but with the
   additional check for a .keep file. This borrows logic from 
   builtin/repack.c.
   
   
 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)
   
   

Updates in V3:

 * There was a bug in the expire logic when simultaneously removing packs
   and adding uncovered packs, specifically around the pack permutation.
   This was hard to see during review because I was using the 'pack_perm'
   array for multiple purposes. First, I was reducing its length, and then I
   was adding to it and resorting. In V3, I significantly overhauled the
   logic here, which required some extra commits before implementing
   'expire'. The final commit includes a test that would cover this case.

Updates in V4:

 * More 'verb' and 'command' instances replaced with 'subcommand'. I grepped
   the patch to check these should be fixed everywhere.
   
   
 * Update the tests to check .keep files (in last patch).
   
   
 * Modify the tests to show the terminating condition of --batch-size when
   there are three packs that fit under the size, but the first two are
   large enough to stop adding packs. This required rearranging the packs
   slightly to get different sizes than we had before. Also, I added 'touch
   -t' to set the modified times so we can fix the order in which the packs
   are selected.
   
   
 * Added a comment about the purpose of pack_perm.
   
   

Thanks, -Stolee

Derrick Stolee (10):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files

 Documentation/git-multi-pack-index.txt |  26 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 399 ++++++++++++++++++-------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 165 ++++++++++
 8 files changed, 536 insertions(+), 119 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/92

Range-diff vs v3:

  1:  62b393b816 =  1:  62b393b816 repack: refactor pack deletion for future use
  2:  7886785904 =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
  3:  f06382b4ae !  3:  628ca46036 multi-pack-index: prepare for 'expire' subcommand
     @@ -16,7 +16,9 @@
          Add a test that verifies the 'expire' subcommand is correctly wired,
          but will still be valid when the verb is implemented. Specifically,
          create a set of packs that should all have referenced objects and
     -    should not be removed during an 'expire' operation.
     +    should not be removed during an 'expire' operation. The packs are
     +    created carefully to ensure they have a specific order when sorted
     +    by size. This will be important in a later test.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ -95,6 +97,8 @@
      +	(
      +		cd dup &&
      +		git init &&
     ++		test-tool genrandom "data" 4096 >large_file.txt &&
     ++		git update-index --add large_file.txt &&
      +		for i in $(test_seq 1 20)
      +		do
      +			test_commit $i
     @@ -104,24 +108,24 @@
      +		git branch C HEAD~13 &&
      +		git branch D HEAD~16 &&
      +		git branch E HEAD~18 &&
     -+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
     -+		refs/heads/E
     ++		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
     ++		refs/heads/A
     ++		^refs/heads/B
      +		EOF
     -+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
     -+		refs/heads/D
     -+		^refs/heads/E
     ++		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
     ++		refs/heads/B
     ++		^refs/heads/C
      +		EOF
      +		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
      +		refs/heads/C
      +		^refs/heads/D
      +		EOF
     -+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
     -+		refs/heads/B
     -+		^refs/heads/C
     ++		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
     ++		refs/heads/D
     ++		^refs/heads/E
      +		EOF
     -+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
     -+		refs/heads/A
     -+		^refs/heads/B
     ++		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
     ++		refs/heads/E
      +		EOF
      +		git multi-pack-index write
      +	)
  4:  2a763990ae !  4:  d55c1d7ee7 midx: simplify computation of pack name lengths
     @@ -12,7 +12,7 @@
          dir not already covered by the multi-pack-index.
      
          In anticipation of this becoming more complicated with the 'expire'
     -    command, simplify the computation by centralizing it to a single
     +    subcommand, simplify the computation by centralizing it to a single
          loop before writing the file.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
  5:  a0d4cc6cb3 !  5:  3950743b96 midx: refactor permutation logic and pack sorting
     @@ -282,6 +282,12 @@
       
      +	QSORT(packs.info, packs.nr, pack_info_compare);
      +
     ++	/*
     ++	 * pack_perm stores a permutation between pack-int-ids from the
     ++	 * previous multi-pack-index to the new one we are writing:
     ++	 *
     ++	 * pack_perm[old_id] = new_id
     ++	 */
      +	ALLOC_ARRAY(pack_perm, packs.nr);
      +	for (i = 0; i < packs.nr; i++) {
      +		pack_perm[packs.info[i].orig_pack_int_id] = i;
  6:  4dbff40e7a !  6:  6691d97902 multi-pack-index: implement 'expire' verb
     @@ -1,8 +1,8 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    multi-pack-index: implement 'expire' verb
     +    multi-pack-index: implement 'expire' subcommand
      
     -    The 'git multi-pack-index expire' command looks at the existing
     +    The 'git multi-pack-index expire' subcommand looks at the existing
          multi-pack-index, counts the number of objects referenced in each
          pack-file, deletes the pack-files with no referenced objects, and
          rewrites the multi-pack-index to no longer reference those packs.
     @@ -18,7 +18,7 @@
      
          Test that a new pack-file that covers the contents of two other
          pack-files leads to those pack-files being deleted during the
     -    expire command. Be sure to read the multi-pack-index to ensure
     +    expire subcommand. Be sure to read the multi-pack-index to ensure
          it no longer references those packs.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
     @@ -161,6 +161,11 @@
      +		}
      +	}
      +
     + 	/*
     + 	 * pack_perm stores a permutation between pack-int-ids from the
     + 	 * previous multi-pack-index to the new one we are writing:
     +@@
     + 	 */
       	ALLOC_ARRAY(pack_perm, packs.nr);
       	for (i = 0; i < packs.nr; i++) {
      -		pack_perm[packs.info[i].orig_pack_int_id] = i;
     @@ -273,7 +278,9 @@
      +		test_cmp expect actual &&
      +		ls .git/objects/pack/ | grep idx >expect-idx &&
      +		test-tool read-midx .git/objects | grep idx >actual-midx &&
     -+		test_cmp expect-idx actual-midx
     ++		test_cmp expect-idx actual-midx &&
     ++		git multi-pack-index verify &&
     ++		git fsck
      +	)
      +'
      +
  7:  b39f90ad09 !  7:  f5a8ff21dd multi-pack-index: prepare 'repack' subcommand
     @@ -11,7 +11,7 @@
          operation does not interrupt concurrent git commands.
      
          Introduce a 'repack' subcommand to 'git multi-pack-index' that
     -    takes a '--batch-size' option. The verb will inspect the
     +    takes a '--batch-size' option. The subcommand will inspect the
          multi-pack-index for referenced pack-files whose size is smaller
          than the batch size, until collecting a list of pack-files whose
          sizes sum to larger than the batch size. Then, a new pack-file
     @@ -26,6 +26,11 @@
          we specify a small batch size, we will guarantee that future
          implementations do not change the list of pack-files.
      
     +    In addition, we hard-code the modified times of the packs in
     +    the pack directory to ensure the list of packs sorted by modified
     +    time matches the order if sorted by size (ascending). This will
     +    be important in a future test.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
     @@ -36,15 +41,15 @@
       	afterward to remove all references to these pack-files.
       
      +repack::
     -+	Collect a batch of pack-files whose size are all at most the
     -+	size given by --batch-size, but whose sizes sum to larger
     -+	than --batch-size. The batch is selected by greedily adding
     -+	small pack-files starting with the oldest pack-files that fit
     -+	the size. Create a new pack-file containing the objects the
     -+	multi-pack-index indexes into those pack-files, and rewrite
     -+	the multi-pack-index to contain that pack-file. A later run
     -+	of 'git multi-pack-index expire' will delete the pack-files
     -+	that were part of this batch.
     ++	Create a new pack-file containing objects in small pack-files
     ++	referenced by the multi-pack-index. Select the pack-files by
     ++	examining packs from oldest-to-newest, adding a pack if its
     ++	size is below the batch size. Stop adding packs when the sum
     ++	of sizes of the added packs is above the batch size. If the
     ++	total size does not reach the batch size, then do nothing.
     ++	Rewrite the multi-pack-index to reference the new pack-file.
     ++	A later run of 'git multi-pack-index expire' will delete the
     ++	pack-files that were part of this batch.
      +
       
       EXAMPLES
     @@ -84,11 +89,18 @@
      +	if (!strcmp(argv[0], "repack"))
      +		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
      +	if (opts.batch_size)
     -+		die(_("--batch-size option is only for 'repack' verb"));
     ++		die(_("--batch-size option is only for 'repack' subcommand"));
      +
       	if (!strcmp(argv[0], "write"))
       		return write_midx_file(opts.object_dir);
       	if (!strcmp(argv[0], "verify"))
     +@@
     + 	if (!strcmp(argv[0], "expire"))
     + 		return expire_midx_packs(opts.object_dir);
     + 
     +-	die(_("unrecognized verb: %s"), argv[0]);
     ++	die(_("unrecognized subcommand: %s"), argv[0]);
     + }
      
       diff --git a/midx.c b/midx.c
       --- a/midx.c
     @@ -125,6 +137,12 @@
      +test_expect_success 'repack with minimum size does not alter existing packs' '
      +	(
      +		cd dup &&
     ++		rm -rf .git/objects/pack &&
     ++		mv .git/objects/pack-backup .git/objects/pack &&
     ++		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
     ++		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
     ++		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
     ++		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
      +		ls .git/objects/pack >expect &&
      +		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
      +		git multi-pack-index repack --batch-size=$MINSIZE &&
  8:  a4c2d5a8e1 !  8:  ba1a1c7bbb midx: implement midx_repack()
     @@ -149,6 +149,16 @@
       diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
       --- a/t/t5319-multi-pack-index.sh
       +++ b/t/t5319-multi-pack-index.sh
     +@@
     + 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
     + 		refs/heads/E
     + 		EOF
     +-		git multi-pack-index write
     ++		git multi-pack-index write &&
     ++		cp -r .git/objects/pack .git/objects/pack-backup
     + 	)
     + '
     + 
      @@
       	)
       '
     @@ -156,25 +166,28 @@
      +test_expect_success 'repack creates a new pack' '
      +	(
      +		cd dup &&
     -+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
     -+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
     -+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
      +		ls .git/objects/pack/*idx >idx-list &&
      +		test_line_count = 5 idx-list &&
     ++		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
     ++		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
     ++		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
     ++		ls .git/objects/pack/*idx >idx-list &&
     ++		test_line_count = 6 idx-list &&
      +		test-tool read-midx .git/objects | grep idx >midx-list &&
     -+		test_line_count = 5 midx-list
     ++		test_line_count = 6 midx-list
      +	)
      +'
      +
      +test_expect_success 'expire removes repacked packs' '
      +	(
      +		cd dup &&
     -+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
     ++		ls -al .git/objects/pack/*pack &&
     ++		ls -S .git/objects/pack/*pack | head -n 4 >expect &&
      +		git multi-pack-index expire &&
      +		ls -S .git/objects/pack/*pack >actual &&
      +		test_cmp expect actual &&
      +		test-tool read-midx .git/objects | grep idx >midx-list &&
     -+		test_line_count = 3 midx-list
     ++		test_line_count = 4 midx-list
      +	)
      +'
      +
  9:  b97fb35ba9 =  9:  b1c6892417 multi-pack-index: test expire while adding packs
  -:  ---------- > 10:  481b08890f midx: add test that 'expire' respects .keep files
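
The revised 'repack' documentation in the range-diff above describes a
greedy selection: walk the packs from oldest to newest, add each pack
whose size is below the batch size, and stop once the added sizes reach
the batch size. A minimal sketch of that selection follows, assuming the
caller already has the pack sizes in oldest-to-newest order. The name
select_repack_batch() and its parameters are hypothetical and are not
taken from the patches, and the exact boundary comparisons ('<' versus
'<=') are an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the series.
 *
 * 'sizes' lists pack sizes from oldest to newest. Sets include[i] = 1
 * for each selected pack and returns the number selected, or 0 (with
 * 'include' cleared) when the batch never reaches 'batch_size'.
 */
size_t select_repack_batch(const size_t *sizes, size_t nr,
			   size_t batch_size, unsigned char *include)
{
	size_t i, total = 0, selected = 0;

	memset(include, 0, nr);
	for (i = 0; i < nr && total < batch_size; i++) {
		if (sizes[i] >= batch_size)
			continue; /* too large to be part of this batch */
		include[i] = 1;
		total += sizes[i];
		selected++;
	}

	/* "If the total size does not reach the batch size, do nothing." */
	if (total < batch_size) {
		memset(include, 0, nr);
		return 0;
	}
	return selected;
}
```

With sizes {100, 500, 200, 300, 50} and a batch size of 250, this sketch
selects the first and third packs and then stops, because their sizes
already sum past the batch size.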

-- 
gitgitgadget


* [PATCH v4 02/10] Docs: rearrange subcommands for multi-pack-index
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 01/10] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 03/10] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
                         ` (10 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -23,13 +23,13 @@ OPTIONS
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-	When given as the verb, write a new MIDX file to
-	`<dir>/packs/multi-pack-index`.
+	Write a new MIDX file.
 
 verify::
-	When given as the verb, verify the contents of the MIDX file
-	at `<dir>/packs/multi-pack-index`.
+	Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
gitgitgadget



* [PATCH v4 03/10] multi-pack-index: prepare for 'expire' subcommand
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 01/10] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 02/10] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 04/10] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
                         ` (9 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but that will still be valid once the subcommand is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation. The packs are
created carefully to ensure they have a specific order when sorted
by size. This will be important in a later test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 49 ++++++++++++++++++++++++++
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
 	Verify the contents of the MIDX file.
 
+expire::
+	Delete the pack-files that are tracked by the MIDX file, but
+	have no objects referenced by the MIDX. Rewrite the MIDX file
+	afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..a8528f7da0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,53 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		test-tool genrandom "data" 4096 >large_file.txt &&
+		git update-index --add large_file.txt &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 05/10] midx: refactor permutation logic and pack sorting
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2019-01-24 21:51       ` [PATCH v4 04/10] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().
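
The relationship between the name-sorted order and the original
pack-int-ids can be sketched in isolation. The following is a simplified
illustration rather than the code in midx.c: it uses plain qsort() and
malloc() instead of git's QSORT() and ALLOC_ARRAY(), and the names
pack_info_sketch and build_pack_perm() are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the patch's 'struct pack_info'. */
struct pack_info_sketch {
	uint32_t orig_pack_int_id;
	const char *pack_name;
};

int pack_info_sketch_compare(const void *_a, const void *_b)
{
	const struct pack_info_sketch *a = _a;
	const struct pack_info_sketch *b = _b;
	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort 'info' by pack name, then return a newly allocated array with
 * perm[old_id] = new_id. The caller frees the result.
 */
uint32_t *build_pack_perm(struct pack_info_sketch *info, uint32_t nr)
{
	uint32_t i;
	uint32_t *perm = malloc(nr * sizeof(*perm));

	if (!perm)
		return NULL;

	qsort(info, nr, sizeof(*info), pack_info_sketch_compare);
	for (i = 0; i < nr; i++)
		perm[info[i].orig_pack_int_id] = i;
	return perm;
}
```

Because the object entries keep their original pack-int-id until they
are written, the lookup perm[obj->pack_int_id] is the only place where
the new ordering needs to be applied.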

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 156 +++++++++++++++++++++++++--------------------------------
 1 file changed, 69 insertions(+), 87 deletions(-)

diff --git a/midx.c b/midx.c
index f087bbbe82..95c39106b2 100644
--- a/midx.c
+++ b/midx.c
@@ -377,12 +377,23 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_info {
+	uint32_t orig_pack_int_id;
+	char *pack_name;
+	struct packed_git *p;
+};
+
+static int pack_info_compare(const void *_a, const void *_b)
+{
+	struct pack_info *a = (struct pack_info *)_a;
+	struct pack_info *b = (struct pack_info *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
 struct pack_list {
-	struct packed_git **list;
-	char **names;
+	struct pack_info *info;
 	uint32_t nr;
-	uint32_t alloc_list;
-	uint32_t alloc_names;
+	uint32_t alloc;
 	struct multi_pack_index *m;
 };
 
@@ -395,66 +406,32 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		if (packs->m && midx_contains_pack(packs->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
-		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
 
-		packs->list[packs->nr] = add_packed_git(full_path,
-							full_path_len,
-							0);
+		packs->info[packs->nr].p = add_packed_git(full_path,
+							  full_path_len,
+							  0);
 
-		if (!packs->list[packs->nr]) {
+		if (!packs->info[packs->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->list[packs->nr])) {
+		if (open_pack_index(packs->info[packs->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->list[packs->nr]);
-			FREE_AND_NULL(packs->list[packs->nr]);
+			close_pack(packs->info[packs->nr].p);
+			FREE_AND_NULL(packs->info[packs->nr].p);
 			return;
 		}
 
-		packs->names[packs->nr] = xstrdup(file_name);
+		packs->info[packs->nr].pack_name = xstrdup(file_name);
+		packs->info[packs->nr].orig_pack_int_id = packs->nr;
 		packs->nr++;
 	}
 }
 
-struct pack_pair {
-	uint32_t pack_int_id;
-	char *pack_name;
-};
-
-static int pack_pair_compare(const void *_a, const void *_b)
-{
-	struct pack_pair *a = (struct pack_pair *)_a;
-	struct pack_pair *b = (struct pack_pair *)_b;
-	return strcmp(a->pack_name, b->pack_name);
-}
-
-static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
-{
-	uint32_t i;
-	struct pack_pair *pairs;
-
-	ALLOC_ARRAY(pairs, nr_packs);
-
-	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
-		pairs[i].pack_name = pack_names[i];
-	}
-
-	QSORT(pairs, nr_packs, pack_pair_compare);
-
-	for (i = 0; i < nr_packs; i++) {
-		pack_names[i] = pairs[i].pack_name;
-		perm[pairs[i].pack_int_id] = i;
-	}
-
-	free(pairs);
-}
-
 struct pack_midx_entry {
 	struct object_id oid;
 	uint32_t pack_int_id;
@@ -480,7 +457,6 @@ static int midx_oid_compare(const void *_a, const void *_b)
 }
 
 static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
-				      uint32_t *pack_perm,
 				      struct pack_midx_entry *e,
 				      uint32_t pos)
 {
@@ -488,7 +464,7 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
 		return 1;
 
 	nth_midxed_object_oid(&e->oid, m, pos);
-	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->pack_int_id = nth_midxed_pack_int_id(m, pos);
 	e->offset = nth_midxed_offset(m, pos);
 
 	/* consider objects in midx to be from "old" packs */
@@ -522,8 +498,7 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * of a packfile containing the object).
  */
 static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
-						  struct packed_git **p,
-						  uint32_t *perm,
+						  struct pack_info *info,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
 {
@@ -534,7 +509,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	uint32_t start_pack = m ? m->num_packs : 0;
 
 	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++)
-		total_objects += p[cur_pack]->num_objects;
+		total_objects += info[cur_pack].p->num_objects;
 
 	/*
 	 * As we de-duplicate by fanout value, we expect the fanout
@@ -559,7 +534,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				nth_midxed_pack_midx_entry(m, perm,
+				nth_midxed_pack_midx_entry(m,
 							   &entries_by_fanout[nr_fanout],
 							   cur_object);
 				nr_fanout++;
@@ -570,12 +545,12 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
-				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
-			end = get_pack_fanout(p[cur_pack], cur_fanout);
+				start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1);
+			end = get_pack_fanout(info[cur_pack].p, cur_fanout);
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, &entries_by_fanout[nr_fanout]);
 				nr_fanout++;
 			}
 		}
@@ -604,7 +579,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 }
 
 static size_t write_midx_pack_names(struct hashfile *f,
-				    char **pack_names,
+				    struct pack_info *info,
 				    uint32_t num_packs)
 {
 	uint32_t i;
@@ -612,14 +587,14 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(pack_names[i]) + 1;
+		size_t writelen = strlen(info[i].pack_name) + 1;
 
-		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    pack_names[i - 1],
-			    pack_names[i]);
+			    info[i - 1].pack_name,
+			    info[i].pack_name);
 
-		hashwrite(f, pack_names[i], writelen);
+		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -690,6 +665,7 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 }
 
 static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					uint32_t *perm,
 					struct pack_midx_entry *objects, uint32_t nr_objects)
 {
 	struct pack_midx_entry *list = objects;
@@ -699,7 +675,7 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		hashwrite_be32(f, obj->pack_int_id);
+		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
@@ -772,20 +748,17 @@ int write_midx_file(const char *object_dir)
 	packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
-	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
-	packs.list = NULL;
-	packs.names = NULL;
-	ALLOC_ARRAY(packs.list, packs.alloc_list);
-	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	packs.alloc = packs.m ? packs.m->num_packs : 16;
+	packs.info = NULL;
+	ALLOC_ARRAY(packs.info, packs.alloc);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
-			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
 
-			packs.list[packs.nr] = NULL;
-			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].orig_pack_int_id = i;
+			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].p = NULL;
 			packs.nr++;
 		}
 	}
@@ -795,10 +768,7 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
-
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -807,8 +777,21 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	QSORT(packs.info, packs.nr, pack_info_compare);
+
+	/*
+	 * pack_perm stores a permutation between pack-int-ids from the
+	 * previous multi-pack-index to the new one we are writing:
+	 *
+	 * pack_perm[old_id] = new_id
+	 */
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	for (i = 0; i < packs.nr; i++) {
+		pack_perm[packs.info[i].orig_pack_int_id] = i;
+	}
+
 	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.names[i]) + 1;
+		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -879,7 +862,7 @@ int write_midx_file(const char *object_dir)
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.names, packs.nr);
+				written += write_midx_pack_names(f, packs.info, packs.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -891,7 +874,7 @@ int write_midx_file(const char *object_dir)
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -914,15 +897,14 @@ int write_midx_file(const char *object_dir)
 
 cleanup:
 	for (i = 0; i < packs.nr; i++) {
-		if (packs.list[i]) {
-			close_pack(packs.list[i]);
-			free(packs.list[i]);
+		if (packs.info[i].p) {
+			close_pack(packs.info[i].p);
+			free(packs.info[i].p);
 		}
-		free(packs.names[i]);
+		free(packs.info[i].pack_name);
 	}
 
-	free(packs.list);
-	free(packs.names);
+	free(packs.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH v4 04/10] midx: simplify computation of pack name lengths
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2019-01-24 21:51       ` [PATCH v4 03/10] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:51       ` [PATCH v4 05/10] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
subcommand, simplify the computation by centralizing it to a single
loop before writing the file.
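
As a standalone illustration of the centralized loop, the sketch below
sums the NUL-terminated name lengths and then pads the result up to the
chunk alignment. The function name pack_name_chunk_len() is
hypothetical, and the alignment value of 4 is an assumption made for
this sketch; the patch itself uses MIDX_CHUNK_ALIGNMENT.

```c
#include <stddef.h>
#include <string.h>

/* Assumed alignment for this sketch; the patch uses MIDX_CHUNK_ALIGNMENT. */
#define SKETCH_CHUNK_ALIGNMENT 4

/*
 * Hypothetical helper: length of the pack name chunk, i.e. every name
 * plus its terminating NUL, rounded up to the alignment.
 */
size_t pack_name_chunk_len(const char *const *names, size_t nr)
{
	size_t i, len = 0;

	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	if (len % SKETCH_CHUNK_ALIGNMENT)
		len += SKETCH_CHUNK_ALIGNMENT - (len % SKETCH_CHUNK_ALIGNMENT);
	return len;
}
```

Doing this in one pass over the final pack list means the total does not
have to be adjusted when packs are later dropped from that list.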

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..f087bbbe82 100644
--- a/midx.c
+++ b/midx.c
@@ -383,7 +383,6 @@ struct pack_list {
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
-	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
 
@@ -418,7 +417,6 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		}
 
 		packs->names[packs->nr] = xstrdup(file_name);
-		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
 	}
 }
@@ -762,6 +760,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int pack_name_concat_len = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -777,7 +776,6 @@ int write_midx_file(const char *object_dir)
 	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
-	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
@@ -788,7 +786,6 @@ int write_midx_file(const char *object_dir)
 
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
-			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
 	}
@@ -798,10 +795,6 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
-		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
-					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
-
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
@@ -814,6 +807,13 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	for (i = 0; i < packs.nr; i++)
+		pack_name_concat_len += strlen(packs.names[i]) + 1;
+
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -831,7 +831,7 @@ int write_midx_file(const char *object_dir)
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-- 
gitgitgadget



* [PATCH v4 06/10] multi-pack-index: implement 'expire' subcommand
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2019-01-24 21:51       ` [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:52       ` [PATCH v4 08/10] midx: implement midx_repack() Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git multi-pack-index expire' subcommand looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.
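
As a rough illustration of the counting step (a standalone sketch,
not the code in this patch), the per-pack tally can be written as
below. The helper name and its parameters are hypothetical; the real
code reads the pack-int-id of each object from the multi-pack-index
and additionally skips packs that have a .keep file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of midx.c: given the pack-int-id that
 * the index records for each object, mark every pack that no object
 * refers to. Returns the number of packs marked.
 */
static uint32_t mark_unreferenced_packs(const uint32_t *object_pack_ids,
					size_t nr_objects,
					uint32_t nr_packs,
					uint32_t *count,
					unsigned char *unreferenced)
{
	uint32_t i, nr_unreferenced = 0;
	size_t j;

	for (i = 0; i < nr_packs; i++)
		count[i] = 0;

	/* One pass over the objects: tally references per pack. */
	for (j = 0; j < nr_objects; j++) {
		assert(object_pack_ids[j] < nr_packs);
		count[object_pack_ids[j]]++;
	}

	/* A pack with a zero tally is a candidate for deletion. */
	for (i = 0; i < nr_packs; i++) {
		unreferenced[i] = (count[i] == 0);
		nr_unreferenced += unreferenced[i];
	}
	return nr_unreferenced;
}
```

Because the multi-pack-index stores one pack reference per object,
a single pass over the objects is enough to find every pack that
has become redundant.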

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.
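
The matching of pack names against the drop list can be pictured as
a merge over two sorted arrays. The following is a minimal sketch
under that assumption, with hypothetical names; unlike the loop in
this patch, it also counts drop-list entries that remain after the
pack list is exhausted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: both arrays must be sorted in strcmp() order.
 * Sets expired[i] = 1 for each pack name that also appears in the drop
 * list and returns how many drop-list names were not found.
 */
static size_t mark_dropped_packs(const char **pack_names, size_t nr_packs,
				 const char **to_drop, size_t nr_drop,
				 unsigned char *expired)
{
	size_t i = 0, d = 0, missing = 0;

	assert(expired || !nr_packs);
	if (nr_packs)
		memset(expired, 0, nr_packs);

	while (i < nr_packs && d < nr_drop) {
		int cmp = strcmp(pack_names[i], to_drop[d]);

		if (!cmp) {
			/* Same name on both sides: this pack is dropped. */
			expired[i] = 1;
			i++;
			d++;
		} else if (cmp > 0) {
			/* to_drop[d] sorts before every remaining pack. */
			missing++;
			d++;
		} else {
			/* This pack is not in the drop list; keep it. */
			i++;
		}
	}

	/* Drop-list names left over after the packs ran out. */
	missing += nr_drop - d;
	return missing;
}
```

A non-zero return value corresponds to the case where the patch
reports "did not see pack-file %s to drop" and refuses to write a
new multi-pack-index.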

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire subcommand. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 120 +++++++++++++++++++++++++++++++++---
 t/t5319-multi-pack-index.sh |  20 ++++++
 2 files changed, 130 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 95c39106b2..299e9b2e8f 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,8 @@
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+#define PACK_EXPIRED UINT_MAX
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -381,6 +383,7 @@ struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
+	unsigned expired : 1;
 };
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -428,6 +431,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		packs->info[packs->nr].pack_name = xstrdup(file_name);
 		packs->info[packs->nr].orig_pack_int_id = packs->nr;
+		packs->info[packs->nr].expired = 0;
 		packs->nr++;
 	}
 }
@@ -587,13 +591,17 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(info[i].pack_name) + 1;
+		size_t writelen;
+
+		if (info[i].expired)
+			continue;
 
 		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
 			    info[i - 1].pack_name,
 			    info[i].pack_name);
 
+		writelen = strlen(info[i].pack_name) + 1;
 		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
@@ -675,6 +683,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
+		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+			BUG("object %s is in an expired pack with int-id %d",
+			    oid_to_hex(&obj->oid),
+			    obj->pack_int_id);
+
 		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
@@ -721,7 +734,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -737,6 +751,8 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
+	int dropped_packs = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc = packs.m ? packs.m->num_packs : 16;
@@ -759,13 +778,14 @@ int write_midx_file(const char *object_dir)
 			packs.info[packs.nr].orig_pack_int_id = i;
 			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
 			packs.info[packs.nr].p = NULL;
+			packs.info[packs.nr].expired = 0;
 			packs.nr++;
 		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
-	if (packs.m && packs.nr == packs.m->num_packs)
+	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
 	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
@@ -779,6 +799,34 @@ int write_midx_file(const char *object_dir)
 
 	QSORT(packs.info, packs.nr, pack_info_compare);
 
+	if (packs_to_drop && packs_to_drop->nr) {
+		int drop_index = 0;
+		int missing_drops = 0;
+
+		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(packs.info[i].pack_name,
+					 packs_to_drop->items[drop_index].string);
+
+			if (!cmp) {
+				drop_index++;
+				packs.info[i].expired = 1;
+			} else if (cmp > 0) {
+				error(_("did not see pack-file %s to drop"),
+				      packs_to_drop->items[drop_index].string);
+				drop_index++;
+				missing_drops++;
+				i--;
+			} else {
+				packs.info[i].expired = 0;
+			}
+		}
+
+		if (missing_drops) {
+			result = 1;
+			goto cleanup;
+		}
+	}
+
 	/*
 	 * pack_perm stores a permutation between pack-int-ids from the
 	 * previous multi-pack-index to the new one we are writing:
@@ -787,11 +835,18 @@ int write_midx_file(const char *object_dir)
 	 */
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	for (i = 0; i < packs.nr; i++) {
-		pack_perm[packs.info[i].orig_pack_int_id] = i;
+		if (packs.info[i].expired) {
+			dropped_packs++;
+			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+		} else {
+			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+		}
 	}
 
-	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < packs.nr; i++) {
+		if (!packs.info[i].expired)
+			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -807,7 +862,7 @@ int write_midx_file(const char *object_dir)
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	written = write_midx_header(f, num_chunks, packs.nr);
+	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -908,7 +963,12 @@ int write_midx_file(const char *object_dir)
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1010,5 +1070,45 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	for (i = 0; i < m->num_packs; i++) {
+		char *pack_name;
+
+		if (count[i])
+			continue;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		if (m->packs[i]->pack_keep)
+			continue;
+
+		pack_name = xstrdup(m->packs[i]->pack_name);
+		close_pack(m->packs[i]);
+		FREE_AND_NULL(m->packs[i]);
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+		unlink_pack_path(pack_name, 0);
+		free(pack_name);
+	}
+
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index a8528f7da0..65e85debec 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -397,4 +397,24 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual &&
+		ls .git/objects/pack/ | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		test_cmp expect-idx actual-midx &&
+		git multi-pack-index verify &&
+		git fsck
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2019-01-24 21:51       ` [PATCH v4 05/10] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
@ 2019-01-24 21:51       ` Derrick Stolee via GitGitGadget
  2019-01-25 23:24         ` Josh Steadmon
  2019-01-24 21:51       ` [PATCH v4 06/10] multi-pack-index: implement 'expire' subcommand Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  12 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:51 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The multi-pack-index is most useful in an environment with many
pack-files where the object store cannot be repacked into a
single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The subcommand will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

In addition, we hard-code the modified times of the packs in
the pack directory to ensure the list of packs sorted by modified
time matches the order if sorted by size (ascending). This will
be important in a future test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 11 +++++++++++
 builtin/multi-pack-index.c             | 12 ++++++++++--
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 17 +++++++++++++++++
 5 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..de345c2400 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,17 @@ expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Create a new pack-file containing objects in small pack-files
+	referenced by the multi-pack-index. Select the pack-files by
+	examining packs from oldest-to-newest, adding a pack if its
+	size is below the batch size. Stop adding packs when the sum
+	of sizes of the added packs is above the batch size. If the
+	total size does not reach the batch size, then do nothing.
+	Rewrite the multi-pack-index to reference the new pack-file.
+	A later run of 'git multi-pack-index expire' will delete the
+	pack-files that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..c66239de33 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' subcommand"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
@@ -47,5 +55,5 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!strcmp(argv[0], "expire"))
 		return expire_midx_packs(opts.object_dir);
 
-	die(_("unrecognized verb: %s"), argv[0]);
+	die(_("unrecognized subcommand: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 299e9b2e8f..768a7dff73 100644
--- a/midx.c
+++ b/midx.c
@@ -1112,3 +1112,8 @@ int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 65e85debec..acc5e65ecc 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -417,4 +417,21 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		rm -rf .git/objects/pack &&
+		mv .git/objects/pack-backup .git/objects/pack &&
+		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
+		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
+		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
+		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 08/10] midx: implement midx_repack()
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2019-01-24 21:51       ` [PATCH v4 06/10] multi-pack-index: implement 'expire' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-24 21:52       ` Derrick Stolee via GitGitGadget
  2019-01-26 17:10         ` Derrick Stolee
  2019-01-24 21:52       ` [PATCH v4 09/10] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  12 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:52 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given batch size, and stop once the listed packs sum to at least
the batch size. Finally, collect the objects from the
multi-pack-index that are in those packs and send them to
'git pack-objects'.
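
The selection step can be sketched in standalone C as follows. This
is an illustration under stated assumptions, not the implementation
in this patch: 'struct pack_meta', 'by_mtime' and 'select_batch' are
hypothetical names, and the pack-int-ids are assumed to be
0..nr-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-pack data the repack step reads. */
struct pack_meta {
	int64_t mtime;	/* modified time, in seconds */
	size_t size;	/* on-disk size of the pack-file */
	uint32_t id;	/* pack-int-id, assumed to be below nr */
};

static int by_mtime(const void *a_, const void *b_)
{
	const struct pack_meta *a = a_, *b = b_;

	if (a->mtime < b->mtime)
		return -1;
	if (a->mtime > b->mtime)
		return 1;
	return 0;
}

/*
 * Sorts packs[] oldest first, then sets include[id] = 1 for each chosen
 * pack. Returns the number of packs chosen, or 0 (with include[] all
 * zero) when fewer than two packs qualify or their sizes do not reach
 * batch_size.
 */
static uint32_t select_batch(struct pack_meta *packs, uint32_t nr,
			     size_t batch_size, unsigned char *include)
{
	uint32_t i, chosen = 0;
	size_t total = 0;

	for (i = 0; i < nr; i++)
		include[i] = 0;
	if (nr)
		qsort(packs, nr, sizeof(*packs), by_mtime);

	/* Walk oldest to newest; stop once the batch is full. */
	for (i = 0; total < batch_size && i < nr; i++) {
		if (packs[i].size >= batch_size)
			continue;	/* too large to be part of a batch */
		assert(packs[i].id < nr);
		include[packs[i].id] = 1;
		total += packs[i].size;
		chosen++;
	}

	if (total < batch_size || chosen < 2) {
		for (i = 0; i < nr; i++)
			include[i] = 0;
		return 0;
	}
	return chosen;
}
```

With a batch size equal to the smallest pack size, no pack is
strictly smaller than the batch size, so nothing is selected. That
is the situation the 'repack with minimum size' test relies on.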

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 109 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  31 +++++++++-
 2 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 768a7dff73..7d81bf943e 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1113,7 +1114,113 @@ int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct time_and_id {
+	timestamp_t mtime;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct time_and_id *a, *b;
+
+	a = (const struct time_and_id *)a_;
+	b = (const struct time_and_id *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
 	return 0;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct time_and_id *pack_ti;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_ti[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_ti[i].mtime = m->packs[i]->mtime;
+	}
+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_ti[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+
+		if (!p)
+			continue;
+		if (p->pack_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += p->pack_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	free(pack_ti);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index acc5e65ecc..d6c1353514 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -383,7 +383,8 @@ test_expect_success 'setup expire tests' '
 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
 		refs/heads/E
 		EOF
-		git multi-pack-index write
+		git multi-pack-index write &&
+		cp -r .git/objects/pack .git/objects/pack-backup
 	)
 '
 
@@ -434,4 +435,32 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 6 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 6 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -al .git/objects/pack/*pack &&
+		ls -S .git/objects/pack/*pack | head -n 4 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 4 midx-list
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 09/10] multi-pack-index: test expire while adding packs
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2019-01-24 21:52       ` [PATCH v4 08/10] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2019-01-24 21:52       ` Derrick Stolee via GitGitGadget
  2019-01-24 21:52       ` [PATCH v4 10/10] midx: add test that 'expire' respects .keep files Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:52 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.
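
To see why the position of a new pack matters, consider the
renumbering of pack-int-ids when packs are sorted by name and some
are expired. The sketch below is a standalone illustration with
hypothetical names ('compute_pack_perm', 'EXPIRED'); it mirrors the
idea of the pack_perm array but is not the code from this series.

```c
#include <assert.h>
#include <stdint.h>

#define EXPIRED UINT32_MAX	/* sentinel, plays the role of PACK_EXPIRED */

/*
 * Hypothetical sketch. orig_id[i] and expired[i] describe the i-th pack
 * after sorting by name; perm[] is indexed by the old pack-int-id and
 * receives the new one. Returns the number of packs that remain.
 */
static uint32_t compute_pack_perm(const uint32_t *orig_id,
				  const unsigned char *expired,
				  uint32_t nr, uint32_t *perm)
{
	uint32_t i, dropped = 0;

	for (i = 0; i < nr; i++) {
		assert(orig_id[i] < nr);
		if (expired[i]) {
			perm[orig_id[i]] = EXPIRED;
			dropped++;
		} else {
			/* Shift down by the expired packs sorted before it. */
			perm[orig_id[i]] = i - dropped;
		}
	}
	return nr - dropped;
}
```

For example, suppose the old index knows pack-B, pack-C and pack-D
as ids 0, 1 and 2, and two new packs named a-pack and z-pack are
picked up as ids 3 and 4. Sorted by name, the order is a-pack,
pack-B, pack-C, pack-D, z-pack. If pack-C is expired, a-pack takes
new id 0, pack-B moves from 0 to 1, pack-D stays at 2, and z-pack
becomes 3. Every surviving old pack that sorts after a-pack has to
account for it, which is the kind of shift the new test exercises.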

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index d6c1353514..19b769eea0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -463,4 +463,36 @@ test_expect_success 'expire removes repacked packs' '
 	)
 '
 
+test_expect_success 'expire works when adding new packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index expire &&
+		ls .git/objects/pack/ | grep idx >expect &&
+		test-tool read-midx .git/objects | grep idx >actual &&
+		test_cmp expect actual &&
+		git multi-pack-index verify
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v4 10/10] midx: add test that 'expire' respects .keep files
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2019-01-24 21:52       ` [PATCH v4 09/10] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
@ 2019-01-24 21:52       ` Derrick Stolee via GitGitGadget
  2019-01-24 22:14       ` [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index Jonathan Tan
                         ` (2 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-01-24 21:52 UTC (permalink / raw)
  To: git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git multi-pack-index expire' subcommand may delete packs that
are not needed from the perspective of the multi-pack-index. If
a pack has a .keep file, then we should not delete that pack. Add
a test that ensures we preserve a pack that would otherwise be
expired. First, create a new pack that contains every object in
the repo, then add it to the multi-pack-index. Then create a .keep
file for a pack starting with "a-pack" that was added in the
previous test. Finally, expire and verify that the pack remains
and the other packs were expired.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 19b769eea0..bcfa520401 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -495,4 +495,22 @@ test_expect_success 'expire works when adding new packs' '
 	)
 '
 
+test_expect_success 'expire respects .keep files' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-all <<-EOF &&
+		refs/heads/A
+		EOF
+		git multi-pack-index write &&
+		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
+		touch $PACKA.keep &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
+		test_line_count = 3 a-pack-files &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 2 midx-list
+	)
+'
+
+
 test_done
-- 
gitgitgadget


* Re: [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (9 preceding siblings ...)
  2019-01-24 21:52       ` [PATCH v4 10/10] midx: add test that 'expire' respects .keep files Derrick Stolee via GitGitGadget
@ 2019-01-24 22:14       ` Jonathan Tan
  2019-01-25 23:49       ` Josh Steadmon
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
  12 siblings, 0 replies; 89+ messages in thread
From: Jonathan Tan @ 2019-01-24 22:14 UTC (permalink / raw)
  To: gitgitgadget; +Cc: git, sbeller, peff, jrnieder, avarab, jonathantanmy, gitster

> Updates in V4:
> 
>  * More 'verb' and 'command' instances replaced with 'subcommand'. I grepped
>    the patch to check these should be fixed everywhere.
>    
>    
>  * Update the tests to check .keep files (in last patch).
>    
>    
>  * Modify the tests to show the terminating condition of --batch-size when
>    there are three packs that fit under the size, but the first two are
>    large enough to stop adding packs. This required rearranging the packs
>    slightly to get different sizes than we had before. Also, I added 'touch
>    -t' to set the modified times so we can fix the order in which the packs
>    are selected.
>    
>    
>  * Added a comment about the purpose of pack_perm.

Thanks, the interdiff and patch 10 look good to me (I already reviewed
V3). I also verified that in the last test, if there is no .keep file,
the test fails as expected.


* Re: [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand
  2019-01-24 21:51       ` [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
@ 2019-01-25 23:24         ` Josh Steadmon
  0 siblings, 0 replies; 89+ messages in thread
From: Josh Steadmon @ 2019-01-25 23:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, jonathantanmy,
	Junio C Hamano, Derrick Stolee

On 2019.01.24 13:51, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> In an environment where the multi-pack-index is useful, it is due
> to many pack-files and an inability to repack the object store
> into a single pack-file. However, it is likely that many of these
> pack-files are rather small, and could be repacked into a slightly
> larger pack-file without too much effort. It may also be important
> to ensure the object store is highly available and the repack
> operation does not interrupt concurrent git commands.
> 
> Introduce a 'repack' subcommand to 'git multi-pack-index' that
> takes a '--batch-size' option. The subcommand will inspect the
> multi-pack-index for referenced pack-files whose size is smaller
> than the batch size, until collecting a list of pack-files whose
> sizes sum to larger than the batch size. Then, a new pack-file
> will be created containing the objects from those pack-files that
> are referenced by the multi-pack-index. The resulting pack is
> likely to actually be smaller than the batch size due to
> compression and the fact that there may be objects in the pack-
> files that have duplicate copies in other pack-files.
> 
> The current change introduces the command-line arguments, and we
> add a test that ensures we parse these options properly. Since
> we specify a small batch size, we will guarantee that future
> implementations do not change the list of pack-files.
> 
> In addition, we hard-code the modified times of the packs in
> the pack directory to ensure the list of packs sorted by modified
> time matches the order if sorted by size (ascending). This will
> be important in a future test.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-multi-pack-index.txt | 11 +++++++++++
>  builtin/multi-pack-index.c             | 12 ++++++++++--
>  midx.c                                 |  5 +++++
>  midx.h                                 |  1 +
>  t/t5319-multi-pack-index.sh            | 17 +++++++++++++++++
>  5 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 6186c4c936..de345c2400 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -36,6 +36,17 @@ expire::
>  	have no objects referenced by the MIDX. Rewrite the MIDX file
>  	afterward to remove all references to these pack-files.
>  
> +repack::
> +	Create a new pack-file containing objects in small pack-files
> +	referenced by the multi-pack-index. Select the pack-files by
> +	examining packs from oldest-to-newest, adding a pack if its
> +	size is below the batch size. Stop adding packs when the sum
> +	of sizes of the added packs is above the batch size. If the
> +	total size does not reach the batch size, then do nothing.
> +	Rewrite the multi-pack-index to reference the new pack-file.
> +	A later run of 'git multi-pack-index expire' will delete the
> +	pack-files that were part of this batch.
> +
>  
>  EXAMPLES
>  --------
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 145de3a46c..c66239de33 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -5,12 +5,13 @@
>  #include "midx.h"
>  
>  static char const * const builtin_multi_pack_index_usage[] = {
> -	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
> +	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
>  	NULL
>  };
>  
>  static struct opts_multi_pack_index {
>  	const char *object_dir;
> +	unsigned long batch_size;
>  } opts;
>  
>  int cmd_multi_pack_index(int argc, const char **argv,
> @@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
>  	static struct option builtin_multi_pack_index_options[] = {
>  		OPT_FILENAME(0, "object-dir", &opts.object_dir,
>  		  N_("object directory containing set of packfile and pack-index pairs")),
> +		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
> +		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
>  		OPT_END(),
>  	};
>  
> @@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
>  		return 1;
>  	}
>  
> +	if (!strcmp(argv[0], "repack"))
> +		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
> +	if (opts.batch_size)
> +		die(_("--batch-size option is only for 'repack' subcommand"));
> +
>  	if (!strcmp(argv[0], "write"))
>  		return write_midx_file(opts.object_dir);
>  	if (!strcmp(argv[0], "verify"))
> @@ -47,5 +55,5 @@ int cmd_multi_pack_index(int argc, const char **argv,
>  	if (!strcmp(argv[0], "expire"))
>  		return expire_midx_packs(opts.object_dir);
>  
> -	die(_("unrecognized verb: %s"), argv[0]);
> +	die(_("unrecognized subcommand: %s"), argv[0]);
>  }
> diff --git a/midx.c b/midx.c
> index 299e9b2e8f..768a7dff73 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1112,3 +1112,8 @@ int expire_midx_packs(const char *object_dir)
>  	string_list_clear(&packs_to_drop, 0);
>  	return result;
>  }
> +
> +int midx_repack(const char *object_dir, size_t batch_size)
> +{
> +	return 0;
> +}
> diff --git a/midx.h b/midx.h
> index e3a2b740b5..394a21ee96 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
>  void clear_midx_file(struct repository *r);
>  int verify_midx_file(const char *object_dir);
>  int expire_midx_packs(const char *object_dir);
> +int midx_repack(const char *object_dir, size_t batch_size);
>  
>  void close_midx(struct multi_pack_index *m);
>  
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 65e85debec..acc5e65ecc 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -417,4 +417,21 @@ test_expect_success 'expire removes unreferenced packs' '
>  	)
>  '
>  
> +test_expect_success 'repack with minimum size does not alter existing packs' '
> +	(
> +		cd dup &&
> +		rm -rf .git/objects/pack &&
> +		mv .git/objects/pack-backup .git/objects/pack &&
> +		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
> +		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
> +		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
> +		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
> +		ls .git/objects/pack >expect &&
> +		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> +		git multi-pack-index repack --batch-size=$MINSIZE &&
> +		ls .git/objects/pack >actual &&
> +		test_cmp expect actual
> +	)
> +'
> +
>  test_done

This test fails for me, with the following error:
mv: cannot stat '.git/objects/pack-backup': No such file or directory


> -- 
> gitgitgadget
> 


* Re: [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (10 preceding siblings ...)
  2019-01-24 22:14       ` [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index Jonathan Tan
@ 2019-01-25 23:49       ` Josh Steadmon
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
  12 siblings, 0 replies; 89+ messages in thread
From: Josh Steadmon @ 2019-01-25 23:49 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano

On 2019.01.24 13:51, Derrick Stolee via GitGitGadget wrote:
> The multi-pack-index provides a fast way to find an object among a large
> list of pack-files. It stores a single pack-reference for each object id, so
> duplicate objects are ignored. Among a list of pack-files storing the same
> object, the most-recently modified one is used.
> 
> Create new subcommands for the multi-pack-index builtin.
> 
>  * 'git multi-pack-index expire': If we have a pack-file indexed by the
>    multi-pack-index, but all objects in that pack are duplicated in
>    more-recently modified packs, then delete that pack (and any others like
>    it). Delete the reference to that pack in the multi-pack-index.
>    
>    
>  * 'git multi-pack-index repack --batch-size=': Starting from the oldest
>    pack-files covered by the multi-pack-index, find those whose on-disk size
>    is below the batch size until we have a collection of packs whose sizes
>    add up to the batch size. Create a new pack containing all objects that
>    the multi-pack-index references to those packs.
>    
>    
> 
> This allows us to create a new pattern for repacking objects: run 'repack'.
> After enough time has passed that all Git commands that started before the
> last 'repack' are finished, run 'expire' again. This approach has some
> advantages over the existing "repack everything" model:
> 
>  1. Incremental. We can repack a small batch of objects at a time, instead
>     of repacking all reachable objects. We can also limit ourselves to the
>     objects that do not appear in newer pack-files.
>     
>     
>  2. Highly Available. By adding a new pack-file (and not deleting the old
>     pack-files) we do not interrupt concurrent Git commands, and do not
>     suffer performance degradation. By expiring only pack-files that have no
>     referenced objects, we know that Git commands that are doing normal
>     object lookups* will not be interrupted.
>     
>     
>  3. Note: if someone concurrently runs a Git command that uses
>     get_all_packs(), then that command could try to read the pack-files and
>     pack-indexes that we are deleting during an expire command. Such
>     commands are usually related to object maintenance (i.e. fsck, gc,
>     pack-objects) or are related to less-often-used features (i.e.
>     fast-import, http-backend, server-info).
>     
>     
> 
> We plan to use this approach in VFS for Git to do background maintenance of
> the "shared object cache" which is a Git alternate directory filled with
> packfiles containing commits and trees. We currently download pack-files on
> an hourly basis to keep up-to-date with the central server. The cache
> servers supply packs on an hourly and daily basis, so most of the hourly
> packs become useless after a new daily pack is downloaded. The 'expire'
> command would clear out most of those packs, but many will still remain with
> fewer than 100 objects remaining. The 'repack' command (with a batch size of
> 1-3gb, probably) can condense the remaining packs in commands that run for
> 1-3 min at a time. Since the daily packs range from 100-250mb, we will also
> combine and condense those packs.
> 
> Updates in V2:
> 
>  * Added a method, unlink_pack_path() to remove packfiles, but with the
>    additional check for a .keep file. This borrows logic from 
>    builtin/repack.c.
>    
>    
>  * Modified documentation and commit messages to replace 'verb' with
>    'subcommand'. Simplified the documentation. (I left 'verbs' in the title
>    of the cover letter for consistency.)
>    
>    
> 
> Updates in V3:
> 
>  * There was a bug in the expire logic when simultaneously removing packs
>    and adding uncovered packs, specifically around the pack permutation.
>    This was hard to see during review because I was using the 'pack_perm'
>    array for multiple purposes. First, I was reducing its length, and then I
>    was adding to it and resorting. In V3, I significantly overhauled the
>    logic here, which required some extra commits before implementing
>    'expire'. The final commit includes a test that would cover this case.
> 
> Updates in V4:
> 
>  * More 'verb' and 'command' instances replaced with 'subcommand'. I grepped
>    the patch to check these should be fixed everywhere.
>    
>    
>  * Update the tests to check .keep files (in last patch).
>    
>    
>  * Modify the tests to show the terminating condition of --batch-size when
>    there are three packs that fit under the size, but the first two are
>    large enough to stop adding packs. This required rearranging the packs
>    slightly to get different sizes than we had before. Also, I added 'touch
>    -t' to set the modified times so we can fix the order in which the packs
>    are selected.
>    
>    
>  * Added a comment about the purpose of pack_perm.
>    
>    
> 
> Thanks, -Stolee
> 
> Derrick Stolee (10):
>   repack: refactor pack deletion for future use
>   Docs: rearrange subcommands for multi-pack-index
>   multi-pack-index: prepare for 'expire' subcommand
>   midx: simplify computation of pack name lengths
>   midx: refactor permutation logic and pack sorting
>   multi-pack-index: implement 'expire' subcommand
>   multi-pack-index: prepare 'repack' subcommand
>   midx: implement midx_repack()
>   multi-pack-index: test expire while adding packs
>   midx: add test that 'expire' respects .keep files
> 
>  Documentation/git-multi-pack-index.txt |  26 +-
>  builtin/multi-pack-index.c             |  14 +-
>  builtin/repack.c                       |  14 +-
>  midx.c                                 | 399 ++++++++++++++++++-------
>  midx.h                                 |   2 +
>  packfile.c                             |  28 ++
>  packfile.h                             |   7 +
>  t/t5319-multi-pack-index.sh            | 165 ++++++++++
>  8 files changed, 536 insertions(+), 119 deletions(-)
> 
> 
> base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/92
> 
> Range-diff vs v3:
> 
>   1:  62b393b816 =  1:  62b393b816 repack: refactor pack deletion for future use
>   2:  7886785904 =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
>   3:  f06382b4ae !  3:  628ca46036 multi-pack-index: prepare for 'expire' subcommand
>      @@ -16,7 +16,9 @@
>           Add a test that verifies the 'expire' subcommand is correctly wired,
>           but will still be valid when the verb is implemented. Specifically,
>           create a set of packs that should all have referenced objects and
>      -    should not be removed during an 'expire' operation.
>      +    should not be removed during an 'expire' operation. The packs are
>      +    created carefully to ensure they have a specific order when sorted
>      +    by size. This will be important in a later test.
>       
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>       
>      @@ -95,6 +97,8 @@
>       +	(
>       +		cd dup &&
>       +		git init &&
>      ++		test-tool genrandom "data" 4096 >large_file.txt &&
>      ++		git update-index --add large_file.txt &&
>       +		for i in $(test_seq 1 20)
>       +		do
>       +			test_commit $i
>      @@ -104,24 +108,24 @@
>       +		git branch C HEAD~13 &&
>       +		git branch D HEAD~16 &&
>       +		git branch E HEAD~18 &&
>      -+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
>      -+		refs/heads/E
>      ++		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
>      ++		refs/heads/A
>      ++		^refs/heads/B
>       +		EOF
>      -+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
>      -+		refs/heads/D
>      -+		^refs/heads/E
>      ++		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
>      ++		refs/heads/B
>      ++		^refs/heads/C
>       +		EOF
>       +		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
>       +		refs/heads/C
>       +		^refs/heads/D
>       +		EOF
>      -+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
>      -+		refs/heads/B
>      -+		^refs/heads/C
>      ++		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
>      ++		refs/heads/D
>      ++		^refs/heads/E
>       +		EOF
>      -+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
>      -+		refs/heads/A
>      -+		^refs/heads/B
>      ++		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
>      ++		refs/heads/E
>       +		EOF
>       +		git multi-pack-index write
>       +	)
>   4:  2a763990ae !  4:  d55c1d7ee7 midx: simplify computation of pack name lengths
>      @@ -12,7 +12,7 @@
>           dir not already covered by the multi-pack-index.
>       
>           In anticipation of this becoming more complicated with the 'expire'
>      -    command, simplify the computation by centralizing it to a single
>      +    subcommand, simplify the computation by centralizing it to a single
>           loop before writing the file.
>       
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>   5:  a0d4cc6cb3 !  5:  3950743b96 midx: refactor permutation logic and pack sorting
>      @@ -282,6 +282,12 @@
>        
>       +	QSORT(packs.info, packs.nr, pack_info_compare);
>       +
>      ++	/*
>      ++	 * pack_perm stores a permutation between pack-int-ids from the
>      ++	 * previous multi-pack-index to the new one we are writing:
>      ++	 *
>      ++	 * pack_perm[old_id] = new_id
>      ++	 */
>       +	ALLOC_ARRAY(pack_perm, packs.nr);
>       +	for (i = 0; i < packs.nr; i++) {
>       +		pack_perm[packs.info[i].orig_pack_int_id] = i;
>   6:  4dbff40e7a !  6:  6691d97902 multi-pack-index: implement 'expire' verb
>      @@ -1,8 +1,8 @@
>       Author: Derrick Stolee <dstolee@microsoft.com>
>       
>      -    multi-pack-index: implement 'expire' verb
>      +    multi-pack-index: implement 'expire' subcommand
>       
>      -    The 'git multi-pack-index expire' command looks at the existing
>      +    The 'git multi-pack-index expire' subcommand looks at the existing
>           mult-pack-index, counts the number of objects referenced in each
>           pack-file, deletes the pack-fils with no referenced objects, and
>           rewrites the multi-pack-index to no longer reference those packs.
>      @@ -18,7 +18,7 @@
>       
>           Test that a new pack-file that covers the contents of two other
>           pack-files leads to those pack-files being deleted during the
>      -    expire command. Be sure to read the multi-pack-index to ensure
>      +    expire subcommand. Be sure to read the multi-pack-index to ensure
>           it no longer references those packs.
>       
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>      @@ -161,6 +161,11 @@
>       +		}
>       +	}
>       +
>      + 	/*
>      + 	 * pack_perm stores a permutation between pack-int-ids from the
>      + 	 * previous multi-pack-index to the new one we are writing:
>      +@@
>      + 	 */
>        	ALLOC_ARRAY(pack_perm, packs.nr);
>        	for (i = 0; i < packs.nr; i++) {
>       -		pack_perm[packs.info[i].orig_pack_int_id] = i;
>      @@ -273,7 +278,9 @@
>       +		test_cmp expect actual &&
>       +		ls .git/objects/pack/ | grep idx >expect-idx &&
>       +		test-tool read-midx .git/objects | grep idx >actual-midx &&
>      -+		test_cmp expect-idx actual-midx
>      ++		test_cmp expect-idx actual-midx &&
>      ++		git multi-pack-index verify &&
>      ++		git fsck
>       +	)
>       +'
>       +
>   7:  b39f90ad09 !  7:  f5a8ff21dd multi-pack-index: prepare 'repack' subcommand
>      @@ -11,7 +11,7 @@
>           operation does not interrupt concurrent git commands.
>       
>           Introduce a 'repack' subcommand to 'git multi-pack-index' that
>      -    takes a '--batch-size' option. The verb will inspect the
>      +    takes a '--batch-size' option. The subcommand will inspect the
>           multi-pack-index for referenced pack-files whose size is smaller
>           than the batch size, until collecting a list of pack-files whose
>           sizes sum to larger than the batch size. Then, a new pack-file
>      @@ -26,6 +26,11 @@
>           we specify a small batch size, we will guarantee that future
>           implementations do not change the list of pack-files.
>       
>      +    In addition, we hard-code the modified times of the packs in
>      +    the pack directory to ensure the list of packs sorted by modified
>      +    time matches the order if sorted by size (ascending). This will
>      +    be important in a future test.
>      +
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>       
>        diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
>      @@ -36,15 +41,15 @@
>        	afterward to remove all references to these pack-files.
>        
>       +repack::
>      -+	Collect a batch of pack-files whose size are all at most the
>      -+	size given by --batch-size, but whose sizes sum to larger
>      -+	than --batch-size. The batch is selected by greedily adding
>      -+	small pack-files starting with the oldest pack-files that fit
>      -+	the size. Create a new pack-file containing the objects the
>      -+	multi-pack-index indexes into those pack-files, and rewrite
>      -+	the multi-pack-index to contain that pack-file. A later run
>      -+	of 'git multi-pack-index expire' will delete the pack-files
>      -+	that were part of this batch.
>      ++	Create a new pack-file containing objects in small pack-files
>      ++	referenced by the multi-pack-index. Select the pack-files by
>      ++	examining packs from oldest-to-newest, adding a pack if its
>      ++	size is below the batch size. Stop adding packs when the sum
>      ++	of sizes of the added packs is above the batch size. If the
>      ++	total size does not reach the batch size, then do nothing.
>      ++	Rewrite the multi-pack-index to reference the new pack-file.
>      ++	A later run of 'git multi-pack-index expire' will delete the
>      ++	pack-files that were part of this batch.
>       +
>        
>        EXAMPLES
>      @@ -84,11 +89,18 @@
>       +	if (!strcmp(argv[0], "repack"))
>       +		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
>       +	if (opts.batch_size)
>      -+		die(_("--batch-size option is only for 'repack' verb"));
>      ++		die(_("--batch-size option is only for 'repack' subcommand"));
>       +
>        	if (!strcmp(argv[0], "write"))
>        		return write_midx_file(opts.object_dir);
>        	if (!strcmp(argv[0], "verify"))
>      +@@
>      + 	if (!strcmp(argv[0], "expire"))
>      + 		return expire_midx_packs(opts.object_dir);
>      + 
>      +-	die(_("unrecognized verb: %s"), argv[0]);
>      ++	die(_("unrecognized subcommand: %s"), argv[0]);
>      + }
>       
>        diff --git a/midx.c b/midx.c
>        --- a/midx.c
>      @@ -125,6 +137,12 @@
>       +test_expect_success 'repack with minimum size does not alter existing packs' '
>       +	(
>       +		cd dup &&
>      ++		rm -rf .git/objects/pack &&
>      ++		mv .git/objects/pack-backup .git/objects/pack &&
>      ++		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
>      ++		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
>      ++		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
>      ++		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
>       +		ls .git/objects/pack >expect &&
>       +		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
>       +		git multi-pack-index repack --batch-size=$MINSIZE &&
>   8:  a4c2d5a8e1 !  8:  ba1a1c7bbb midx: implement midx_repack()
>      @@ -149,6 +149,16 @@
>        diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>        --- a/t/t5319-multi-pack-index.sh
>        +++ b/t/t5319-multi-pack-index.sh
>      +@@
>      + 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
>      + 		refs/heads/E
>      + 		EOF
>      +-		git multi-pack-index write
>      ++		git multi-pack-index write &&
>      ++		cp -r .git/objects/pack .git/objects/pack-backup
>      + 	)
>      + '
>      + 
>       @@
>        	)
>        '
>      @@ -156,25 +166,28 @@
>       +test_expect_success 'repack creates a new pack' '
>       +	(
>       +		cd dup &&
>      -+		SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
>      -+		BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
>      -+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
>       +		ls .git/objects/pack/*idx >idx-list &&
>       +		test_line_count = 5 idx-list &&
>      ++		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
>      ++		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
>      ++		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
>      ++		ls .git/objects/pack/*idx >idx-list &&
>      ++		test_line_count = 6 idx-list &&
>       +		test-tool read-midx .git/objects | grep idx >midx-list &&
>      -+		test_line_count = 5 midx-list
>      ++		test_line_count = 6 midx-list
>       +	)
>       +'
>       +
>       +test_expect_success 'expire removes repacked packs' '
>       +	(
>       +		cd dup &&
>      -+		ls -S .git/objects/pack/*pack | head -n 3 >expect &&
>      ++		ls -al .git/objects/pack/*pack &&
>      ++		ls -S .git/objects/pack/*pack | head -n 4 >expect &&
>       +		git multi-pack-index expire &&
>       +		ls -S .git/objects/pack/*pack >actual &&
>       +		test_cmp expect actual &&
>       +		test-tool read-midx .git/objects | grep idx >midx-list &&
>      -+		test_line_count = 3 midx-list
>      ++		test_line_count = 4 midx-list
>       +	)
>       +'
>       +
>   9:  b97fb35ba9 =  9:  b1c6892417 multi-pack-index: test expire while adding packs
>   -:  ---------- > 10:  481b08890f midx: add test that 'expire' respects .keep files
> 
> -- 
> gitgitgadget

With the exception of the broken test in patch 7, and some minor style
diffs, this all looks good to me.


* Re: [PATCH v4 08/10] midx: implement midx_repack()
  2019-01-24 21:52       ` [PATCH v4 08/10] midx: implement midx_repack() Derrick Stolee via GitGitGadget
@ 2019-01-26 17:10         ` Derrick Stolee
  2019-01-27 22:50           ` Junio C Hamano
  0 siblings, 1 reply; 89+ messages in thread
From: Derrick Stolee @ 2019-01-26 17:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: sbeller, peff, jrnieder, avarab, jonathantanmy, Junio C Hamano,
	Derrick Stolee, Josh Steadmon

On 1/24/2019 4:52 PM, Derrick Stolee via GitGitGadget wrote:
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index acc5e65ecc..d6c1353514 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -383,7 +383,8 @@ test_expect_success 'setup expire tests' '
>   		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
>   		refs/heads/E
>   		EOF
> -		git multi-pack-index write
> +		git multi-pack-index write &&
> +		cp -r .git/objects/pack .git/objects/pack-backup
>   	)
>   '

Josh: Thanks for catching the failure in PATCH 7. It's due to this line 
that should be part of that commit, not this one.

Thanks,

-Stolee



* Re: [PATCH v4 08/10] midx: implement midx_repack()
  2019-01-26 17:10         ` Derrick Stolee
@ 2019-01-27 22:50           ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-01-27 22:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, sbeller, peff, jrnieder,
	avarab, jonathantanmy, Derrick Stolee, Josh Steadmon

Derrick Stolee <stolee@gmail.com> writes:

> On 1/24/2019 4:52 PM, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> index acc5e65ecc..d6c1353514 100755
>> --- a/t/t5319-multi-pack-index.sh
>> +++ b/t/t5319-multi-pack-index.sh
>> @@ -383,7 +383,8 @@ test_expect_success 'setup expire tests' '
>>   		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
>>   		refs/heads/E
>>   		EOF
>> -		git multi-pack-index write
>> +		git multi-pack-index write &&
>> +		cp -r .git/objects/pack .git/objects/pack-backup
>>   	)
>>   '
>
> Josh: Thanks for catching the failure in PATCH 7. It's due to this
> line that should be part of that commit, not this one.

Will move the hunk while queuing.  Thanks, both.


* [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
                         ` (11 preceding siblings ...)
  2019-01-25 23:49       ` Josh Steadmon
@ 2019-04-24 15:14       ` " Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 01/11] repack: refactor pack deletion for future use Derrick Stolee
                           ` (12 more replies)
  12 siblings, 13 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The multi-pack-index provides a fast way to find an object among a large list
of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others like it).
  Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
  pack-files covered by the multi-pack-index, find those whose "expected size"
  is below the batch size until we have a collection of packs whose expected
  sizes add up to at least the batch size. We compute the expected size by
  multiplying the number of referenced objects by the pack-size and dividing by
  the total number of objects in the pack. If the batch-size is zero, then
  select all packs. Create a new pack containing all objects that the
  multi-pack-index references in those packs.
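
The selection rule for 'repack' described above can be sketched in C as
follows. This is only an illustration of the prose, not the code from this
series: 'struct pack_summary', expected_size() and select_batch() are
hypothetical names, and the "do nothing if the batch size is never reached"
behavior is an assumption carried over from the v4 documentation text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pack summary; not a struct from midx.c. */
struct pack_summary {
	uint64_t pack_size;          /* on-disk size of the pack-file, in bytes */
	uint32_t num_objects;        /* total number of objects in the pack */
	uint32_t referenced_objects; /* objects the multi-pack-index references here */
};

/*
 * expected size = referenced objects * pack size / total objects.
 * The product is assumed to fit in 64 bits for this sketch.
 */
uint64_t expected_size(const struct pack_summary *p)
{
	if (!p->num_objects)
		return 0;
	return p->pack_size * p->referenced_objects / p->num_objects;
}

/*
 * 'packs' must be ordered from oldest to newest. Sets include[i] to 1 for
 * every pack chosen for the batch and returns the number of chosen packs,
 * or 0 when no batch could be formed.
 */
size_t select_batch(const struct pack_summary *packs, size_t nr,
		    uint64_t batch_size, unsigned char *include)
{
	uint64_t total = 0;
	size_t i, selected = 0;

	for (i = 0; i < nr; i++)
		include[i] = 0;

	/* A batch size of zero means "select all packs". */
	if (!batch_size) {
		for (i = 0; i < nr; i++)
			include[i] = 1;
		return nr;
	}

	/* Greedily add old packs whose expected size is below the batch size. */
	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t size = expected_size(&packs[i]);

		if (size >= batch_size)
			continue;
		include[i] = 1;
		total += size;
		selected++;
	}

	/* Assumption: if the batch size is never reached, select nothing. */
	if (total < batch_size) {
		for (i = 0; i < nr; i++)
			include[i] = 0;
		return 0;
	}
	return selected;
}
```

With this rule, a 1000-byte pack holding 10 objects of which only one is still
referenced has an expected size of 100 bytes, so it can join a batch even when
the batch size is well below its on-disk size.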

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the last
'repack' are finished, run 'expire' again. This approach has some advantages
over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of
repacking all reachable objects. We can also limit ourselves to the objects
that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not suffer
performance degradation. By expiring only pack-files that have no referenced
objects, we know that Git commands that are doing normal object lookups* will
not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(),
  then that command could try to read the pack-files and pack-indexes that we
  are deleting during an expire command. Such commands are usually related to
  object maintenance (e.g. fsck, gc, pack-objects) or are related to
  less-often-used features (e.g. fast-import, http-backend, server-info).
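
The 'expire' rule (delete only packs that have no referenced objects) can be
illustrated with a minimal C sketch. It is not the implementation from this
series: find_expirable_packs() and its flat array of pack ids are hypothetical
stand-ins for the multi-pack-index lookup table, and the sketch ignores .keep
files and the actual unlinking and rewriting of the multi-pack-index.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * object_pack_ids[i] is the id of the single pack that the multi-pack-index
 * references for object i. Sets drop[p] to 1 for every pack p that no object
 * references, and returns the number of such packs.
 */
size_t find_expirable_packs(const uint32_t *object_pack_ids, size_t num_objects,
			    size_t num_packs, unsigned char *drop)
{
	size_t i, dropped = 0;

	/* Start by assuming every pack is unreferenced. */
	for (i = 0; i < num_packs; i++)
		drop[i] = 1;

	/* Any pack referenced by at least one object must be kept. */
	for (i = 0; i < num_objects; i++)
		if (object_pack_ids[i] < num_packs)
			drop[object_pack_ids[i]] = 0;

	for (i = 0; i < num_packs; i++)
		dropped += drop[i];
	return dropped;
}
```

Because each object id maps to exactly one pack, a pack whose objects are all
duplicated in more-recently modified packs ends up with no references and is
the only kind of pack marked for deletion.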

We **are using** this approach in VFS for Git to do background maintenance of
the "shared object cache" which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on an
hourly basis to keep up-to-date with the central server. The cache servers
supply packs on an hourly and daily basis, so most of the hourly packs become
useless after a new daily pack is downloaded. The 'expire' command would clear
out most of those packs, but many will still remain with fewer than 100 objects
left. The 'repack' command (with a batch size of 1-3 GB, probably) can condense
the remaining packs in commands that run for 1-3 minutes at a time. Since the
daily packs range from 100-250 MB, we will also combine and condense those
packs.

Updates in V5:

* Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8.
  Thanks, Josh Steadmon!

* The 'repack' subcommand now computes the "expected size" of a pack instead of
  relying on the total size of the pack. This matters for the way VFS for Git
  uses prefetch packs: some packs were not being repacked because their pack
  size is larger than the batch size, even though they contain only a few
  referenced objects.

* The 'repack' subcommand now allows a batch size of zero to mean "create one
  pack containing all objects in the multi-pack-index". A new commit adds a
  test that hits the boundary cases here and follows it with the 'expire'
  subcommand, so we can show that a cycle of repack-then-expire safely replaces
  the packs.

Junio: It appears that there are some conflicts with the trace2 changes in
master. These are not new to the updates in this version. I saw how you
resolved these conflicts and replaying that resolution should work for you.

Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)


base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
-- 
2.21.0.1096.g1c91fdc207

Range diff against v4:

 1:  62b393b816 =  1:  62b393b816 repack: refactor pack deletion for future use
 2:  7886785904 =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
 3:  628ca46036 =  3:  628ca46036 multi-pack-index: prepare for 'expire' subcommand
 4:  d55c1d7ee7 =  4:  d55c1d7ee7 midx: simplify computation of pack name lengths
 5:  3950743b96 =  5:  3950743b96 midx: refactor permutation logic and pack sorting
 6:  6691d97902 =  6:  6691d97902 multi-pack-index: implement 'expire' subcommand
 7:  f5a8ff21dd !  7:  e66e383231 multi-pack-index: prepare 'repack' subcommand
    @@ -42,14 +42,20 @@
      
     +repack::
     +	Create a new pack-file containing objects in small pack-files
    -+	referenced by the multi-pack-index. Select the pack-files by
    -+	examining packs from oldest-to-newest, adding a pack if its
    -+	size is below the batch size. Stop adding packs when the sum
    -+	of sizes of the added packs is above the batch size. If the
    -+	total size does not reach the batch size, then do nothing.
    -+	Rewrite the multi-pack-index to reference the new pack-file.
    -+	A later run of 'git multi-pack-index expire' will delete the
    -+	pack-files that were part of this batch.
    ++	referenced by the multi-pack-index. If the size given by the
    ++	`--batch-size=<size>` argument is zero, then create a pack
    ++	containing all objects referenced by the multi-pack-index. For
    ++	a non-zero batch size, select the pack-files by examining packs
    ++	from oldest-to-newest, computing the "expected size" by counting
    ++	the number of objects in the pack referenced by the
    ++	multi-pack-index, then divide by the total number of objects in
    ++	the pack and multiply by the pack size. We select packs with
    ++	expected size below the batch size until the set of packs have
    ++	total expected size at least the batch size. If the total size
    ++	does not reach the batch size, then do nothing. If a new pack-
    ++	file is created, rewrite the multi-pack-index to reference the
    ++	new pack-file. A later run of 'git multi-pack-index expire' will
    ++	delete the pack-files that were part of this batch.
     +
      
      EXAMPLES
    @@ -130,6 +136,16 @@
      diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
      --- a/t/t5319-multi-pack-index.sh
      +++ b/t/t5319-multi-pack-index.sh
    +@@
    + 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    + 		refs/heads/E
    + 		EOF
    +-		git multi-pack-index write
    ++		git multi-pack-index write &&
    ++		cp -r .git/objects/pack .git/objects/pack-backup
    + 	)
    + '
    + 
     @@
      	)
      '
 8:  ba1a1c7bbb !  8:  8e2a1140a5 midx: implement midx_repack()
    @@ -2,20 +2,35 @@
     
         midx: implement midx_repack()
     
    -    To repack using a multi-pack-index, first sort all pack-files by
    +    To repack with a non-zero batch-size, first sort all pack-files by
         their modified time. Second, walk those pack-files from oldest
    -    to newest, adding the packs to a list if they are smaller than the
    -    given pack-size. Finally, collect the objects from the multi-pack-
    -    index that are in those packs and send them to 'git pack-objects'.
    +    to newest, compute their expected size, and add the packs to a list
    +    if they are smaller than the given batch-size. Stop when the total
    +    expected size is at least the batch size.
    +
    +    If the batch size is zero, select all packs in the multi-pack-index.
    +
    +    Finally, collect the objects from the multi-pack-index that are in
    +    the selected packs and send them to 'git pack-objects'. Write a new
    +    multi-pack-index that includes the new pack.
    +
    +    Using a batch size of zero is very similar to a standard 'git repack'
    +    command, except that we do not delete the old packs and instead rely
    +    on the new multi-pack-index to prevent new processes from reading the
    +    old packs. This does not disrupt other Git processes that are currently
    +    reading the old packs based on the old multi-pack-index.
     
         While first designing a 'git multi-pack-index repack' operation, I
    -    started by collecting the batches based on the size of the objects
    -    instead of the size of the pack-files. This allows repacking a
    -    large pack-file that has very few referenced objects. However, this
    +    started by collecting the batches based on the actual size of the
    +    objects instead of the size of the pack-files. This allows repacking
    +    a large pack-file that has very few referenced objects. However, this
         came at a significant cost of parsing pack-files instead of simply
         reading the multi-pack-index and getting the file information for
    -    the pack-files. This object-size idea could be a direction for
    -    future expansion in this area.
    +    the pack-files. The "expected size" version provides similar
    +    behavior, but could skip a pack-file if the average object size is
    +    much larger than the actual size of the referenced objects, or
    +    can create a large pack if the actual size of the referenced objects
    +    is larger than the expected size.
     
         Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
     
    @@ -35,69 +50,111 @@
      }
      
     -int midx_repack(const char *object_dir, size_t batch_size)
    -+struct time_and_id {
    ++struct repack_info {
     +	timestamp_t mtime;
    ++	uint32_t referenced_objects;
     +	uint32_t pack_int_id;
     +};
     +
     +static int compare_by_mtime(const void *a_, const void *b_)
      {
    -+	const struct time_and_id *a, *b;
    ++	const struct repack_info *a, *b;
     +
    -+	a = (const struct time_and_id *)a_;
    -+	b = (const struct time_and_id *)b_;
    ++	a = (const struct repack_info *)a_;
    ++	b = (const struct repack_info *)b_;
     +
     +	if (a->mtime < b->mtime)
     +		return -1;
     +	if (a->mtime > b->mtime)
     +		return 1;
    - 	return 0;
    - }
    ++	return 0;
    ++}
     +
    -+int midx_repack(const char *object_dir, size_t batch_size)
    ++static int fill_included_packs_all(struct multi_pack_index *m,
    ++				   unsigned char *include_pack)
     +{
    -+	int result = 0;
    -+	uint32_t i, packs_to_repack;
    -+	size_t total_size;
    -+	struct time_and_id *pack_ti;
    -+	unsigned char *include_pack;
    -+	struct child_process cmd = CHILD_PROCESS_INIT;
    -+	struct strbuf base_name = STRBUF_INIT;
    -+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
    ++	uint32_t i;
     +
    -+	if (!m)
    -+		return 0;
    ++	for (i = 0; i < m->num_packs; i++)
    ++		include_pack[i] = 1;
     +
    -+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
    -+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
    ++	return m->num_packs < 2;
    ++}
    ++
    ++static int fill_included_packs_batch(struct multi_pack_index *m,
    ++				     unsigned char *include_pack,
    ++				     size_t batch_size)
    ++{
    ++	uint32_t i, packs_to_repack;
    ++	size_t total_size;
    ++	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
     +
     +	for (i = 0; i < m->num_packs; i++) {
    -+		pack_ti[i].pack_int_id = i;
    ++		pack_info[i].pack_int_id = i;
     +
     +		if (prepare_midx_pack(m, i))
     +			continue;
     +
    -+		pack_ti[i].mtime = m->packs[i]->mtime;
    ++		pack_info[i].mtime = m->packs[i]->mtime;
    ++	}
    ++
    ++	for (i = 0; batch_size && i < m->num_objects; i++) {
    ++		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
    ++		pack_info[pack_int_id].referenced_objects++;
     +	}
    -+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
    ++
    ++	QSORT(pack_info, m->num_packs, compare_by_mtime);
     +
     +	total_size = 0;
     +	packs_to_repack = 0;
     +	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
    -+		int pack_int_id = pack_ti[i].pack_int_id;
    ++		int pack_int_id = pack_info[i].pack_int_id;
     +		struct packed_git *p = m->packs[pack_int_id];
    ++		size_t expected_size;
     +
     +		if (!p)
     +			continue;
    -+		if (p->pack_size >= batch_size)
    ++		if (open_pack_index(p) || !p->num_objects)
    ++			continue;
    ++
    ++		expected_size = (size_t)(p->pack_size
    ++					 * pack_info[i].referenced_objects);
    ++		expected_size /= p->num_objects;
    ++
    ++		if (expected_size >= batch_size)
     +			continue;
     +
     +		packs_to_repack++;
    -+		total_size += p->pack_size;
    ++		total_size += expected_size;
     +		include_pack[pack_int_id] = 1;
     +	}
     +
    ++	free(pack_info);
    ++
     +	if (total_size < batch_size || packs_to_repack < 2)
    ++		return 1;
    ++
    + 	return 0;
    ++}	
    ++
    ++int midx_repack(const char *object_dir, size_t batch_size)
    ++{
    ++	int result = 0;
    ++	uint32_t i;
    ++	unsigned char *include_pack;
    ++	struct child_process cmd = CHILD_PROCESS_INIT;
    ++	struct strbuf base_name = STRBUF_INIT;
    ++	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
    ++
    ++	if (!m)
    ++		return 0;
    ++
    ++	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
    ++
    ++	if (batch_size) {
    ++		if (fill_included_packs_batch(m, include_pack, batch_size))
    ++			goto cleanup;
    ++	} else if (fill_included_packs_all(m, include_pack))
     +		goto cleanup;
     +
     +	argv_array_push(&cmd.args, "pack-objects");
    @@ -142,23 +199,12 @@
     +	if (m)
     +		close_midx(m);
     +	free(include_pack);
    -+	free(pack_ti);
     +	return result;
    -+}
    + }
     
      diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
      --- a/t/t5319-multi-pack-index.sh
      +++ b/t/t5319-multi-pack-index.sh
    -@@
    - 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    - 		refs/heads/E
    - 		EOF
    --		git multi-pack-index write
    -+		git multi-pack-index write &&
    -+		cp -r .git/objects/pack .git/objects/pack-backup
    - 	)
    - '
    - 
     @@
      	)
      '
 9:  b1c6892417 =  9:  f1e4669bf6 multi-pack-index: test expire while adding packs
10:  481b08890f = 10:  b1e9df9980 midx: add test that 'expire' respects .keep files
 -:  ---------- > 11:  f6217e9fae t5319-multi-pack-index.sh: test batch size zero

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 01/11] repack: refactor pack deletion for future use
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
                           ` (11 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'force_delete' parameter is enabled for the use in repack, but
disabling it will be important for a future caller.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 14 ++------------
 packfile.c       | 28 ++++++++++++++++++++++++++++
 packfile.h       |  7 +++++++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 45583683ee..3d445b34b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-	int i;
 	struct strbuf buf = STRBUF_INIT;
-	size_t plen;
-
-	strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-	plen = buf.len;
-
-	for (i = 0; i < ARRAY_SIZE(exts); i++) {
-		strbuf_setlen(&buf, plen);
-		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
-	}
+	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index d1e6683ffe..bacecb4d0d 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
 	}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	int i;
+	struct strbuf buf = STRBUF_INIT;
+	size_t plen;
+
+	strbuf_addstr(&buf, pack_name);
+	strip_suffix_mem(buf.buf, &buf.len, ".pack");
+	plen = buf.len;
+
+	if (!force_delete) {
+		strbuf_addstr(&buf, ".keep");
+		if (!access(buf.buf, F_OK)) {
+			strbuf_release(&buf);
+			return;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		strbuf_setlen(&buf, plen);
+		strbuf_addstr(&buf, exts[i]);
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 6c4037605d..5b7bcdb1dd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -86,6 +86,13 @@ extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
2.21.0.1096.g1c91fdc207


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 02/11] Docs: rearrange subcommands for multi-pack-index
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 01/11] repack: refactor pack deletion for future use Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
                           ` (10 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git
  Cc: peff, jrnieder, avarab, gitster, Derrick Stolee, Stefan Beller,
	Szeder Gábor

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -23,13 +23,13 @@ OPTIONS
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-	When given as the verb, write a new MIDX file to
-	`<dir>/packs/multi-pack-index`.
+	Write a new MIDX file.
 
 verify::
-	When given as the verb, verify the contents of the MIDX file
-	at `<dir>/packs/multi-pack-index`.
+	Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
2.21.0.1096.g1c91fdc207


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 03/11] multi-pack-index: prepare for 'expire' subcommand
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 01/11] repack: refactor pack deletion for future use Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 04/11] midx: simplify computation of pack name lengths Derrick Stolee
                           ` (9 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the subcommand is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation. The packs are
created carefully to ensure they have a specific order when sorted
by size. This will be important in a later test.
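
To make the "no referenced objects" condition concrete, here is a small
standalone sketch. It is only an illustration under stated assumptions:
'mark_unreferenced_packs()' is a hypothetical helper, and the input array
stands in for the pack-int-id that the multi-pack-index records for each
object. It is not the implementation that follows later in the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * object_pack_ids[j] is the pack-int-id chosen for object j.
 * On return, unreferenced[i] is 1 when no object points at pack i.
 * Returns the number of such packs.
 */
static uint32_t mark_unreferenced_packs(const uint32_t *object_pack_ids,
					size_t nr_objects,
					uint32_t nr_packs,
					unsigned char *unreferenced)
{
	uint32_t i, count = 0;
	size_t j;

	/* Start by assuming every pack is unused. */
	memset(unreferenced, 1, nr_packs);

	/* Any pack that owns at least one object is still needed. */
	for (j = 0; j < nr_objects; j++)
		if (object_pack_ids[j] < nr_packs)
			unreferenced[object_pack_ids[j]] = 0;

	for (i = 0; i < nr_packs; i++)
		count += unreferenced[i];
	return count;
}
```

In the test setup below, every pack holds objects that appear nowhere else,
so a check like this would mark none of them.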

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 49 ++++++++++++++++++++++++++
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
 	Verify the contents of the MIDX file.
 
+expire::
+	Delete the pack-files that are tracked by the MIDX file, but
+	have no objects referenced by the MIDX. Rewrite the MIDX file
+	afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..145de3a46c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -44,6 +44,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 730ff84dff..bb825ef816 100644
--- a/midx.c
+++ b/midx.c
@@ -1025,3 +1025,8 @@ int verify_midx_file(const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 774f652530..e3a2b740b5 100644
--- a/midx.h
+++ b/midx.h
@@ -49,6 +49,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
+int expire_midx_packs(const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 70926b5bc0..a8528f7da0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -348,4 +348,53 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		test-tool genrandom "data" 4096 >large_file.txt &&
+		git update-index --add large_file.txt &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.21.0.1096.g1c91fdc207


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 04/11] midx: simplify computation of pack name lengths
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (2 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
                           ` (8 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
subcommand, simplify the computation by centralizing it to a single
loop before writing the file.
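
The centralized computation can be sketched on its own as follows. This is
a minimal illustration, not the patch itself: 'pack_name_chunk_len()' is a
hypothetical name, and the alignment value of 4 is an assumption standing in
for MIDX_CHUNK_ALIGNMENT, whose definition is not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: stands in for MIDX_CHUNK_ALIGNMENT. */
#define SKETCH_CHUNK_ALIGNMENT 4

/*
 * Sum the lengths of all pack names, counting each terminating NUL,
 * then round the total up to the next multiple of the alignment.
 */
static size_t pack_name_chunk_len(const char *const *names, size_t nr)
{
	size_t len = 0, i;

	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	if (len % SKETCH_CHUNK_ALIGNMENT)
		len += SKETCH_CHUNK_ALIGNMENT -
		       (len % SKETCH_CHUNK_ALIGNMENT);
	return len;
}
```

Because the sum depends only on the final list of names, it can run once
after the list is settled, which is what makes dropping packs from the list
easier to reason about.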

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index bb825ef816..f087bbbe82 100644
--- a/midx.c
+++ b/midx.c
@@ -383,7 +383,6 @@ struct pack_list {
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
-	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
 
@@ -418,7 +417,6 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		}
 
 		packs->names[packs->nr] = xstrdup(file_name);
-		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
 	}
 }
@@ -762,6 +760,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int pack_name_concat_len = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -777,7 +776,6 @@ int write_midx_file(const char *object_dir)
 	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
-	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
@@ -788,7 +786,6 @@ int write_midx_file(const char *object_dir)
 
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
-			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
 	}
@@ -798,10 +795,6 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
-		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
-					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
-
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
@@ -814,6 +807,13 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	for (i = 0; i < packs.nr; i++)
+		pack_name_concat_len += strlen(packs.names[i]) + 1;
+
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -831,7 +831,7 @@ int write_midx_file(const char *object_dir)
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-- 
2.21.0.1096.g1c91fdc207


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 05/11] midx: refactor permutation logic and pack sorting
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (3 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 04/11] midx: simplify computation of pack name lengths Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
                           ` (7 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().
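
The relationship between the sorted pack list and the permutation can be
sketched as below. This is a hedged illustration only: 'struct named_pack'
and 'sort_packs_and_fill_perm()' are hypothetical names, the standard
qsort() is used in place of the QSORT macro, and the way the permutation is
filled is modeled on the removed sort_packs_by_name() rather than on the
final code of this patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down version of a pack record. */
struct named_pack {
	uint32_t orig_pack_int_id; /* position before sorting */
	const char *pack_name;
};

static int named_pack_compare(const void *a_, const void *b_)
{
	const struct named_pack *a = a_;
	const struct named_pack *b = b_;
	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort the packs by name, then record for each original pack-int-id
 * the position it ended up at: perm[old_id] == new_id.
 */
static void sort_packs_and_fill_perm(struct named_pack *info, uint32_t nr,
				     uint32_t *perm)
{
	uint32_t i;

	qsort(info, nr, sizeof(*info), named_pack_compare);
	for (i = 0; i < nr; i++)
		perm[info[i].orig_pack_int_id] = i;
}
```

Entries collected before the sort keep their original pack-int-id, and a
writer can translate each one through perm[] at output time.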

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 156 +++++++++++++++++++++++++--------------------------------
 1 file changed, 69 insertions(+), 87 deletions(-)

diff --git a/midx.c b/midx.c
index f087bbbe82..95c39106b2 100644
--- a/midx.c
+++ b/midx.c
@@ -377,12 +377,23 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_info {
+	uint32_t orig_pack_int_id;
+	char *pack_name;
+	struct packed_git *p;
+};
+
+static int pack_info_compare(const void *_a, const void *_b)
+{
+	struct pack_info *a = (struct pack_info *)_a;
+	struct pack_info *b = (struct pack_info *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
 struct pack_list {
-	struct packed_git **list;
-	char **names;
+	struct pack_info *info;
 	uint32_t nr;
-	uint32_t alloc_list;
-	uint32_t alloc_names;
+	uint32_t alloc;
 	struct multi_pack_index *m;
 };
 
@@ -395,66 +406,32 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		if (packs->m && midx_contains_pack(packs->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
-		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
 
-		packs->list[packs->nr] = add_packed_git(full_path,
-							full_path_len,
-							0);
+		packs->info[packs->nr].p = add_packed_git(full_path,
+							  full_path_len,
+							  0);
 
-		if (!packs->list[packs->nr]) {
+		if (!packs->info[packs->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->list[packs->nr])) {
+		if (open_pack_index(packs->info[packs->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->list[packs->nr]);
-			FREE_AND_NULL(packs->list[packs->nr]);
+			close_pack(packs->info[packs->nr].p);
+			FREE_AND_NULL(packs->info[packs->nr].p);
 			return;
 		}
 
-		packs->names[packs->nr] = xstrdup(file_name);
+		packs->info[packs->nr].pack_name = xstrdup(file_name);
+		packs->info[packs->nr].orig_pack_int_id = packs->nr;
 		packs->nr++;
 	}
 }
 
-struct pack_pair {
-	uint32_t pack_int_id;
-	char *pack_name;
-};
-
-static int pack_pair_compare(const void *_a, const void *_b)
-{
-	struct pack_pair *a = (struct pack_pair *)_a;
-	struct pack_pair *b = (struct pack_pair *)_b;
-	return strcmp(a->pack_name, b->pack_name);
-}
-
-static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
-{
-	uint32_t i;
-	struct pack_pair *pairs;
-
-	ALLOC_ARRAY(pairs, nr_packs);
-
-	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
-		pairs[i].pack_name = pack_names[i];
-	}
-
-	QSORT(pairs, nr_packs, pack_pair_compare);
-
-	for (i = 0; i < nr_packs; i++) {
-		pack_names[i] = pairs[i].pack_name;
-		perm[pairs[i].pack_int_id] = i;
-	}
-
-	free(pairs);
-}
-
 struct pack_midx_entry {
 	struct object_id oid;
 	uint32_t pack_int_id;
@@ -480,7 +457,6 @@ static int midx_oid_compare(const void *_a, const void *_b)
 }
 
 static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
-				      uint32_t *pack_perm,
 				      struct pack_midx_entry *e,
 				      uint32_t pos)
 {
@@ -488,7 +464,7 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
 		return 1;
 
 	nth_midxed_object_oid(&e->oid, m, pos);
-	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->pack_int_id = nth_midxed_pack_int_id(m, pos);
 	e->offset = nth_midxed_offset(m, pos);
 
 	/* consider objects in midx to be from "old" packs */
@@ -522,8 +498,7 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * of a packfile containing the object).
  */
 static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
-						  struct packed_git **p,
-						  uint32_t *perm,
+						  struct pack_info *info,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
 {
@@ -534,7 +509,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	uint32_t start_pack = m ? m->num_packs : 0;
 
 	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++)
-		total_objects += p[cur_pack]->num_objects;
+		total_objects += info[cur_pack].p->num_objects;
 
 	/*
 	 * As we de-duplicate by fanout value, we expect the fanout
@@ -559,7 +534,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				nth_midxed_pack_midx_entry(m, perm,
+				nth_midxed_pack_midx_entry(m,
 							   &entries_by_fanout[nr_fanout],
 							   cur_object);
 				nr_fanout++;
@@ -570,12 +545,12 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
-				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
-			end = get_pack_fanout(p[cur_pack], cur_fanout);
+				start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1);
+			end = get_pack_fanout(info[cur_pack].p, cur_fanout);
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, &entries_by_fanout[nr_fanout]);
 				nr_fanout++;
 			}
 		}
@@ -604,7 +579,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 }
 
 static size_t write_midx_pack_names(struct hashfile *f,
-				    char **pack_names,
+				    struct pack_info *info,
 				    uint32_t num_packs)
 {
 	uint32_t i;
@@ -612,14 +587,14 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(pack_names[i]) + 1;
+		size_t writelen = strlen(info[i].pack_name) + 1;
 
-		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    pack_names[i - 1],
-			    pack_names[i]);
+			    info[i - 1].pack_name,
+			    info[i].pack_name);
 
-		hashwrite(f, pack_names[i], writelen);
+		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -690,6 +665,7 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 }
 
 static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					uint32_t *perm,
 					struct pack_midx_entry *objects, uint32_t nr_objects)
 {
 	struct pack_midx_entry *list = objects;
@@ -699,7 +675,7 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		hashwrite_be32(f, obj->pack_int_id);
+		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
@@ -772,20 +748,17 @@ int write_midx_file(const char *object_dir)
 	packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
-	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
-	packs.list = NULL;
-	packs.names = NULL;
-	ALLOC_ARRAY(packs.list, packs.alloc_list);
-	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	packs.alloc = packs.m ? packs.m->num_packs : 16;
+	packs.info = NULL;
+	ALLOC_ARRAY(packs.info, packs.alloc);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
-			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
 
-			packs.list[packs.nr] = NULL;
-			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].orig_pack_int_id = i;
+			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].p = NULL;
 			packs.nr++;
 		}
 	}
@@ -795,10 +768,7 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
-
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -807,8 +777,21 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	QSORT(packs.info, packs.nr, pack_info_compare);
+
+	/*
+	 * pack_perm stores a permutation between pack-int-ids from the
+	 * previous multi-pack-index to the new one we are writing:
+	 *
+	 * pack_perm[old_id] = new_id
+	 */
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	for (i = 0; i < packs.nr; i++) {
+		pack_perm[packs.info[i].orig_pack_int_id] = i;
+	}
+
 	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.names[i]) + 1;
+		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -879,7 +862,7 @@ int write_midx_file(const char *object_dir)
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.names, packs.nr);
+				written += write_midx_pack_names(f, packs.info, packs.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -891,7 +874,7 @@ int write_midx_file(const char *object_dir)
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -914,15 +897,14 @@ int write_midx_file(const char *object_dir)
 
 cleanup:
 	for (i = 0; i < packs.nr; i++) {
-		if (packs.list[i]) {
-			close_pack(packs.list[i]);
-			free(packs.list[i]);
+		if (packs.info[i].p) {
+			close_pack(packs.info[i].p);
+			free(packs.info[i].p);
 		}
-		free(packs.names[i]);
+		free(packs.info[i].pack_name);
 	}
 
-	free(packs.list);
-	free(packs.names);
+	free(packs.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 06/11] multi-pack-index: implement 'expire' subcommand
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (4 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
                           ` (6 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The 'git multi-pack-index expire' subcommand looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.
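
The counting step can be sketched in isolation. This is a minimal
illustration, not the code in this patch: the flat object_pack[] array
is a hypothetical stand-in for walking the multi-pack-index with
nth_midxed_pack_int_id(), and find_unreferenced_packs() is a made-up
name.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the multi-pack-index: object_pack[i] is the
 * pack-int-id recorded for object i. Writes the ids of packs that no
 * object references into unreferenced[] (which must have room for
 * nr_packs entries) and returns how many were found.
 */
size_t find_unreferenced_packs(const uint32_t *object_pack, size_t nr_objects,
			       uint32_t nr_packs, uint32_t *unreferenced)
{
	uint32_t *count = calloc(nr_packs, sizeof(*count));
	size_t i, nr = 0;
	uint32_t p;

	if (!count)
		return 0;

	/* One pass over the objects: tally references per pack. */
	for (i = 0; i < nr_objects; i++)
		count[object_pack[i]]++;

	/* A pack with a zero tally is a candidate for expiry. */
	for (p = 0; p < nr_packs; p++)
		if (!count[p])
			unreferenced[nr++] = p;

	free(count);
	return nr;
}
```

In the real subcommand a candidate is still skipped if the pack cannot
be prepared or if it has a .keep file.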

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.
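
Because both the known pack names and the drop list are sorted, the
matching can be done in a single merge-style walk. The following is a
rough sketch under that assumption; mark_expired() and its parameters
are hypothetical names. Unlike the loop in this patch, the sketch also
counts drop names that are left over once the pack list is exhausted.

```c
#include <string.h>

/*
 * names[] and drops[] are both sorted with strcmp(). Sets expired[i]
 * to 1 for every name that also appears in drops[] and returns the
 * number of drop names that were never seen in names[].
 */
int mark_expired(const char **names, size_t nr_names,
		 const char **drops, size_t nr_drops,
		 unsigned char *expired)
{
	size_t i = 0, d = 0;
	int missing = 0;

	memset(expired, 0, nr_names);

	while (i < nr_names && d < nr_drops) {
		int cmp = strcmp(names[i], drops[d]);

		if (!cmp) {
			/* Found the pack to drop; advance both lists. */
			expired[i++] = 1;
			d++;
		} else if (cmp > 0) {
			/* We walked past where drops[d] would have been. */
			missing++;
			d++;
		} else {
			i++;
		}
	}

	/* Drop names beyond the last known pack were never seen. */
	missing += (int)(nr_drops - d);
	return missing;
}
```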

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire subcommand. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 120 +++++++++++++++++++++++++++++++++---
 t/t5319-multi-pack-index.sh |  20 ++++++
 2 files changed, 130 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 95c39106b2..299e9b2e8f 100644
--- a/midx.c
+++ b/midx.c
@@ -33,6 +33,8 @@
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+#define PACK_EXPIRED UINT_MAX
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -381,6 +383,7 @@ struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
+	unsigned expired : 1;
 };
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -428,6 +431,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		packs->info[packs->nr].pack_name = xstrdup(file_name);
 		packs->info[packs->nr].orig_pack_int_id = packs->nr;
+		packs->info[packs->nr].expired = 0;
 		packs->nr++;
 	}
 }
@@ -587,13 +591,17 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(info[i].pack_name) + 1;
+		size_t writelen;
+
+		if (info[i].expired)
+			continue;
 
 		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
 			    info[i - 1].pack_name,
 			    info[i].pack_name);
 
+		writelen = strlen(info[i].pack_name) + 1;
 		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
@@ -675,6 +683,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
+		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+			BUG("object %s is in an expired pack with int-id %d",
+			    oid_to_hex(&obj->oid),
+			    obj->pack_int_id);
+
 		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
@@ -721,7 +734,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -737,6 +751,8 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
+	int dropped_packs = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -745,7 +761,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc = packs.m ? packs.m->num_packs : 16;
@@ -759,13 +778,14 @@ int write_midx_file(const char *object_dir)
 			packs.info[packs.nr].orig_pack_int_id = i;
 			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
 			packs.info[packs.nr].p = NULL;
+			packs.info[packs.nr].expired = 0;
 			packs.nr++;
 		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
-	if (packs.m && packs.nr == packs.m->num_packs)
+	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
 	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
@@ -779,6 +799,34 @@ int write_midx_file(const char *object_dir)
 
 	QSORT(packs.info, packs.nr, pack_info_compare);
 
+	if (packs_to_drop && packs_to_drop->nr) {
+		int drop_index = 0;
+		int missing_drops = 0;
+
+		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(packs.info[i].pack_name,
+					 packs_to_drop->items[drop_index].string);
+
+			if (!cmp) {
+				drop_index++;
+				packs.info[i].expired = 1;
+			} else if (cmp > 0) {
+				error(_("did not see pack-file %s to drop"),
+				      packs_to_drop->items[drop_index].string);
+				drop_index++;
+				missing_drops++;
+				i--;
+			} else {
+				packs.info[i].expired = 0;
+			}
+		}
+
+		if (missing_drops) {
+			result = 1;
+			goto cleanup;
+		}
+	}
+
 	/*
 	 * pack_perm stores a permutation between pack-int-ids from the
 	 * previous multi-pack-index to the new one we are writing:
@@ -787,11 +835,18 @@ int write_midx_file(const char *object_dir)
 	 */
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	for (i = 0; i < packs.nr; i++) {
-		pack_perm[packs.info[i].orig_pack_int_id] = i;
+		if (packs.info[i].expired) {
+			dropped_packs++;
+			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+		} else {
+			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+		}
 	}
 
-	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < packs.nr; i++) {
+		if (!packs.info[i].expired)
+			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -807,7 +862,7 @@ int write_midx_file(const char *object_dir)
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	written = write_midx_header(f, num_chunks, packs.nr);
+	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -908,7 +963,12 @@ int write_midx_file(const char *object_dir)
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1010,5 +1070,45 @@ int verify_midx_file(const char *object_dir)
 
 int expire_midx_packs(const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	for (i = 0; i < m->num_packs; i++) {
+		char *pack_name;
+
+		if (count[i])
+			continue;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		if (m->packs[i]->pack_keep)
+			continue;
+
+		pack_name = xstrdup(m->packs[i]->pack_name);
+		close_pack(m->packs[i]);
+		FREE_AND_NULL(m->packs[i]);
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+		unlink_pack_path(pack_name, 0);
+		free(pack_name);
+	}
+
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index a8528f7da0..65e85debec 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -397,4 +397,24 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual &&
+		ls .git/objects/pack/ | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		test_cmp expect-idx actual-midx &&
+		git multi-pack-index verify &&
+		git fsck
+	)
+'
+
 test_done
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 07/11] multi-pack-index: prepare 'repack' subcommand
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (5 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 08/11] midx: implement midx_repack() Derrick Stolee
                           ` (5 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The multi-pack-index is most useful in an environment that has many
pack-files and cannot repack the object store into a single
pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The subcommand will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
the test specifies a small batch size, it also guarantees that the
future implementation does not change the list of pack-files.

In addition, we hard-code the modified times of the packs in
the pack directory to ensure the list of packs sorted by modified
time matches the order if sorted by size (ascending). This will
be important in a future test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 17 +++++++++++++++++
 builtin/multi-pack-index.c             | 12 ++++++++++--
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 20 +++++++++++++++++++-
 5 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..233b2b7862 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,23 @@ expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Create a new pack-file containing objects in small pack-files
+	referenced by the multi-pack-index. If the size given by the
+	`--batch-size=<size>` argument is zero, then create a pack
+	containing all objects referenced by the multi-pack-index. For
+	a non-zero batch size, select the pack-files by examining packs
+	from oldest-to-newest, computing the "expected size" by counting
+	the number of objects in the pack referenced by the
+	multi-pack-index, then divide by the total number of objects in
+	the pack and multiply by the pack size. We select packs with
+	expected size below the batch size until the set of packs have
+	total expected size at least the batch size. If the total size
+	does not reach the batch size, then do nothing. If a new pack-
+	file is created, rewrite the multi-pack-index to reference the
+	new pack-file. A later run of 'git multi-pack-index expire' will
+	delete the pack-files that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 145de3a46c..c66239de33 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,12 +5,13 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -19,6 +20,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -40,6 +43,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' subcommand"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
@@ -47,5 +55,5 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!strcmp(argv[0], "expire"))
 		return expire_midx_packs(opts.object_dir);
 
-	die(_("unrecognized verb: %s"), argv[0]);
+	die(_("unrecognized subcommand: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 299e9b2e8f..768a7dff73 100644
--- a/midx.c
+++ b/midx.c
@@ -1112,3 +1112,8 @@ int expire_midx_packs(const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index e3a2b740b5..394a21ee96 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 int expire_midx_packs(const char *object_dir);
+int midx_repack(const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 65e85debec..26ae8b3f62 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -383,7 +383,8 @@ test_expect_success 'setup expire tests' '
 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
 		refs/heads/E
 		EOF
-		git multi-pack-index write
+		git multi-pack-index write &&
+		cp -r .git/objects/pack .git/objects/pack-backup
 	)
 '
 
@@ -417,4 +418,21 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		rm -rf .git/objects/pack &&
+		mv .git/objects/pack-backup .git/objects/pack &&
+		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
+		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
+		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
+		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 08/11] midx: implement midx_repack()
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (6 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
                           ` (4 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

To repack with a non-zero batch-size, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, compute their expected size, and add the packs to a list
if they are smaller than the given batch-size. Stop when the total
expected size is at least the batch size.
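
The selection walk can be sketched on its own as follows. This is an
illustration under simplifying assumptions, not the code below:
struct pack_stat, by_mtime() and select_batch() are hypothetical
names, and the per-pack sizes and object counts are passed in
directly instead of being read from the packs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-pack summary; field names are illustrative. */
struct pack_stat {
	int64_t mtime;
	uint64_t pack_size;
	uint32_t num_objects;
	uint32_t referenced_objects;
	uint32_t pack_int_id;
};

static int by_mtime(const void *a_, const void *b_)
{
	const struct pack_stat *a = a_, *b = b_;

	if (a->mtime < b->mtime)
		return -1;
	if (a->mtime > b->mtime)
		return 1;
	return 0;
}

/*
 * Sorts packs[] oldest first, then marks include[pack_int_id] = 1 for
 * each pack whose expected size is below batch_size, stopping once the
 * running total reaches batch_size. The caller must zero include[]
 * first. Returns 1 if the batch is worth repacking (total reached and
 * at least two packs selected), 0 otherwise; include[] should be
 * ignored when 0 is returned.
 */
int select_batch(struct pack_stat *packs, uint32_t nr_packs,
		 uint64_t batch_size, unsigned char *include)
{
	uint64_t total = 0;
	uint32_t i, selected = 0;

	qsort(packs, nr_packs, sizeof(*packs), by_mtime);

	for (i = 0; total < batch_size && i < nr_packs; i++) {
		uint64_t expected;

		if (!packs[i].num_objects)
			continue;

		/* Scale the on-disk size by the fraction still referenced. */
		expected = packs[i].pack_size * packs[i].referenced_objects
			   / packs[i].num_objects;

		if (expected >= batch_size)
			continue;

		include[packs[i].pack_int_id] = 1;
		total += expected;
		selected++;
	}

	return total >= batch_size && selected >= 2;
}
```

Requiring at least two selected packs avoids rewriting a single pack
into an equivalent new pack.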

If the batch size is zero, select all packs in the multi-pack-index.

Finally, collect the objects from the multi-pack-index that are in
the selected packs and send them to 'git pack-objects'. Write a new
multi-pack-index that includes the new pack.

Using a batch size of zero is very similar to a standard 'git repack'
command, except that we do not delete the old packs and instead rely
on the new multi-pack-index to prevent new processes from reading the
old packs. This does not disrupt other Git processes that are currently
reading the old packs based on the old multi-pack-index.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the actual size of the
objects instead of the size of the pack-files. This allows repacking
a large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. The "expected size" version provides similar
behavior, but could skip a pack-file if the average object size is
much larger than the actual size of the referenced objects, or
can create a large pack if the actual size of the referenced objects
is larger than the expected size.
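
As a worked illustration of the estimate (the numbers are made up for
this example and do not come from any measurement):

```latex
\[
\text{expected\_size}
  = \left\lfloor
      \frac{\text{pack\_size} \times \text{referenced\_objects}}
           {\text{num\_objects}}
    \right\rfloor
\]
% Example: pack_size = 1,000,000 bytes, num_objects = 100,
% referenced_objects = 10:
\[
\text{expected\_size} = \frac{1{,}000{,}000 \times 10}{100} = 100{,}000
\]
```

With a batch size of 50,000 this pack would be skipped, even if the ten
referenced objects really occupy only 5,000 bytes. With a batch size of
150,000 it would be selected, even if those ten objects really occupy
900,000 bytes, so the resulting pack could be much larger than the
estimate suggests.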

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 150 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  28 +++++++
 2 files changed, 177 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 768a7dff73..01c6a05732 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1113,7 +1114,154 @@ int expire_midx_packs(const char *object_dir)
 	return result;
 }
 
-int midx_repack(const char *object_dir, size_t batch_size)
+struct repack_info {
+	timestamp_t mtime;
+	uint32_t referenced_objects;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct repack_info *a, *b;
+
+	a = (const struct repack_info *)a_;
+	b = (const struct repack_info *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
+	return 0;
+}
+
+static int fill_included_packs_all(struct multi_pack_index *m,
+				   unsigned char *include_pack)
+{
+	uint32_t i;
+
+	for (i = 0; i < m->num_packs; i++)
+		include_pack[i] = 1;
+
+	return m->num_packs < 2;
+}
+
+static int fill_included_packs_batch(struct multi_pack_index *m,
+				     unsigned char *include_pack,
+				     size_t batch_size)
+{
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_info[i].pack_int_id = i;
+
+		if (prepare_midx_pack(m, i))
+			continue;
+
+		pack_info[i].mtime = m->packs[i]->mtime;
+	}
+
+	for (i = 0; batch_size && i < m->num_objects; i++) {
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+		pack_info[pack_int_id].referenced_objects++;
+	}
+
+	QSORT(pack_info, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_info[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+		size_t expected_size;
+
+		if (!p)
+			continue;
+		if (open_pack_index(p) || !p->num_objects)
+			continue;
+
+		expected_size = (size_t)(p->pack_size
+					 * pack_info[i].referenced_objects);
+		expected_size /= p->num_objects;
+
+		if (expected_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += expected_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	free(pack_info);
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		return 1;
+
 	return 0;
+}
+
+int midx_repack(const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+
+	if (batch_size) {
+		if (fill_included_packs_batch(m, include_pack, batch_size))
+			goto cleanup;
+	} else if (fill_included_packs_all(m, include_pack))
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 26ae8b3f62..d6c1353514 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -435,4 +435,32 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 6 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 6 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -al .git/objects/pack/*pack &&
+		ls -S .git/objects/pack/*pack | head -n 4 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 4 midx-list
+	)
+'
+
 test_done
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 09/11] multi-pack-index: test expire while adding packs
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (7 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 08/11] midx: implement midx_repack() Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
                           ` (3 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index d6c1353514..19b769eea0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -463,4 +463,36 @@ test_expect_success 'expire removes repacked packs' '
 	)
 '
 
+test_expect_success 'expire works when adding new packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index expire &&
+		ls .git/objects/pack/ | grep idx >expect &&
+		test-tool read-midx .git/objects | grep idx >actual &&
+		test_cmp expect actual &&
+		git multi-pack-index verify
+	)
+'
+
 test_done
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 10/11] midx: add test that 'expire' respects .keep files
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (8 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-24 15:14         ` [PATCH v5 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
                           ` (2 subsequent siblings)
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The 'git multi-pack-index expire' subcommand may delete packs that
are not needed from the perspective of the multi-pack-index. If
a pack has a .keep file, then we should not delete that pack. Add
a test that ensures we preserve a pack that would otherwise be
expired. First, create a new pack that contains every object in
the repo, then add it to the multi-pack-index. Then create a .keep
file for a pack starting with "a-pack" that was added in the
previous test. Finally, expire and verify that the pack remains
and the other packs were expired.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 19b769eea0..bcfa520401 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -495,4 +495,22 @@ test_expect_success 'expire works when adding new packs' '
 	)
 '
 
+test_expect_success 'expire respects .keep files' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-all <<-EOF &&
+		refs/heads/A
+		EOF
+		git multi-pack-index write &&
+		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
+		touch $PACKA.keep &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
+		test_line_count = 3 a-pack-files &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 2 midx-list
+	)
+'
+
+
 test_done
-- 
2.21.0.1096.g1c91fdc207



* [PATCH v5 11/11] t5319-multi-pack-index.sh: test batch size zero
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (9 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
@ 2019-04-24 15:14         ` Derrick Stolee
  2019-04-25  5:38         ` [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Junio C Hamano
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
  12 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-24 15:14 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, avarab, gitster, Derrick Stolee

The 'git multi-pack-index repack' command can take a batch size of
zero, which creates a new pack-file containing all objects in the
multi-pack-index. The first 'repack' command will create one new
pack-file, and an 'expire' command after that will delete the old
pack-files, as they no longer contain any referenced objects in the
multi-pack-index.

We must remove the .keep file that was added in the previous test
in order to expire that pack-file.

Also test that a 'repack' will do nothing if there is only one
pack-file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bcfa520401..0f116b4b92 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -512,5 +512,24 @@ test_expect_success 'expire respects .keep files' '
 	)
 '
 
+test_expect_success 'repack --batch-size=0 repacks everything' '
+	(
+		cd dup &&
+		rm .git/objects/pack/*.keep &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 2 idx-list &&
+		git multi-pack-index repack --batch-size=0 &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 3 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list &&
+		git multi-pack-index expire &&
+		ls -al .git/objects/pack/*idx >idx-list &&
+		test_line_count = 1 idx-list &&
+		git multi-pack-index repack --batch-size=0 &&
+		ls -al .git/objects/pack/*idx >new-idx-list &&
+		test_cmp idx-list new-idx-list
+	)
+'
 
 test_done
-- 
2.21.0.1096.g1c91fdc207



* Re: [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (10 preceding siblings ...)
  2019-04-24 15:14         ` [PATCH v5 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
@ 2019-04-25  5:38         ` Junio C Hamano
  2019-04-25 11:06           ` Derrick Stolee
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
  12 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2019-04-25  5:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, jrnieder, avarab, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> Updates in V5:
>
> * Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!
>
> * The 'repack' subcommand now computes the "expected size" of a pack instead of
>   relying on the total size of the pack. This is actually really important to
>   the way VFS for Git uses prefetch packs, and some packs are not being
>   repacked because the pack size is larger than the batch size, but really
>   there are only a few referenced objects.
>
> * The 'repack' subcommand now allows a batch size of zero to mean "create one
>   pack containing all objects in the multi-pack-index". A new commit adds a
>   test that hits the boundary cases here, but follows the 'expire' subcommand
>   so we can show that cycle of repack-then-expire to safely replace the packs.

I guess all of them need to tweak the authorship from the gmail
address to the work address on the Signed-off-by: trailer, which I
can do (as I noticed it before applying).



* Re: [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-04-25  5:38         ` [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Junio C Hamano
@ 2019-04-25 11:06           ` Derrick Stolee
  0 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-04-25 11:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jrnieder, avarab, Derrick Stolee

On 4/25/2019 1:38 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> Updates in V5:
>>
>> * Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!
>>
>> * The 'repack' subcommand now computes the "expected size" of a pack instead of
>>   relying on the total size of the pack. This is actually really important to
>>   the way VFS for Git uses prefetch packs, and some packs are not being
>>   repacked because the pack size is larger than the batch size, but really
>>   there are only a few referenced objects.
>>
>> * The 'repack' subcommand now allows a batch size of zero to mean "create one
>>   pack containing all objects in the multi-pack-index". A new commit adds a
>>   test that hits the boundary cases here, but follows the 'expire' subcommand
>>   so we can show that cycle of repack-then-expire to safely replace the packs.
> 
> I guess all of them need to tweak the authorship from the gmail
> address to the work address on the Signed-off-by: trailer, which I
> can do (as I noticed it before applying).

Sorry. Due to the conflicts, GitGitGadget prevented me from submitting in
my normal way, so I pulled out format-patch and send-email for the first
time in a very long time. I manually added new "From: " lines in the bodies
of the patch files, but they got suppressed, I guess.

Thanks,
-Stolee


* [PATCH v6 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index
  2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
                           ` (11 preceding siblings ...)
  2019-04-25  5:38         ` [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Junio C Hamano
@ 2019-05-14 18:47         ` " Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 01/11] repack: refactor pack deletion for future use Derrick Stolee
                             ` (10 more replies)
  12 siblings, 11 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The multi-pack-index provides a fast way to find an object among a large list
of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others like it).
  Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest
  pack-files covered by the multi-pack-index, find those whose "expected size"
  is below the batch size until we have a collection of packs whose expected
  sizes add up to the batch size. We compute the expected size by multiplying
  the number of referenced objects by the pack-size and dividing by the total
  number of objects in the pack. If the batch-size is zero, then select all
  packs. Create a new pack containing all objects that the multi-pack-index
  references to those packs.
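
As a rough illustration of the "expected size" rule above, here is a minimal
shell sketch. The function names and the sample numbers are made up for this
example and are not taken from midx.c; the sketch only mirrors the formula as
stated in this cover letter and assumes every pack holds at least one object.

```shell
#!/bin/sh
# Sketch only: these helper names are illustrative and do not come
# from the series.

# expected_size <pack-size-in-bytes> <referenced-objects> <total-objects>
# Scale the on-disk pack size by the fraction of objects that the
# multi-pack-index still references in that pack.
expected_size () {
	echo $(( $1 * $2 / $3 ))
}

# is_repack_candidate <batch-size> <pack-size> <referenced> <total>
# Succeeds when the pack's expected size is below the batch size.
# A batch size of zero means "select all packs".
is_repack_candidate () {
	batch_size=$1
	size=$(expected_size "$2" "$3" "$4")
	test "$batch_size" -eq 0 || test "$size" -lt "$batch_size"
}

# A 1,000,000-byte pack in which only 10 of 1,000 objects are still
# referenced has an expected size of 10,000 bytes.
expected_size 1000000 10 1000
```

With a batch size of 50,000 bytes that pack would qualify even though its
on-disk size is far larger, which is the situation the expected-size
computation is meant to handle.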

This allows us to create a new pattern for repacking objects: run 'repack',
wait until all Git commands that started before that 'repack' have finished,
then run 'expire'. This approach has some advantages over the existing
"repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of
repacking all reachable objects. We can also limit ourselves to the objects
that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old
pack-files) we do not interrupt concurrent Git commands, and do not suffer
performance degradation. By expiring only pack-files that have no referenced
objects, we know that Git commands that are doing normal object lookups* will
not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(),
  then that command could try to read the pack-files and pack-indexes that we
  are deleting during an expire command. Such commands are usually related to
  object maintenance (e.g. fsck, gc, pack-objects) or are related to
  less-often-used features (e.g. fast-import, http-backend, server-info).
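
The repack-then-expire cycle described above could be driven by a small shell
helper along these lines. This is only a sketch: the helper name, the GIT
override, and the fixed delay are assumptions made for illustration, and a
real scheduler would need its own way to decide that all earlier Git commands
have finished.

```shell
#!/bin/sh
# Illustrative only: midx_cycle, GIT and the delay argument are
# hypothetical and not part of this series.
GIT=${GIT:-git}

# midx_cycle <batch-size-in-bytes> <seconds-to-wait>
midx_cycle () {
	# Combine a batch of small packs into one new pack.
	$GIT multi-pack-index repack --batch-size="$1" &&
	# Crude stand-in for "all Git commands that started before the
	# repack have finished".
	sleep "$2" &&
	# Delete packs that no longer have objects referenced by the
	# multi-pack-index.
	$GIT multi-pack-index expire
}
```

Calling the helper with GIT set to "echo git" prints the two commands instead
of running them, which is a cheap way to check the ordering before pointing
it at a real repository.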

We **are using** this approach in VFS for Git to do background maintenance of
the "shared object cache", which is a Git alternate directory filled with
pack-files containing commits and trees. We currently download pack-files on an
hourly basis to keep up-to-date with the central server. The cache servers
supply packs on an hourly and daily basis, so most of the hourly packs become
useless after a new daily pack is downloaded. The 'expire' command would clear
out most of those packs, but many will remain, each with fewer than 100
referenced objects. The 'repack' command (with a batch size of 1-3 GB,
probably) can condense the remaining packs in commands that run for 1-3
minutes at a time. Since the daily packs range from 100-250 MB, we will also
combine and condense those packs.

Updates in V6:

I rebased onto ds/midx-too-many-packs. Thanks, Junio, for taking that
change first. There were several subtle things that needed to change to
put this change on top:

* We need a repository struct everywhere since we add pack-files to the
  packed_git list now.

* A FREE_AND_NULL() was dropped after closing a pack because the pack
  is still in the packed_git list after opening.

* I noticed some whitespace problems.

I also expect GMail to munge my added "From:" tags, so it will look
like the author is "stolee@gmail.com" instead of
"dstolee@microsoft.com". Sorry for the continued inconvenience here.

Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)

-- 
2.22.0.rc0

 1:  d8d7629fc0 !  1:  3b424f7c2a repack: refactor pack deletion for future use
    @@ -81,8 +81,8 @@
      --- a/packfile.h
      +++ b/packfile.h
     @@
    - extern void clear_delta_base_cache(void);
    - extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
    + void clear_delta_base_cache(void);
    + struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
      
     +/*
     + * Unlink the .pack and associated extension files.
 2:  d8ed299705 =  2:  fe047db570 Docs: rearrange subcommands for multi-pack-index
 3:  166e03dd77 !  3:  b78967a052 multi-pack-index: prepare for 'expire' subcommand
    @@ -42,7 +42,7 @@
      --- a/builtin/multi-pack-index.c
      +++ b/builtin/multi-pack-index.c
     @@
    - #include "midx.h"
    + #include "trace2.h"
      
      static char const * const builtin_multi_pack_index_usage[] = {
     -	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
    @@ -53,9 +53,9 @@
     @@
      		return write_midx_file(opts.object_dir);
      	if (!strcmp(argv[0], "verify"))
    - 		return verify_midx_file(opts.object_dir);
    + 		return verify_midx_file(the_repository, opts.object_dir);
     +	if (!strcmp(argv[0], "expire"))
    -+		return expire_midx_packs(opts.object_dir);
    ++		return expire_midx_packs(the_repository, opts.object_dir);
      
      	die(_("unrecognized verb: %s"), argv[0]);
      }
    @@ -68,7 +68,7 @@
      	return verify_midx_error;
      }
     +
    -+int expire_midx_packs(const char *object_dir)
    ++int expire_midx_packs(struct repository *r, const char *object_dir)
     +{
     +	return 0;
     +}
    @@ -79,8 +79,8 @@
     @@
      int write_midx_file(const char *object_dir);
      void clear_midx_file(struct repository *r);
    - int verify_midx_file(const char *object_dir);
    -+int expire_midx_packs(const char *object_dir);
    + int verify_midx_file(struct repository *r, const char *object_dir);
    ++int expire_midx_packs(struct repository *r, const char *object_dir);
      
      void close_midx(struct multi_pack_index *m);
      
 4:  f82ccd0e16 =  4:  dec7f384ee midx: simplify computation of pack name lengths
 5:  a4ea2a0fe0 =  5:  989d49d0b2 midx: refactor permutation logic and pack sorting
 6:  28b99a74da !  6:  8213541052 multi-pack-index: implement 'expire' subcommand
    @@ -211,7 +211,7 @@
      void clear_midx_file(struct repository *r)
     @@
      
    - int expire_midx_packs(const char *object_dir)
    + int expire_midx_packs(struct repository *r, const char *object_dir)
      {
     -	return 0;
     +	uint32_t i, *count, result = 0;
    @@ -233,7 +233,7 @@
     +		if (count[i])
     +			continue;
     +
    -+		if (prepare_midx_pack(m, i))
    ++		if (prepare_midx_pack(r, m, i))
     +			continue;
     +
     +		if (m->packs[i]->pack_keep)
    @@ -241,7 +241,6 @@
     +
     +		pack_name = xstrdup(m->packs[i]->pack_name);
     +		close_pack(m->packs[i]);
    -+		FREE_AND_NULL(m->packs[i]);
     +
     +		string_list_insert(&packs_to_drop, m->pack_names[i]);
     +		unlink_pack_path(pack_name, 0);
 7:  b1f7b66948 !  7:  1776e36f19 multi-pack-index: prepare 'repack' subcommand
    @@ -65,7 +65,7 @@
      --- a/builtin/multi-pack-index.c
      +++ b/builtin/multi-pack-index.c
     @@
    - #include "midx.h"
    + #include "trace2.h"
      
      static char const * const builtin_multi_pack_index_usage[] = {
     -	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
    @@ -89,11 +89,11 @@
      	};
      
     @@
    - 		return 1;
    - 	}
    + 
    + 	trace2_cmd_mode(argv[0]);
      
     +	if (!strcmp(argv[0], "repack"))
    -+		return midx_repack(opts.object_dir, (size_t)opts.batch_size);
    ++		return midx_repack(the_repository, opts.object_dir, (size_t)opts.batch_size);
     +	if (opts.batch_size)
     +		die(_("--batch-size option is only for 'repack' subcommand"));
     +
    @@ -102,7 +102,7 @@
      	if (!strcmp(argv[0], "verify"))
     @@
      	if (!strcmp(argv[0], "expire"))
    - 		return expire_midx_packs(opts.object_dir);
    + 		return expire_midx_packs(the_repository, opts.object_dir);
      
     -	die(_("unrecognized verb: %s"), argv[0]);
     +	die(_("unrecognized subcommand: %s"), argv[0]);
    @@ -116,7 +116,7 @@
      	return result;
      }
     +
    -+int midx_repack(const char *object_dir, size_t batch_size)
    ++int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
     +{
     +	return 0;
     +}
    @@ -126,9 +126,9 @@
      +++ b/midx.h
     @@
      void clear_midx_file(struct repository *r);
    - int verify_midx_file(const char *object_dir);
    - int expire_midx_packs(const char *object_dir);
    -+int midx_repack(const char *object_dir, size_t batch_size);
    + int verify_midx_file(struct repository *r, const char *object_dir);
    + int expire_midx_packs(struct repository *r, const char *object_dir);
    ++int midx_repack(struct repository *r, const char *object_dir, size_t batch_size);
      
      void close_midx(struct multi_pack_index *m);
      
 8:  6e962fd947 !  8:  ab77ce8afe midx: implement midx_repack()
    @@ -38,9 +38,9 @@
      --- a/midx.c
      +++ b/midx.c
     @@
    - #include "sha1-lookup.h"
      #include "midx.h"
      #include "progress.h"
    + #include "trace2.h"
     +#include "run-command.h"
      
      #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
    @@ -49,7 +49,7 @@
      	return result;
      }
      
    --int midx_repack(const char *object_dir, size_t batch_size)
    +-int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
     +struct repack_info {
     +	timestamp_t mtime;
     +	uint32_t referenced_objects;
    @@ -81,7 +81,8 @@
     +	return m->num_packs < 2;
     +}
     +
    -+static int fill_included_packs_batch(struct multi_pack_index *m,
    ++static int fill_included_packs_batch(struct repository *r,
    ++				     struct multi_pack_index *m,
     +				     unsigned char *include_pack,
     +				     size_t batch_size)
     +{
    @@ -92,7 +93,7 @@
     +	for (i = 0; i < m->num_packs; i++) {
     +		pack_info[i].pack_int_id = i;
     +
    -+		if (prepare_midx_pack(m, i))
    ++		if (prepare_midx_pack(r, m, i))
     +			continue;
     +
     +		pack_info[i].mtime = m->packs[i]->mtime;
    @@ -135,9 +136,9 @@
     +		return 1;
     +
      	return 0;
    -+}	
    + }
     +
    -+int midx_repack(const char *object_dir, size_t batch_size)
    ++int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
     +{
     +	int result = 0;
     +	uint32_t i;
    @@ -152,7 +153,7 @@
     +	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
     +
     +	if (batch_size) {
    -+		if (fill_included_packs_batch(m, include_pack, batch_size))
    ++		if (fill_included_packs_batch(r, m, include_pack, batch_size))
     +			goto cleanup;
     +	} else if (fill_included_packs_all(m, include_pack))
     +		goto cleanup;
    @@ -200,7 +201,7 @@
     +		close_midx(m);
     +	free(include_pack);
     +	return result;
    - }
    ++}
     
      diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
      --- a/t/t5319-multi-pack-index.sh
 9:  321c8990f8 =  9:  80c7d2e581 multi-pack-index: test expire while adding packs
10:  f7e4e76bfe = 10:  8e243939ef midx: add test that 'expire' respects .keep files
11:  5da2603ed5 = 11:  3ed388f0a8 t5319-multi-pack-index.sh: test batch size zero
     


* [PATCH v6 01/11] repack: refactor pack deletion for future use
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'force_delete' parameter is enabled for the use in repack.
Disabling it preserves packs that have a .keep file, which will be
important for a future caller.
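
The intended behavior can be pictured with a rough shell analogue. This is
not the C code from the patch below, only a sketch of the same steps; the
function name is hypothetical and 'rm -f' stands in for unlink().

```shell
#!/bin/sh
# Hypothetical shell analogue of unlink_pack_path(); the real code is
# the C function added to packfile.c by this patch.

# unlink_pack_path_sketch <path-to-.pack-file> <force_delete: 0 or 1>
unlink_pack_path_sketch () {
	base=${1%.pack}
	# Without force_delete, a .keep file protects the whole pack.
	if test "$2" -eq 0 && test -e "$base.keep"
	then
		return 0
	fi
	# Remove the pack and every associated extension file.
	for ext in .pack .idx .keep .bitmap .promisor
	do
		rm -f "$base$ext"
	done
}
```

Calling it with a second argument of 0 on a pack that has a .keep file
leaves every file in place; with 1, all five extensions are removed.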

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/repack.c | 14 ++------------
 packfile.c       | 28 ++++++++++++++++++++++++++++
 packfile.h       |  7 +++++++
 3 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 67f8978043..1f9e6fad1b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -129,19 +129,9 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
-	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
-	int i;
 	struct strbuf buf = STRBUF_INIT;
-	size_t plen;
-
-	strbuf_addf(&buf, "%s/%s", dir_name, base_name);
-	plen = buf.len;
-
-	for (i = 0; i < ARRAY_SIZE(exts); i++) {
-		strbuf_setlen(&buf, plen);
-		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
-	}
+	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
 
diff --git a/packfile.c b/packfile.c
index 060de420d1..683ce5674c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -352,6 +352,34 @@ void close_all_packs(struct raw_object_store *o)
 	}
 }
 
+void unlink_pack_path(const char *pack_name, int force_delete)
+{
+	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	int i;
+	struct strbuf buf = STRBUF_INIT;
+	size_t plen;
+
+	strbuf_addstr(&buf, pack_name);
+	strip_suffix_mem(buf.buf, &buf.len, ".pack");
+	plen = buf.len;
+
+	if (!force_delete) {
+		strbuf_addstr(&buf, ".keep");
+		if (!access(buf.buf, F_OK)) {
+			strbuf_release(&buf);
+			return;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		strbuf_setlen(&buf, plen);
+		strbuf_addstr(&buf, exts[i]);
+		unlink(buf.buf);
+	}
+
+	strbuf_release(&buf);
+}
+
 /*
  * The LRU pack is the one with the oldest MRU window, preferring packs
  * with no used windows, or the oldest mtime if it has no windows allocated.
diff --git a/packfile.h b/packfile.h
index 12baa6118a..09f5222113 100644
--- a/packfile.h
+++ b/packfile.h
@@ -94,6 +94,13 @@ void unuse_pack(struct pack_window **);
 void clear_delta_base_cache(void);
 struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
+/*
+ * Unlink the .pack and associated extension files.
+ * Does not unlink if 'force_delete' is false and the pack-file is
+ * marked as ".keep".
+ */
+extern void unlink_pack_path(const char *pack_name, int force_delete);
+
 /*
  * Make sure that a pointer access into an mmap'd index file is within bounds,
  * and can provide at least 8 bytes of data.
-- 
2.22.0.rc0



* [PATCH v6 02/11] Docs: rearrange subcommands for multi-pack-index
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 01/11] repack: refactor pack deletion for future use Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
                             ` (8 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git
  Cc: peff, jrnieder, gitster, avarab, dstolee, Stefan Beller,
	Szeder Gábor

We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index f7778a2c85..1af406aca2 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -9,7 +9,7 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] <verb>
+'git multi-pack-index' [--object-dir=<dir>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -23,13 +23,13 @@ OPTIONS
 	`<dir>/packs/multi-pack-index` for the current MIDX file, and
 	`<dir>/packs` for the pack-files to index.
 
+The following subcommands are available:
+
 write::
-	When given as the verb, write a new MIDX file to
-	`<dir>/packs/multi-pack-index`.
+	Write a new MIDX file.
 
 verify::
-	When given as the verb, verify the contents of the MIDX file
-	at `<dir>/packs/multi-pack-index`.
+	Verify the contents of the MIDX file.
 
 
 EXAMPLES
-- 
2.22.0.rc0



* [PATCH v6 03/11] multi-pack-index: prepare for 'expire' subcommand
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 01/11] repack: refactor pack deletion for future use Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 04/11] midx: simplify computation of pack name lengths Derrick Stolee
                             ` (7 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the subcommand is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation. The packs are
created carefully to ensure they have a specific order when sorted
by size. This will be important in a later test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt |  5 +++
 builtin/multi-pack-index.c             |  4 ++-
 midx.c                                 |  5 +++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 49 ++++++++++++++++++++++++++
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 1af406aca2..6186c4c936 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -31,6 +31,11 @@ write::
 verify::
 	Verify the contents of the MIDX file.
 
+expire::
+	Delete the pack-files that are tracked by the MIDX file, but
+	have no objects referenced by the MIDX. Rewrite the MIDX file
+	afterward to remove all references to these pack-files.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 72dfd3dadc..ad10d40512 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -6,7 +6,7 @@
 #include "trace2.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
 	NULL
 };
 
@@ -47,6 +47,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
 		return verify_midx_file(the_repository, opts.object_dir);
+	if (!strcmp(argv[0], "expire"))
+		return expire_midx_packs(the_repository, opts.object_dir);
 
 	die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index e7e1fe4d65..3b7da1a360 100644
--- a/midx.c
+++ b/midx.c
@@ -1140,3 +1140,8 @@ int verify_midx_file(struct repository *r, const char *object_dir)
 
 	return verify_midx_error;
 }
+
+int expire_midx_packs(struct repository *r, const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 3eb29731f2..505f1431b7 100644
--- a/midx.h
+++ b/midx.h
@@ -50,6 +50,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir);
+int expire_midx_packs(struct repository *r, const char *object_dir);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1ebf19ec3c..1b2d32f475 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -363,4 +363,53 @@ test_expect_success 'verify incorrect 64-bit offset' '
 		"incorrect object offset"
 '
 
+test_expect_success 'setup expire tests' '
+	mkdir dup &&
+	(
+		cd dup &&
+		git init &&
+		test-tool genrandom "data" 4096 >large_file.txt &&
+		git update-index --add large_file.txt &&
+		for i in $(test_seq 1 20)
+		do
+			test_commit $i
+		done &&
+		git branch A HEAD &&
+		git branch B HEAD~8 &&
+		git branch C HEAD~13 &&
+		git branch D HEAD~16 &&
+		git branch E HEAD~18 &&
+		git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index write
+	)
+'
+
+test_expect_success 'expire does not remove any packs' '
+	(
+		cd dup &&
+		ls .git/objects/pack >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 04/11] midx: simplify computation of pack name lengths
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (2 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
                             ` (6 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
subcommand, simplify the computation by centralizing it to a single
loop before writing the file.
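
As a rough illustration of that single loop, the following standalone
sketch sums the NUL-terminated name lengths and rounds the total up to
the chunk alignment. The helper name padded_pack_names_len() and the
value 4 for CHUNK_ALIGNMENT are assumptions made for this example; the
patch itself uses MIDX_CHUNK_ALIGNMENT and computes the value inline in
write_midx_file().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed alignment for this sketch; the patch uses MIDX_CHUNK_ALIGNMENT. */
#define CHUNK_ALIGNMENT 4

/*
 * Hypothetical helper: sum the lengths of all pack names, counting one
 * byte per name for the terminating NUL, then round the total up to the
 * next multiple of CHUNK_ALIGNMENT.
 */
size_t padded_pack_names_len(const char **names, size_t nr)
{
	size_t i, len = 0;

	assert(names || !nr);

	for (i = 0; i < nr; i++)
		len += strlen(names[i]) + 1;

	if (len % CHUNK_ALIGNMENT)
		len += CHUNK_ALIGNMENT - (len % CHUNK_ALIGNMENT);

	return len;
}
```

For two names of ten characters each, the unpadded total is 22 bytes
and the padded length is 24.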

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 3b7da1a360..62404620ad 100644
--- a/midx.c
+++ b/midx.c
@@ -433,7 +433,6 @@ struct pack_list {
 	uint32_t nr;
 	uint32_t alloc_list;
 	uint32_t alloc_names;
-	size_t pack_name_concat_len;
 	struct multi_pack_index *m;
 };
 
@@ -468,7 +467,6 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		}
 
 		packs->names[packs->nr] = xstrdup(file_name);
-		packs->pack_name_concat_len += strlen(file_name) + 1;
 		packs->nr++;
 	}
 }
@@ -812,6 +810,7 @@ int write_midx_file(const char *object_dir)
 	uint32_t nr_entries, num_large_offsets = 0;
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
+	int pack_name_concat_len = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -827,7 +826,6 @@ int write_midx_file(const char *object_dir)
 	packs.alloc_names = packs.alloc_list;
 	packs.list = NULL;
 	packs.names = NULL;
-	packs.pack_name_concat_len = 0;
 	ALLOC_ARRAY(packs.list, packs.alloc_list);
 	ALLOC_ARRAY(packs.names, packs.alloc_names);
 
@@ -838,7 +836,6 @@ int write_midx_file(const char *object_dir)
 
 			packs.list[packs.nr] = NULL;
 			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
-			packs.pack_name_concat_len += strlen(packs.names[packs.nr]) + 1;
 			packs.nr++;
 		}
 	}
@@ -848,10 +845,6 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	if (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
-		packs.pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
-					      (packs.pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
-
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	sort_packs_by_name(packs.names, packs.nr, pack_perm);
 
@@ -864,6 +857,13 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	for (i = 0; i < packs.nr; i++)
+		pack_name_concat_len += strlen(packs.names[i]) + 1;
+
+	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
+		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
+					(pack_name_concat_len % MIDX_CHUNK_ALIGNMENT);
+
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
@@ -881,7 +881,7 @@ int write_midx_file(const char *object_dir)
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + packs.pack_name_concat_len;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-- 
2.22.0.rc0



* [PATCH v6 05/11] midx: refactor permutation logic and pack sorting
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (3 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 04/11] midx: simplify computation of pack name lengths Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
                             ` (5 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().
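
The relationship between the sorted array and the permutation can be
sketched outside of midx.c as follows. The struct and function names
here (pack_info_sketch, build_pack_perm) are hypothetical and use plain
qsort() rather than Git's QSORT macro; only the perm[old_id] = new_id
convention is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for 'struct pack_info'. */
struct pack_info_sketch {
	uint32_t orig_pack_int_id;
	const char *pack_name;
};

int pack_info_sketch_cmp(const void *a_, const void *b_)
{
	const struct pack_info_sketch *a = a_;
	const struct pack_info_sketch *b = b_;
	return strcmp(a->pack_name, b->pack_name);
}

/*
 * Sort info[] by pack name, then record where each original id ended
 * up, so that perm[old_id] == new_id.
 */
void build_pack_perm(struct pack_info_sketch *info, uint32_t nr,
		     uint32_t *perm)
{
	uint32_t i;

	assert(info && perm);

	qsort(info, nr, sizeof(*info), pack_info_sketch_cmp);

	for (i = 0; i < nr; i++)
		perm[info[i].orig_pack_int_id] = i;
}
```

Because the object entries keep their original pack-int-id, the
permutation only needs to be applied at the moment each id is written
out.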

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 156 +++++++++++++++++++++++++--------------------------------
 1 file changed, 69 insertions(+), 87 deletions(-)

diff --git a/midx.c b/midx.c
index 62404620ad..6d4b84e243 100644
--- a/midx.c
+++ b/midx.c
@@ -427,12 +427,23 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_info {
+	uint32_t orig_pack_int_id;
+	char *pack_name;
+	struct packed_git *p;
+};
+
+static int pack_info_compare(const void *_a, const void *_b)
+{
+	struct pack_info *a = (struct pack_info *)_a;
+	struct pack_info *b = (struct pack_info *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
 struct pack_list {
-	struct packed_git **list;
-	char **names;
+	struct pack_info *info;
 	uint32_t nr;
-	uint32_t alloc_list;
-	uint32_t alloc_names;
+	uint32_t alloc;
 	struct multi_pack_index *m;
 };
 
@@ -445,66 +456,32 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		if (packs->m && midx_contains_pack(packs->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->list, packs->nr + 1, packs->alloc_list);
-		ALLOC_GROW(packs->names, packs->nr + 1, packs->alloc_names);
+		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
 
-		packs->list[packs->nr] = add_packed_git(full_path,
-							full_path_len,
-							0);
+		packs->info[packs->nr].p = add_packed_git(full_path,
+							  full_path_len,
+							  0);
 
-		if (!packs->list[packs->nr]) {
+		if (!packs->info[packs->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->list[packs->nr])) {
+		if (open_pack_index(packs->info[packs->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->list[packs->nr]);
-			FREE_AND_NULL(packs->list[packs->nr]);
+			close_pack(packs->info[packs->nr].p);
+			FREE_AND_NULL(packs->info[packs->nr].p);
 			return;
 		}
 
-		packs->names[packs->nr] = xstrdup(file_name);
+		packs->info[packs->nr].pack_name = xstrdup(file_name);
+		packs->info[packs->nr].orig_pack_int_id = packs->nr;
 		packs->nr++;
 	}
 }
 
-struct pack_pair {
-	uint32_t pack_int_id;
-	char *pack_name;
-};
-
-static int pack_pair_compare(const void *_a, const void *_b)
-{
-	struct pack_pair *a = (struct pack_pair *)_a;
-	struct pack_pair *b = (struct pack_pair *)_b;
-	return strcmp(a->pack_name, b->pack_name);
-}
-
-static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *perm)
-{
-	uint32_t i;
-	struct pack_pair *pairs;
-
-	ALLOC_ARRAY(pairs, nr_packs);
-
-	for (i = 0; i < nr_packs; i++) {
-		pairs[i].pack_int_id = i;
-		pairs[i].pack_name = pack_names[i];
-	}
-
-	QSORT(pairs, nr_packs, pack_pair_compare);
-
-	for (i = 0; i < nr_packs; i++) {
-		pack_names[i] = pairs[i].pack_name;
-		perm[pairs[i].pack_int_id] = i;
-	}
-
-	free(pairs);
-}
-
 struct pack_midx_entry {
 	struct object_id oid;
 	uint32_t pack_int_id;
@@ -530,7 +507,6 @@ static int midx_oid_compare(const void *_a, const void *_b)
 }
 
 static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
-				      uint32_t *pack_perm,
 				      struct pack_midx_entry *e,
 				      uint32_t pos)
 {
@@ -538,7 +514,7 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m,
 		return 1;
 
 	nth_midxed_object_oid(&e->oid, m, pos);
-	e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+	e->pack_int_id = nth_midxed_pack_int_id(m, pos);
 	e->offset = nth_midxed_offset(m, pos);
 
 	/* consider objects in midx to be from "old" packs */
@@ -572,8 +548,7 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * of a packfile containing the object).
  */
 static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
-						  struct packed_git **p,
-						  uint32_t *perm,
+						  struct pack_info *info,
 						  uint32_t nr_packs,
 						  uint32_t *nr_objects)
 {
@@ -584,7 +559,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	uint32_t start_pack = m ? m->num_packs : 0;
 
 	for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++)
-		total_objects += p[cur_pack]->num_objects;
+		total_objects += info[cur_pack].p->num_objects;
 
 	/*
 	 * As we de-duplicate by fanout value, we expect the fanout
@@ -609,7 +584,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				nth_midxed_pack_midx_entry(m, perm,
+				nth_midxed_pack_midx_entry(m,
 							   &entries_by_fanout[nr_fanout],
 							   cur_object);
 				nr_fanout++;
@@ -620,12 +595,12 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 			uint32_t start = 0, end;
 
 			if (cur_fanout)
-				start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
-			end = get_pack_fanout(p[cur_pack], cur_fanout);
+				start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1);
+			end = get_pack_fanout(info[cur_pack].p, cur_fanout);
 
 			for (cur_object = start; cur_object < end; cur_object++) {
 				ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
-				fill_pack_entry(perm[cur_pack], p[cur_pack], cur_object, &entries_by_fanout[nr_fanout]);
+				fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, &entries_by_fanout[nr_fanout]);
 				nr_fanout++;
 			}
 		}
@@ -654,7 +629,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 }
 
 static size_t write_midx_pack_names(struct hashfile *f,
-				    char **pack_names,
+				    struct pack_info *info,
 				    uint32_t num_packs)
 {
 	uint32_t i;
@@ -662,14 +637,14 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(pack_names[i]) + 1;
+		size_t writelen = strlen(info[i].pack_name) + 1;
 
-		if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
+		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    pack_names[i - 1],
-			    pack_names[i]);
+			    info[i - 1].pack_name,
+			    info[i].pack_name);
 
-		hashwrite(f, pack_names[i], writelen);
+		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -740,6 +715,7 @@ static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
 }
 
 static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
+					uint32_t *perm,
 					struct pack_midx_entry *objects, uint32_t nr_objects)
 {
 	struct pack_midx_entry *list = objects;
@@ -749,7 +725,7 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		hashwrite_be32(f, obj->pack_int_id);
+		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
@@ -822,20 +798,17 @@ int write_midx_file(const char *object_dir)
 	packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
-	packs.alloc_list = packs.m ? packs.m->num_packs : 16;
-	packs.alloc_names = packs.alloc_list;
-	packs.list = NULL;
-	packs.names = NULL;
-	ALLOC_ARRAY(packs.list, packs.alloc_list);
-	ALLOC_ARRAY(packs.names, packs.alloc_names);
+	packs.alloc = packs.m ? packs.m->num_packs : 16;
+	packs.info = NULL;
+	ALLOC_ARRAY(packs.info, packs.alloc);
 
 	if (packs.m) {
 		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.list, packs.nr + 1, packs.alloc_list);
-			ALLOC_GROW(packs.names, packs.nr + 1, packs.alloc_names);
+			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
 
-			packs.list[packs.nr] = NULL;
-			packs.names[packs.nr] = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].orig_pack_int_id = i;
+			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
+			packs.info[packs.nr].p = NULL;
 			packs.nr++;
 		}
 	}
@@ -845,10 +818,7 @@ int write_midx_file(const char *object_dir)
 	if (packs.m && packs.nr == packs.m->num_packs)
 		goto cleanup;
 
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	sort_packs_by_name(packs.names, packs.nr, pack_perm);
-
-	entries = get_sorted_entries(packs.m, packs.list, pack_perm, packs.nr, &nr_entries);
+	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -857,8 +827,21 @@ int write_midx_file(const char *object_dir)
 			large_offsets_needed = 1;
 	}
 
+	QSORT(packs.info, packs.nr, pack_info_compare);
+
+	/*
+	 * pack_perm stores a permutation between pack-int-ids from the
+	 * previous multi-pack-index to the new one we are writing:
+	 *
+	 * pack_perm[old_id] = new_id
+	 */
+	ALLOC_ARRAY(pack_perm, packs.nr);
+	for (i = 0; i < packs.nr; i++) {
+		pack_perm[packs.info[i].orig_pack_int_id] = i;
+	}
+
 	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.names[i]) + 1;
+		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -929,7 +912,7 @@ int write_midx_file(const char *object_dir)
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.names, packs.nr);
+				written += write_midx_pack_names(f, packs.info, packs.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -941,7 +924,7 @@ int write_midx_file(const char *object_dir)
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -964,15 +947,14 @@ int write_midx_file(const char *object_dir)
 
 cleanup:
 	for (i = 0; i < packs.nr; i++) {
-		if (packs.list[i]) {
-			close_pack(packs.list[i]);
-			free(packs.list[i]);
+		if (packs.info[i].p) {
+			close_pack(packs.info[i].p);
+			free(packs.info[i].p);
 		}
-		free(packs.names[i]);
+		free(packs.info[i].pack_name);
 	}
 
-	free(packs.list);
-	free(packs.names);
+	free(packs.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
2.22.0.rc0



* [PATCH v6 06/11] multi-pack-index: implement 'expire' subcommand
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (4 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
                             ` (4 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The 'git multi-pack-index expire' subcommand looks at the existing
multi-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-files with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.
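
The counting step can be pictured with a small standalone sketch. It
assumes the per-object pack-int-ids have already been read into a flat
array; the function name find_unreferenced_packs() and its parameters
are hypothetical, whereas the patch reads each id with
nth_midxed_pack_int_id() and also skips packs that have a .keep file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: object_pack_ids[i] is the pack-int-id recorded
 * for object i. On return, unreferenced[p] is 1 when no object points
 * at pack p and 0 otherwise. Returns the number of unreferenced packs.
 */
uint32_t find_unreferenced_packs(const uint32_t *object_pack_ids,
				 uint32_t num_objects,
				 uint32_t num_packs,
				 unsigned char *unreferenced)
{
	uint32_t i, nr_unreferenced = 0;
	uint32_t *count = calloc(num_packs ? num_packs : 1, sizeof(*count));

	assert(count);

	/* Tally how many objects the index attributes to each pack. */
	for (i = 0; i < num_objects; i++) {
		assert(object_pack_ids[i] < num_packs);
		count[object_pack_ids[i]]++;
	}

	/* A pack with a zero tally is a candidate for expiry. */
	for (i = 0; i < num_packs; i++) {
		unreferenced[i] = !count[i];
		nr_unreferenced += unreferenced[i];
	}

	free(count);
	return nr_unreferenced;
}
```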

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.
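
Dropping packs changes the permutation: surviving packs must be
renumbered densely, and dropped packs need a sentinel. A minimal sketch
of that renumbering, with hypothetical names (sorted_pack,
build_perm_with_drops, SKETCH_PACK_EXPIRED) standing in for the
structures in midx.c:

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel for this sketch; the patch defines PACK_EXPIRED as UINT_MAX. */
#define SKETCH_PACK_EXPIRED UINT32_MAX

/* Hypothetical, trimmed-down view of one entry in the sorted pack list. */
struct sorted_pack {
	uint32_t orig_pack_int_id;
	unsigned expired : 1;
};

/*
 * packs[] is assumed to be sorted by name already. Fill perm so that
 * perm[old_id] is the new, densely numbered id of a surviving pack, or
 * SKETCH_PACK_EXPIRED for a dropped one. Returns the number of
 * surviving packs.
 */
uint32_t build_perm_with_drops(const struct sorted_pack *packs,
			       uint32_t nr, uint32_t *perm)
{
	uint32_t i, dropped = 0;

	assert(packs && perm);

	for (i = 0; i < nr; i++) {
		if (packs[i].expired) {
			dropped++;
			perm[packs[i].orig_pack_int_id] = SKETCH_PACK_EXPIRED;
		} else {
			perm[packs[i].orig_pack_int_id] = i - dropped;
		}
	}

	return nr - dropped;
}
```

The return value corresponds to the pack count written to the header,
which the patch computes as packs.nr - dropped_packs.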

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire subcommand. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 119 +++++++++++++++++++++++++++++++++---
 t/t5319-multi-pack-index.sh |  20 ++++++
 2 files changed, 129 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 6d4b84e243..9b0b4c1520 100644
--- a/midx.c
+++ b/midx.c
@@ -34,6 +34,8 @@
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+#define PACK_EXPIRED UINT_MAX
+
 static char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
@@ -431,6 +433,7 @@ struct pack_info {
 	uint32_t orig_pack_int_id;
 	char *pack_name;
 	struct packed_git *p;
+	unsigned expired : 1;
 };
 
 static int pack_info_compare(const void *_a, const void *_b)
@@ -478,6 +481,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 		packs->info[packs->nr].pack_name = xstrdup(file_name);
 		packs->info[packs->nr].orig_pack_int_id = packs->nr;
+		packs->info[packs->nr].expired = 0;
 		packs->nr++;
 	}
 }
@@ -637,13 +641,17 @@ static size_t write_midx_pack_names(struct hashfile *f,
 	size_t written = 0;
 
 	for (i = 0; i < num_packs; i++) {
-		size_t writelen = strlen(info[i].pack_name) + 1;
+		size_t writelen;
+
+		if (info[i].expired)
+			continue;
 
 		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
 			    info[i - 1].pack_name,
 			    info[i].pack_name);
 
+		writelen = strlen(info[i].pack_name) + 1;
 		hashwrite(f, info[i].pack_name, writelen);
 		written += writelen;
 	}
@@ -725,6 +733,11 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee
 	for (i = 0; i < nr_objects; i++) {
 		struct pack_midx_entry *obj = list++;
 
+		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+			BUG("object %s is in an expired pack with int-id %d",
+			    oid_to_hex(&obj->oid),
+			    obj->pack_int_id);
+
 		hashwrite_be32(f, perm[obj->pack_int_id]);
 
 		if (large_offset_needed && obj->offset >> 31)
@@ -771,7 +784,8 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 	return written;
 }
 
-int write_midx_file(const char *object_dir)
+static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+			       struct string_list *packs_to_drop)
 {
 	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
@@ -787,6 +801,8 @@ int write_midx_file(const char *object_dir)
 	struct pack_midx_entry *entries = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
+	int dropped_packs = 0;
+	int result = 0;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name)) {
@@ -795,7 +811,10 @@ int write_midx_file(const char *object_dir)
 			  midx_name);
 	}
 
-	packs.m = load_multi_pack_index(object_dir, 1);
+	if (m)
+		packs.m = m;
+	else
+		packs.m = load_multi_pack_index(object_dir, 1);
 
 	packs.nr = 0;
 	packs.alloc = packs.m ? packs.m->num_packs : 16;
@@ -809,13 +828,14 @@ int write_midx_file(const char *object_dir)
 			packs.info[packs.nr].orig_pack_int_id = i;
 			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
 			packs.info[packs.nr].p = NULL;
+			packs.info[packs.nr].expired = 0;
 			packs.nr++;
 		}
 	}
 
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
 
-	if (packs.m && packs.nr == packs.m->num_packs)
+	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
 	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
@@ -829,6 +849,34 @@ int write_midx_file(const char *object_dir)
 
 	QSORT(packs.info, packs.nr, pack_info_compare);
 
+	if (packs_to_drop && packs_to_drop->nr) {
+		int drop_index = 0;
+		int missing_drops = 0;
+
+		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(packs.info[i].pack_name,
+					 packs_to_drop->items[drop_index].string);
+
+			if (!cmp) {
+				drop_index++;
+				packs.info[i].expired = 1;
+			} else if (cmp > 0) {
+				error(_("did not see pack-file %s to drop"),
+				      packs_to_drop->items[drop_index].string);
+				drop_index++;
+				missing_drops++;
+				i--;
+			} else {
+				packs.info[i].expired = 0;
+			}
+		}
+
+		if (missing_drops) {
+			result = 1;
+			goto cleanup;
+		}
+	}
+
 	/*
 	 * pack_perm stores a permutation between pack-int-ids from the
 	 * previous multi-pack-index to the new one we are writing:
@@ -837,11 +885,18 @@ int write_midx_file(const char *object_dir)
 	 */
 	ALLOC_ARRAY(pack_perm, packs.nr);
 	for (i = 0; i < packs.nr; i++) {
-		pack_perm[packs.info[i].orig_pack_int_id] = i;
+		if (packs.info[i].expired) {
+			dropped_packs++;
+			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+		} else {
+			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+		}
 	}
 
-	for (i = 0; i < packs.nr; i++)
-		pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < packs.nr; i++) {
+		if (!packs.info[i].expired)
+			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
 		pack_name_concat_len += MIDX_CHUNK_ALIGNMENT -
@@ -857,7 +912,7 @@ int write_midx_file(const char *object_dir)
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	written = write_midx_header(f, num_chunks, packs.nr);
+	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -958,7 +1013,12 @@ int write_midx_file(const char *object_dir)
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-	return 0;
+	return result;
+}
+
+int write_midx_file(const char *object_dir)
+{
+	return write_midx_internal(object_dir, NULL, NULL);
 }
 
 void clear_midx_file(struct repository *r)
@@ -1125,5 +1185,44 @@ int verify_midx_file(struct repository *r, const char *object_dir)
 
 int expire_midx_packs(struct repository *r, const char *object_dir)
 {
-	return 0;
+	uint32_t i, *count, result = 0;
+	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	count = xcalloc(m->num_packs, sizeof(uint32_t));
+	for (i = 0; i < m->num_objects; i++) {
+		int pack_int_id = nth_midxed_pack_int_id(m, i);
+		count[pack_int_id]++;
+	}
+
+	for (i = 0; i < m->num_packs; i++) {
+		char *pack_name;
+
+		if (count[i])
+			continue;
+
+		if (prepare_midx_pack(r, m, i))
+			continue;
+
+		if (m->packs[i]->pack_keep)
+			continue;
+
+		pack_name = xstrdup(m->packs[i]->pack_name);
+		close_pack(m->packs[i]);
+
+		string_list_insert(&packs_to_drop, m->pack_names[i]);
+		unlink_pack_path(pack_name, 0);
+		free(pack_name);
+	}
+
+	free(count);
+
+	if (packs_to_drop.nr)
+		result = write_midx_internal(object_dir, m, &packs_to_drop);
+
+	string_list_clear(&packs_to_drop, 0);
+	return result;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1b2d32f475..12570fe7ac 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -412,4 +412,24 @@ test_expect_success 'expire does not remove any packs' '
 	)
 '
 
+test_expect_success 'expire removes unreferenced packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/C
+		EOF
+		git multi-pack-index write &&
+		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		git multi-pack-index expire &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual &&
+		ls .git/objects/pack/ | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		test_cmp expect-idx actual-midx &&
+		git multi-pack-index verify &&
+		git fsck
+	)
+'
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 07/11] multi-pack-index: prepare 'repack' subcommand
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (5 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 08/11] midx: implement midx_repack() Derrick Stolee
                             ` (3 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The multi-pack-index is most useful when an object store has many
pack-files and cannot be repacked into a single pack-file. However,
it is likely that many of these pack-files are rather small and
could be repacked into a slightly larger pack-file without too much
effort. It may also be important to ensure the object store is
highly available and the repack operation does not interrupt
concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The subcommand will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.
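
One possible reading of that selection rule is sketched below. It is
not the implementation from this series (midx_repack() is only a stub
in this patch). The sketch assumes the sizes are already ordered oldest
to newest, uses raw sizes rather than the "expected size" described in
the documentation, treats the batch as full once the total reaches the
batch size, and does not model a batch size of zero. The name
select_repack_batch() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: sizes[] holds pack sizes ordered oldest to
 * newest. Mark packs smaller than batch_size in selected[] until their
 * sizes sum to at least batch_size. Returns the number of packs
 * selected, or 0 (with selected[] cleared) when the batch cannot be
 * filled.
 */
size_t select_repack_batch(const size_t *sizes, size_t nr,
			   size_t batch_size, unsigned char *selected)
{
	size_t i, total = 0, nr_selected = 0;

	assert(sizes && selected);
	memset(selected, 0, nr);

	for (i = 0; i < nr && total < batch_size; i++) {
		/* Packs at or above the batch size are left alone. */
		if (sizes[i] >= batch_size)
			continue;

		selected[i] = 1;
		total += sizes[i];
		nr_selected++;
	}

	/* Not enough small packs to fill a batch: select nothing. */
	if (total < batch_size) {
		memset(selected, 0, nr);
		return 0;
	}

	return nr_selected;
}
```

With sizes {10, 500, 20, 30, 40} and a batch size of 50, this picks the
packs of size 10, 20 and 30 and stops before reaching the last pack.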

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

In addition, we hard-code the modified times of the packs in
the pack directory to ensure the list of packs sorted by modified
time matches the order if sorted by size (ascending). This will
be important in a future test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-multi-pack-index.txt | 17 +++++++++++++++++
 builtin/multi-pack-index.c             | 12 ++++++++++--
 midx.c                                 |  5 +++++
 midx.h                                 |  1 +
 t/t5319-multi-pack-index.sh            | 20 +++++++++++++++++++-
 5 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 6186c4c936..233b2b7862 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -36,6 +36,23 @@ expire::
 	have no objects referenced by the MIDX. Rewrite the MIDX file
 	afterward to remove all references to these pack-files.
 
+repack::
+	Create a new pack-file containing objects in small pack-files
+	referenced by the multi-pack-index. If the size given by the
+	`--batch-size=<size>` argument is zero, then create a pack
+	containing all objects referenced by the multi-pack-index. For
+	a non-zero batch size, select the pack-files by examining packs
+	from oldest-to-newest, computing the "expected size" by counting
+	the number of objects in the pack referenced by the
+	multi-pack-index, then divide by the total number of objects in
+	the pack and multiply by the pack size. We select packs with
+	expected size below the batch size until the set of packs have
+	total expected size at least the batch size. If the total size
+	does not reach the batch size, then do nothing. If a new pack-
+	file is created, rewrite the multi-pack-index to reference the
+	new pack-file. A later run of 'git multi-pack-index expire' will
+	delete the pack-files that were part of this batch.
+
 
 EXAMPLES
 --------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index ad10d40512..b1ea1a6aa1 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -6,12 +6,13 @@
 #include "trace2.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire)"),
+	N_("git multi-pack-index [--object-dir=<dir>] (write|verify|expire|repack --batch-size=<size>)"),
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	unsigned long batch_size;
 } opts;
 
 int cmd_multi_pack_index(int argc, const char **argv,
@@ -20,6 +21,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	static struct option builtin_multi_pack_index_options[] = {
 		OPT_FILENAME(0, "object-dir", &opts.object_dir,
 		  N_("object directory containing set of packfile and pack-index pairs")),
+		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
+		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
@@ -43,6 +46,11 @@ int cmd_multi_pack_index(int argc, const char **argv,
 
 	trace2_cmd_mode(argv[0]);
 
+	if (!strcmp(argv[0], "repack"))
+		return midx_repack(the_repository, opts.object_dir, (size_t)opts.batch_size);
+	if (opts.batch_size)
+		die(_("--batch-size option is only for 'repack' subcommand"));
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
@@ -50,5 +58,5 @@ int cmd_multi_pack_index(int argc, const char **argv,
 	if (!strcmp(argv[0], "expire"))
 		return expire_midx_packs(the_repository, opts.object_dir);
 
-	die(_("unrecognized verb: %s"), argv[0]);
+	die(_("unrecognized subcommand: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index 9b0b4c1520..fbed8a8adb 100644
--- a/midx.c
+++ b/midx.c
@@ -1226,3 +1226,8 @@ int expire_midx_packs(struct repository *r, const char *object_dir)
 	string_list_clear(&packs_to_drop, 0);
 	return result;
 }
+
+int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
index 505f1431b7..f0ae656b5d 100644
--- a/midx.h
+++ b/midx.h
@@ -51,6 +51,7 @@ int write_midx_file(const char *object_dir);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir);
 int expire_midx_packs(struct repository *r, const char *object_dir);
+int midx_repack(struct repository *r, const char *object_dir, size_t batch_size);
 
 void close_midx(struct multi_pack_index *m);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 12570fe7ac..133d5b7068 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -398,7 +398,8 @@ test_expect_success 'setup expire tests' '
 		git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
 		refs/heads/E
 		EOF
-		git multi-pack-index write
+		git multi-pack-index write &&
+		cp -r .git/objects/pack .git/objects/pack-backup
 	)
 '
 
@@ -432,4 +433,21 @@ test_expect_success 'expire removes unreferenced packs' '
 	)
 '
 
+test_expect_success 'repack with minimum size does not alter existing packs' '
+	(
+		cd dup &&
+		rm -rf .git/objects/pack &&
+		mv .git/objects/pack-backup .git/objects/pack &&
+		touch -m -t 201901010000 .git/objects/pack/pack-D* &&
+		touch -m -t 201901010001 .git/objects/pack/pack-C* &&
+		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
+		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
+		ls .git/objects/pack >expect &&
+		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		git multi-pack-index repack --batch-size=$MINSIZE &&
+		ls .git/objects/pack >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 08/11] midx: implement midx_repack()
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (6 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
                             ` (2 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

To repack with a non-zero batch-size, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, compute their expected size, and add the packs to a list
if they are smaller than the given batch-size. Stop when the total
expected size is at least the batch size.

If the batch size is zero, select all packs in the multi-pack-index.

Finally, collect the objects from the multi-pack-index that are in
the selected packs and send them to 'git pack-objects'. Write a new
multi-pack-index that includes the new pack.
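
The selection steps above can be sketched in isolation. This is a
minimal illustration, not the code from the patch: `struct pack_sketch`,
`by_mtime` and `select_batch` are hypothetical names, and the per-pack
numbers are assumed to have been gathered already from the
multi-pack-index and the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-pack data; not Git's real structures. */
struct pack_sketch {
	long mtime;                  /* modified time of the pack-file */
	uint64_t pack_size;          /* on-disk size in bytes */
	uint32_t num_objects;        /* objects stored in the pack */
	uint32_t referenced_objects; /* objects the index attributes to this pack */
	int included;                /* set to 1 when chosen for the batch */
};

static int by_mtime(const void *a_, const void *b_)
{
	const struct pack_sketch *a = a_, *b = b_;

	return (a->mtime > b->mtime) - (a->mtime < b->mtime);
}

/*
 * Sort packs oldest-first, then walk them and mark every pack whose
 * expected size is below batch_size until the running total reaches
 * batch_size. Returns the number of packs chosen, or 0 when no batch
 * should be built (total too small, or fewer than two packs chosen).
 * Clearing the marks on failure is a choice made by this sketch.
 */
size_t select_batch(struct pack_sketch *packs, size_t nr, uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i, chosen = 0;

	assert(batch_size > 0);
	qsort(packs, nr, sizeof(*packs), by_mtime);

	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t expected;

		if (!packs[i].num_objects)
			continue;
		expected = packs[i].pack_size * packs[i].referenced_objects
			   / packs[i].num_objects;
		if (expected >= batch_size)
			continue; /* too large to be worth rewriting */

		packs[i].included = 1;
		total += expected;
		chosen++;
	}

	if (total < batch_size || chosen < 2) {
		for (i = 0; i < nr; i++)
			packs[i].included = 0;
		return 0;
	}
	return chosen;
}
```

With a batch size of 80, packs whose expected sizes are 60, 500, 25 and
100 (oldest to newest) give a batch of the first and third packs: the
second is skipped as too large, and the walk stops once the total
reaches 85.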

Using a batch size of zero is very similar to a standard 'git repack'
command, except that we do not delete the old packs and instead rely
on the new multi-pack-index to prevent new processes from reading the
old packs. This does not disrupt other Git processes that are currently
reading the old packs based on the old multi-pack-index.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the actual size of the
objects instead of the size of the pack-files. This allows repacking
a large pack-file that has very few referenced objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. The "expected size" version provides similar
behavior, but could skip a pack-file if the average object size is
much larger than the actual size of the referenced objects, or
can create a large pack if the actual size of the referenced objects
is larger than the expected size.
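
To make the two failure modes concrete, here is a small sketch of the
"expected size" estimate. The helper names `expected_size` and
`skipped_for_batch` are hypothetical, and the 64-bit intermediate is a
choice of this sketch (very large inputs could still overflow it).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the bytes a pack contributes: scale its on-disk size by the
 * fraction of its objects that the multi-pack-index references.
 * Integer division truncates toward zero.
 */
uint64_t expected_size(uint64_t pack_size, uint32_t referenced, uint32_t total)
{
	assert(total > 0 && referenced <= total);
	return pack_size * referenced / total;
}

/* A pack is left out of the batch when its estimate reaches the batch size. */
int skipped_for_batch(uint64_t expected, uint64_t batch_size)
{
	return expected >= batch_size;
}
```

For a 1,000,000-byte pack holding 1,000 objects, 10 referenced objects
give an estimate of 10,000 bytes no matter how large those 10 objects
really are, so the pack can be selected and produce a bigger result than
expected. With 999 referenced objects the estimate is 999,000 bytes even
if nearly all of the pack's bytes belong to the one unreferenced object,
so a batch size of 50,000 skips it.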

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 151 +++++++++++++++++++++++++++++++++++-
 t/t5319-multi-pack-index.sh |  28 +++++++
 2 files changed, 178 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index fbed8a8adb..d649644420 100644
--- a/midx.c
+++ b/midx.c
@@ -9,6 +9,7 @@
 #include "midx.h"
 #include "progress.h"
 #include "trace2.h"
+#include "run-command.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -1227,7 +1228,155 @@ int expire_midx_packs(struct repository *r, const char *object_dir)
 	return result;
 }
 
-int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
+struct repack_info {
+	timestamp_t mtime;
+	uint32_t referenced_objects;
+	uint32_t pack_int_id;
+};
+
+static int compare_by_mtime(const void *a_, const void *b_)
 {
+	const struct repack_info *a, *b;
+
+	a = (const struct repack_info *)a_;
+	b = (const struct repack_info *)b_;
+
+	if (a->mtime < b->mtime)
+		return -1;
+	if (a->mtime > b->mtime)
+		return 1;
+	return 0;
+}
+
+static int fill_included_packs_all(struct multi_pack_index *m,
+				   unsigned char *include_pack)
+{
+	uint32_t i;
+
+	for (i = 0; i < m->num_packs; i++)
+		include_pack[i] = 1;
+
+	return m->num_packs < 2;
+}
+
+static int fill_included_packs_batch(struct repository *r,
+				     struct multi_pack_index *m,
+				     unsigned char *include_pack,
+				     size_t batch_size)
+{
+	uint32_t i, packs_to_repack;
+	size_t total_size;
+	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
+
+	for (i = 0; i < m->num_packs; i++) {
+		pack_info[i].pack_int_id = i;
+
+		if (prepare_midx_pack(r, m, i))
+			continue;
+
+		pack_info[i].mtime = m->packs[i]->mtime;
+	}
+
+	for (i = 0; batch_size && i < m->num_objects; i++) {
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+		pack_info[pack_int_id].referenced_objects++;
+	}
+
+	QSORT(pack_info, m->num_packs, compare_by_mtime);
+
+	total_size = 0;
+	packs_to_repack = 0;
+	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
+		int pack_int_id = pack_info[i].pack_int_id;
+		struct packed_git *p = m->packs[pack_int_id];
+		size_t expected_size;
+
+		if (!p)
+			continue;
+		if (open_pack_index(p) || !p->num_objects)
+			continue;
+
+		expected_size = (size_t)(p->pack_size
+					 * pack_info[i].referenced_objects);
+		expected_size /= p->num_objects;
+
+		if (expected_size >= batch_size)
+			continue;
+
+		packs_to_repack++;
+		total_size += expected_size;
+		include_pack[pack_int_id] = 1;
+	}
+
+	free(pack_info);
+
+	if (total_size < batch_size || packs_to_repack < 2)
+		return 1;
+
 	return 0;
 }
+
+int midx_repack(struct repository *r, const char *object_dir, size_t batch_size)
+{
+	int result = 0;
+	uint32_t i;
+	unsigned char *include_pack;
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf base_name = STRBUF_INIT;
+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+
+	if (!m)
+		return 0;
+
+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
+
+	if (batch_size) {
+		if (fill_included_packs_batch(r, m, include_pack, batch_size))
+			goto cleanup;
+	} else if (fill_included_packs_all(m, include_pack))
+		goto cleanup;
+
+	argv_array_push(&cmd.args, "pack-objects");
+
+	strbuf_addstr(&base_name, object_dir);
+	strbuf_addstr(&base_name, "/pack/pack");
+	argv_array_push(&cmd.args, base_name.buf);
+	strbuf_release(&base_name);
+
+	cmd.git_cmd = 1;
+	cmd.in = cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		error(_("could not start pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (i = 0; i < m->num_objects; i++) {
+		struct object_id oid;
+		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+
+		if (!include_pack[pack_int_id])
+			continue;
+
+		nth_midxed_object_oid(&oid, m, i);
+		xwrite(cmd.in, oid_to_hex(&oid), the_hash_algo->hexsz);
+		xwrite(cmd.in, "\n", 1);
+	}
+	close(cmd.in);
+
+	if (finish_command(&cmd)) {
+		error(_("could not finish pack-objects"));
+		result = 1;
+		goto cleanup;
+	}
+
+	result = write_midx_internal(object_dir, m, NULL);
+	m = NULL;
+
+cleanup:
+	if (m)
+		close_midx(m);
+	free(include_pack);
+	return result;
+}
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 133d5b7068..6e47e5d0b2 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -450,4 +450,32 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 	)
 '
 
+test_expect_success 'repack creates a new pack' '
+	(
+		cd dup &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 5 idx-list &&
+		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
+		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 6 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 6 midx-list
+	)
+'
+
+test_expect_success 'expire removes repacked packs' '
+	(
+		cd dup &&
+		ls -al .git/objects/pack/*pack &&
+		ls -S .git/objects/pack/*pack | head -n 4 >expect &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/*pack >actual &&
+		test_cmp expect actual &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 4 midx-list
+	)
+'
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 09/11] multi-pack-index: test expire while adding packs
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (7 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 08/11] midx: implement midx_repack() Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.
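
As a rough illustration of the ordering the test relies on, the sketch
below sorts a few pack names bytewise. The names and their shortened
hash suffixes are made up for the example, and `sort_pack_names` is a
hypothetical helper rather than anything in midx.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_names(const void *a, const void *b)
{
	/* Each element is a 'const char *', so dereference once. */
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort pack names in bytewise (strcmp) lexicographic order. */
void sort_pack_names(const char **names, size_t nr)
{
	assert(names || !nr);
	qsort(names, nr, sizeof(*names), cmp_names);
}
```

Sorting "pack-combined-2222.idx", "z-pack-3333.idx", "a-pack-1111.idx"
and "pack-E-4444.idx" this way places the "a-pack" name first and the
"z-pack" name last, with the existing "pack-" names in between, which is
the before-and-after arrangement the test sets up.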

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 6e47e5d0b2..8e04ce2821 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -478,4 +478,36 @@ test_expect_success 'expire removes repacked packs' '
 	)
 '
 
+test_expect_success 'expire works when adding new packs' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/A
+		^refs/heads/B
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/B
+		^refs/heads/C
+		EOF
+		git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
+		refs/heads/C
+		^refs/heads/D
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
+		refs/heads/D
+		^refs/heads/E
+		EOF
+		git multi-pack-index write &&
+		git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
+		refs/heads/E
+		EOF
+		git multi-pack-index expire &&
+		ls .git/objects/pack/ | grep idx >expect &&
+		test-tool read-midx .git/objects | grep idx >actual &&
+		test_cmp expect actual &&
+		git multi-pack-index verify
+	)
+'
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 10/11] midx: add test that 'expire' respects .keep files
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (8 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  2019-05-14 18:47           ` [PATCH v6 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The 'git multi-pack-index expire' subcommand may delete packs that
are not needed from the perspective of the multi-pack-index. If
a pack has a .keep file, then we should not delete that pack. Add
a test that ensures we preserve a pack that would otherwise be
expired. First, create a new pack that contains every object in
the repo, then add it to the multi-pack-index. Then create a .keep
file for a pack starting with "a-pack" that was added in the
previous test. Finally, expire and verify that the pack remains
and the other packs were expired.
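
The naming convention the test depends on (a "<base>.keep" file sitting
next to "<base>.pack") can be sketched as follows. Both helpers are
hypothetical and only illustrate the convention; they do not reproduce
how Git itself tracks kept packs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<base>.keep" for a path of the form "<base>.pack" into 'out'.
 * Returns 0 on success, -1 if the path does not end in ".pack" or the
 * buffer is too small.
 */
int keep_path_for_pack(const char *pack_path, char *out, size_t out_len)
{
	const char *suffix = ".pack";
	size_t suffix_len = strlen(suffix);
	size_t len;

	assert(pack_path && out);
	len = strlen(pack_path);
	if (len <= suffix_len ||
	    strcmp(pack_path + len - suffix_len, suffix))
		return -1;
	if (len - suffix_len + strlen(".keep") + 1 > out_len)
		return -1;

	snprintf(out, out_len, "%.*s.keep",
		 (int)(len - suffix_len), pack_path);
	return 0;
}

/* Returns 1 if a readable .keep file exists beside the pack, else 0. */
int pack_has_keep_file(const char *pack_path)
{
	char keep[4096];
	FILE *f;

	if (keep_path_for_pack(pack_path, keep, sizeof(keep)))
		return 0;
	f = fopen(keep, "r");
	if (!f)
		return 0;
	fclose(f);
	return 1;
}
```

An expire-like loop built on this would skip deletion whenever
pack_has_keep_file() returns 1, which is the behavior the test checks by
counting the three surviving "a-pack" files (.idx, .keep, .pack).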

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 8e04ce2821..c288901401 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -510,4 +510,22 @@ test_expect_success 'expire works when adding new packs' '
 	)
 '
 
+test_expect_success 'expire respects .keep files' '
+	(
+		cd dup &&
+		git pack-objects --revs .git/objects/pack/pack-all <<-EOF &&
+		refs/heads/A
+		EOF
+		git multi-pack-index write &&
+		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
+		touch $PACKA.keep &&
+		git multi-pack-index expire &&
+		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
+		test_line_count = 3 a-pack-files &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 2 midx-list
+	)
+'
+
+
 test_done
-- 
2.22.0.rc0



* [PATCH v6 11/11] t5319-multi-pack-index.sh: test batch size zero
  2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
                             ` (9 preceding siblings ...)
  2019-05-14 18:47           ` [PATCH v6 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
@ 2019-05-14 18:47           ` Derrick Stolee
  10 siblings, 0 replies; 89+ messages in thread
From: Derrick Stolee @ 2019-05-14 18:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, gitster, avarab, dstolee

The 'git multi-pack-index repack' command can take a batch size of
zero, which creates a new pack-file containing all objects in the
multi-pack-index. The first 'repack' command will create one new
pack-file, and an 'expire' command after that will delete the old
pack-files, as they no longer contain any referenced objects in the
multi-pack-index.

We must remove the .keep file that was added in the previous test
in order to expire that pack-file.

Also test that a 'repack' will do nothing if there is only one
pack-file.
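
The zero-batch-size rule can be summarized in a few lines. This sketch
paraphrases the behavior described above under a hypothetical name,
`include_all_packs`: every pack is marked for inclusion, and the caller
is told to skip the repack when fewer than two packs exist.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Mark all packs as included. Returns 1 when the repack should be
 * skipped because there are fewer than two packs to combine, else 0.
 */
int include_all_packs(unsigned char *include_pack, size_t num_packs)
{
	assert(include_pack || !num_packs);
	if (num_packs)
		memset(include_pack, 1, num_packs);
	return num_packs < 2;
}
```

With three packs the function returns 0 and a new pack would be written;
with a single pack it returns 1, matching the final step of the test
where the second 'repack --batch-size=0' leaves the pack directory
unchanged.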

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5319-multi-pack-index.sh | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c288901401..79bfaeafa9 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -527,5 +527,24 @@ test_expect_success 'expire respects .keep files' '
 	)
 '
 
+test_expect_success 'repack --batch-size=0 repacks everything' '
+	(
+		cd dup &&
+		rm .git/objects/pack/*.keep &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 2 idx-list &&
+		git multi-pack-index repack --batch-size=0 &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 3 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list &&
+		git multi-pack-index expire &&
+		ls -al .git/objects/pack/*idx >idx-list &&
+		test_line_count = 1 idx-list &&
+		git multi-pack-index repack --batch-size=0 &&
+		ls -al .git/objects/pack/*idx >new-idx-list &&
+		test_cmp idx-list new-idx-list
+	)
+'
 
 test_done
-- 
2.22.0.rc0



Thread overview: 89+ messages
2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
2018-12-11  1:35   ` Stefan Beller
2018-12-11  1:59     ` SZEDER Gábor
2018-12-11 12:32       ` Derrick Stolee
2018-12-10 18:06 ` [PATCH 2/5] midx: refactor permutation logic Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 3/5] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 4/5] multi-pack-index: prepare 'repack' verb Derrick Stolee via GitGitGadget
2018-12-11  1:54   ` Stefan Beller
2018-12-11 12:45     ` Derrick Stolee
2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2018-12-11  2:32   ` Stefan Beller
2018-12-11 13:00     ` Derrick Stolee
2018-12-12  7:40   ` Junio C Hamano
2018-12-13  4:23     ` Junio C Hamano
2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 1/7] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 4/7] midx: refactor permutation logic Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 5/7] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 7/7] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 1/9] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 4/9] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 5/9] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
2019-01-23 21:00       ` Jonathan Tan
2019-01-24 17:34         ` Derrick Stolee
2019-01-24 19:17           ` Derrick Stolee
2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2019-01-09 15:54       ` SZEDER Gábor
2019-01-10 18:05         ` Junio C Hamano
2019-01-23 22:13       ` Jonathan Tan
2019-01-24 17:36         ` Derrick Stolee
2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2019-01-09 15:56       ` SZEDER Gábor
2019-01-23 22:38       ` Jonathan Tan
2019-01-24 19:36         ` Derrick Stolee
2019-01-24 21:38           ` Jonathan Tan
2019-01-09 15:21     ` [PATCH v3 8/9] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2019-01-23 22:33       ` Jonathan Tan
2019-01-09 15:21     ` [PATCH v3 9/9] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
2019-01-17 15:27     ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee
2019-01-23 22:44     ` Jonathan Tan
2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 01/10] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 02/10] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 03/10] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 04/10] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 05/10] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2019-01-25 23:24         ` Josh Steadmon
2019-01-24 21:51       ` [PATCH v4 06/10] multi-pack-index: implement 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-24 21:52       ` [PATCH v4 08/10] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2019-01-26 17:10         ` Derrick Stolee
2019-01-27 22:50           ` Junio C Hamano
2019-01-24 21:52       ` [PATCH v4 09/10] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
2019-01-24 21:52       ` [PATCH v4 10/10] midx: add test that 'expire' respects .keep files Derrick Stolee via GitGitGadget
2019-01-24 22:14       ` [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index Jonathan Tan
2019-01-25 23:49       ` Josh Steadmon
2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 01/11] repack: refactor pack deletion for future use Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 04/11] midx: simplify computation of pack name lengths Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 08/11] midx: implement midx_repack() Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
2019-04-25  5:38         ` [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Junio C Hamano
2019-04-25 11:06           ` Derrick Stolee
2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 01/11] repack: refactor pack deletion for future use Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 04/11] midx: simplify computation of pack name lengths Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 08/11] midx: implement midx_repack() Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
