git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/17] cruft packs
@ 2021-11-29 22:25 Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                   ` (21 more replies)
  0 siblings, 22 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This series implements "cruft packs", a pack which stores accumulated
unreachable objects, along with a new ".mtimes" file which tracks each
object's last known modification time.

This idea was discussed recently-ish in [1], but the most thorough
discussion I could find is in [2]. The approach settled on in this
series is laid out in detail by the first patch.

For the uninitiated, cruft packs enable repositories to safely run
`git repack -Ad` by storing unreachable objects which have not yet
"aged out" in a separate pack. This prevents repositories from storing
a potentially large number of such objects as loose.
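
To make "aged out" concrete, here is a minimal standalone sketch of the
per-object expiry check that individual mtimes make possible. The function
names are hypothetical and not part of this series, and treating the cutoff
as a strict comparison is an assumption made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero if 'mtime' lies strictly before (now - grace).
 * The strict comparison is an assumption of this sketch.
 */
int sketch_is_expired(uint32_t mtime, uint32_t now, uint32_t grace)
{
	if (now < grace)
		return 0; /* cutoff would precede the epoch */
	return mtime < now - grace;
}

/* Count the objects that have left the grace period. */
size_t sketch_count_expired(const uint32_t *mtimes, size_t nr,
			    uint32_t now, uint32_t grace)
{
	size_t i, expired = 0;

	for (i = 0; i < nr; i++)
		if (sketch_is_expired(mtimes[i], now, grace))
			expired++;
	return expired;
}
```

Because each object carries its own timestamp, freshening one object only
moves that object back inside the grace period; the others can still expire
on schedule.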

This series is structured as follows:

  - The first patch describes the technical details of cruft packs.
  - The next five patches implement reading and writing the new
    `.mtimes` format.
  - The next five patches implement `git pack-objects --cruft`. The
    first four implement this mode when no grace period is specified,
    and the fifth patch adds support for the grace period.
  - The next five patches integrate cruft packs with `git repack`,
    including the new-ish `--geometric` mode.
  - The final patch handles object freshening for objects stored in a
    cruft pack.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
[2]: https://lore.kernel.org/git/E1SdhJ9-0006B1-6p@tytso-glaptop.cam.corp.google.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  23 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  95 ++++
 Documentation/technical/pack-format.txt |  22 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 306 ++++++++++-
 builtin/repack.c                        | 189 ++++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 139 +++++
 pack-mtimes.h                           |  16 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  20 +
 pack-write.c                            |  90 +++-
 pack.h                                  |   4 +
 packfile.c                              |  18 +-
 packfile.h                              |   1 +
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  53 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5327-pack-objects-cruft.sh           | 685 ++++++++++++++++++++++++
 33 files changed, 1757 insertions(+), 102 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5327-pack-objects-cruft.sh

-- 
2.34.1.25.gb3157a20e6


* [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 14:33   ` Derrick Stolee
  2021-12-04 22:20   ` Elijah Newren
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 95 +++++++++++++++++++++++++
 2 files changed, 96 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..bb54cce1b1
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,95 @@
+= Cruft packs
+
+Cruft packs offer an alternative to Git's traditional mechanism of removing
+unreachable objects. This document provides an overview of Git's pruning
+mechanism, and how cruft packs can be used instead to accomplish the same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. Given enough unreachable objects, this can even lead to inode
+starvation and degrade the performance of the whole system. Because those
+objects can never be packed, these repositories often take up a large amount of
+disk space: loose objects can only be zlib-compressed, not stored in delta
+chains.
+
+== Cruft packs
+
+Cruft packs are designed to eliminate the need for storing unreachable objects
+in a loose state by including the per-object mtimes in a separate file alongside
+a single pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, via
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which (a) is not contained in a
+     kept-pack, and (b) has an mtime within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Whether cruft packs should be incremental or not.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Incremental cruft packs (i.e., where each time a repository is repacked a new
+cruft pack is generated containing only the unreachable objects introduced since
+the last time a cruft pack was written) are significantly more complicated to
+construct, and so aren't pursued here. The obvious drawback to the current
+implementation is that the entire cruft pack must be re-written from scratch.
-- 
2.34.1.25.gb3157a20e6



* [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:06   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.
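
As a rough illustration of the layout described above, here is a minimal
standalone sketch that validates a `.mtimes` header and reads one entry from
an in-memory buffer. The `sketch_*` names are hypothetical and are not part
of this patch, the trailer checksums are not verified, and the size
arithmetic assumes a 64-bit `size_t` (the patch itself uses `st_mult()` to
guard against overflow).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_MTIMES_HEADER_SIZE 12 /* signature, version, hash id */

/* Decode a 4-byte big-endian (network order) integer. */
uint32_t sketch_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return 0 if 'buf' looks like a version-1 .mtimes file holding
 * 'num_objects' entries, -1 otherwise. 'hash_rawsz' is the raw hash
 * size (20 for SHA-1, 32 for SHA-256); the trailer holds two hashes.
 */
int sketch_mtimes_check(const unsigned char *buf, size_t len,
			uint32_t num_objects, size_t hash_rawsz)
{
	size_t min_size = SKETCH_MTIMES_HEADER_SIZE + 2 * hash_rawsz;
	uint32_t hash_id;

	if (len < min_size)
		return -1;
	if (len - min_size != (size_t)num_objects * 4)
		return -1;
	if (sketch_read_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (sketch_read_be32(buf + 4) != 1)
		return -1;
	hash_id = sketch_read_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;
	return 0;
}

/* Read the mtime at index position 'pos' (caller checks bounds). */
uint32_t sketch_mtimes_nth(const unsigned char *buf, uint32_t pos)
{
	return sketch_read_be32(buf + SKETCH_MTIMES_HEADER_SIZE +
				(size_t)pos * 4);
}
```

A caller would run `sketch_mtimes_check()` once after mapping the file, then
use `sketch_mtimes_nth()` with the same position it uses for the pack index.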

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  22 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 139 ++++++++++++++++++++++++
 pack-mtimes.h                           |  16 +++
 packfile.c                              |  18 ++-
 packfile.h                              |   1 +
 8 files changed, 200 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8d2f42f29e..61d8d960e7 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,28 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of mtimes (one per packed object, num_objects in total, each
+    a 4-byte unsigned integer in network order), in the same order as
+    objects appear in the index file (e.g., the first entry in the mtime
+    table corresponds to the object with the lowest lexically-sorted
+    oid). The mtimes count standard epoch seconds.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 12be39ac49..efd5e00717 100644
--- a/Makefile
+++ b/Makefile
@@ -949,6 +949,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index 0b2d1e5d82..acbb7b8c3b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 952efb6a4b..d87481f101 100644
--- a/object-store.h
+++ b/object-store.h
@@ -89,12 +89,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..4c7c00fa67
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,139 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+int pack_has_mtimes(struct packed_git *p)
+{
+	struct stat st;
+	char *fname = pack_mtimes_filename(p);
+	int ret = 1;
+
+	if (stat(fname, &st) < 0) {
+		if (errno != ENOENT)
+			die_errno(_("could not stat %s"), fname);
+		ret = 0; /* no .mtimes file; still free fname below */
+	}
+	free(fname);
+	return ret;
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (ntohl(*++hdr) != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+	hdr++;
+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+	if (fd >= 0)
+		close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+	if (ret)
+		goto cleanup;
+
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..ac4247bb5e
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,16 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int pack_has_mtimes(struct packed_git *p);
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 89402cfc69..ae79ac644e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -333,12 +333,21 @@ void close_pack_revindex(struct packed_git *p) {
 	p->revindex_data = NULL;
 }
 
+void close_pack_mtimes(struct packed_git *p) {
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -362,7 +371,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -717,6 +726,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -868,7 +881,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
diff --git a/packfile.h b/packfile.h
index 186146779d..32201d8af7 100644
--- a/packfile.h
+++ b/packfile.h
@@ -91,6 +91,7 @@ uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
 void close_pack_revindex(struct packed_git *);
+void close_pack_mtimes(struct packed_git *p);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
-- 
2.34.1.25.gb3157a20e6



* [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry structs themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.
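
The layout choice can be sketched as follows: a side array that is only
allocated when a cruft pack is being written, indexed by an entry's position
in the object array. The struct and function names below are simplified,
hypothetical stand-ins rather than the actual `packing_data` definitions, and
making the setter a no-op when the array is absent is an assumption of this
sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for 'struct object_entry'; no mtime field here. */
struct sketch_entry {
	unsigned long size;
};

/* Simplified stand-in for 'struct packing_data'. */
struct sketch_packing_data {
	struct sketch_entry *objects;
	uint32_t nr_objects;
	uint32_t *cruft_mtime; /* NULL unless writing a cruft pack */
};

/* Allocate the zeroed side array (assumes nr_objects > 0). */
int sketch_enable_cruft_mtimes(struct sketch_packing_data *pack)
{
	pack->cruft_mtime = calloc(pack->nr_objects, sizeof(uint32_t));
	return pack->cruft_mtime ? 0 : -1;
}

/* Record an mtime for entry 'e'; a no-op when the array is absent. */
void sketch_set_mtime(struct sketch_packing_data *pack,
		      struct sketch_entry *e, uint32_t mtime)
{
	if (pack->cruft_mtime)
		pack->cruft_mtime[e - pack->objects] = mtime;
}

/* Look up the mtime for entry 'e'; 0 when the array is absent. */
uint32_t sketch_get_mtime(const struct sketch_packing_data *pack,
			  const struct sketch_entry *e)
{
	if (!pack->cruft_mtime)
		return 0;
	return pack->cruft_mtime[e - pack->objects];
}
```

Operations that never write a cruft pack leave `cruft_mtime` as NULL and pay
only for one pointer, instead of four extra bytes per object entry.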

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1a3dd445f8..bf45ffbc57 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.34.1.25.gb3157a20e6



* [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (2 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:22   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index files, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting: the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required.)
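
A small sketch of that widening: a helper returning `uint8_t` can feed a
4-byte big-endian field directly, since the value is converted to the wider
parameter type before it is serialized. `sketch_oid_version` and
`sketch_put_be32` are hypothetical stand-ins, not Git's functions, and the
sketch returns 0 for an unknown algorithm where the real helper dies.

```c
#include <stdint.h>

enum sketch_hash { SKETCH_SHA1, SKETCH_SHA256 };

/* Map a hash algorithm to its on-disk identifier, or 0 if unknown. */
uint8_t sketch_oid_version(enum sketch_hash algo)
{
	switch (algo) {
	case SKETCH_SHA1:
		return 1;
	case SKETCH_SHA256:
		return 2;
	default:
		return 0; /* the helper in this patch calls die() instead */
	}
}

/* Store a 32-bit value in network (big-endian) byte order. */
void sketch_put_be32(unsigned char *out, uint32_t value)
{
	out[0] = (unsigned char)(value >> 24);
	out[1] = (unsigned char)(value >> 16);
	out[2] = (unsigned char)(value >> 8);
	out[3] = (unsigned char)value;
}
```

Calling `sketch_put_be32(out, sketch_oid_version(SKETCH_SHA256))` writes the
bytes 00 00 00 02; no explicit cast is needed at the call site.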

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 2706683acf..1f08152a35 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1908,7 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 8433086ac1..756ae6a206 100644
--- a/midx.c
+++ b/midx.c
@@ -40,18 +40,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -131,9 +119,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -413,7 +401,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.34.1.25.gb3157a20e6



* [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (3 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:36   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag (`WRITE_MTIMES`) and passing a `struct
packing_data` whose `cruft_mtime` array holds the timestamp
corresponding to each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
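Note (not part of the commit): a minimal standalone sketch of the byte
layout that write_mtimes_file() appears to produce, as read off the diff
below: three big-endian 32-bit header words (signature, version, hash
id), one big-endian 32-bit mtime per object in index order, then the
pack checksum. The file checksum that the diff delegates to
finalize_hashfile() is left out. put_be32_sketch(),
mtimes_serialize_sketch() and the SKETCH_* values are hypothetical
names and placeholders for illustration, not Git's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only; not Git's real constants. */
#define SKETCH_SIGNATURE 0x12345678u
#define SKETCH_VERSION 1u

/* Store v at p in network (big-endian) byte order. */
static void put_be32_sketch(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Serialize the header, the per-object mtimes and the pack checksum
 * into out. Returns the number of bytes written, or 0 if out is too
 * small to hold them.
 */
static size_t mtimes_serialize_sketch(unsigned char *out, size_t out_len,
				      uint32_t hash_id,
				      const uint32_t *mtimes, uint32_t nr,
				      const unsigned char *pack_hash,
				      size_t hash_len)
{
	size_t need = 12 + (size_t)nr * 4 + hash_len;
	size_t pos = 0;
	uint32_t i;

	assert(out && pack_hash);
	if (out_len < need)
		return 0;

	/* header: signature, version, hash id */
	put_be32_sketch(out + pos, SKETCH_SIGNATURE);
	pos += 4;
	put_be32_sketch(out + pos, SKETCH_VERSION);
	pos += 4;
	put_be32_sketch(out + pos, hash_id);
	pos += 4;

	/* one mtime per object, in the same order as the pack index */
	for (i = 0; i < nr; i++) {
		put_be32_sketch(out + pos, mtimes[i]);
		pos += 4;
	}

	/* trailer: checksum of the pack this file belongs to */
	memcpy(out + pos, pack_hash, hash_len);
	pos += hash_len;

	return pos;
}
```

With a 20-byte hash and two objects, the sketch writes 12 + 8 + 20
bytes; a reader can locate the mtime of the object at index position i
at byte offset 12 + 4 * i.
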
 pack-objects.c |  6 ++++
 pack-objects.h | 20 ++++++++++++++
 pack-write.c   | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 101 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..f17119de26 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,9 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/* cruft packs */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +292,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..8c3efda2c3 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,65 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +541,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +554,19 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+		if (adjust_shared_perm(mtimes_tmp_name))
+			die_errno("unable to make temporary mtimes file readable");
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.34.1.25.gb3157a20e6


* [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (4 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-06 21:16   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

In an upcoming patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 53 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 56 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index efd5e00717..a7382cbfc1 100644
--- a/Makefile
+++ b/Makefile
@@ -721,6 +721,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..b143f62520
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,53 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static int dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+
+	return 0;
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	return p ? dump_mtimes(p) : 1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 3ce5585e53..1bb1c4b562 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -46,6 +46,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 9f0f522850..07a2d3f94e 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -35,6 +35,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.34.1.25.gb3157a20e6


* [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry()
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (5 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bf45ffbc57..3fb10529ba 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.34.1.25.gb3157a20e6


* [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (6 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-06 21:44   ` Derrick Stolee
  2021-12-07 15:17   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain (we'll call these the
    surviving packs) are marked as kept in-core, as well as any packs
    that `pack-objects` found but the caller did not specify.

    The latter are presumed to have entered the repository between
    the caller collecting packs and invoking `pack-objects`. Since we
    do not want to include objects in these packs (because we don't know
    which of their objects are or aren't reachable), these are also
    marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
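Note (not part of the commit): a minimal standalone sketch of the two
rules described above, assuming the stdin convention from the
documentation hunk below (a leading `-` names a pack that is about to
be removed; any other line names a pack that survives) and the "newest
mtime wins" rule applied when the same object is seen more than once.
classify_pack_line_sketch() and merge_mtime_sketch() are hypothetical
helpers, not functions from this patch.

```c
#include <assert.h>
#include <stdint.h>

enum pack_fate_sketch {
	PACK_SURVIVES,  /* treated as kept in-core; its objects are skipped */
	PACK_DISCARDED  /* about to be removed; its objects may become cruft */
};

/*
 * Classify one line of input. On return, *name points at the pack name
 * with any leading '-' stripped.
 */
static enum pack_fate_sketch classify_pack_line_sketch(const char *line,
							const char **name)
{
	assert(line && name);
	if (line[0] == '-') {
		*name = line + 1;
		return PACK_DISCARDED;
	}
	*name = line;
	return PACK_SURVIVES;
}

/*
 * When an object shows up in several places (loose, or in more than
 * one pack), keep the most recent mtime seen so far.
 */
static uint32_t merge_mtime_sketch(uint32_t known, uint32_t seen)
{
	return seen > known ? seen : known;
}
```

Starting every object at mtime 0 and folding each sighting through
merge_mtime_sketch() mirrors the comparison at the end of
add_cruft_object_entry() in the diff below.
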
 Documentation/git-pack-objects.txt |  23 +++
 builtin/pack-objects.c             | 203 ++++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5327-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 442 insertions(+), 6 deletions(-)
 create mode 100755 t/t5327-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index dbfd1f9017..573c18afcd 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | base-name]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < object-list
 
@@ -95,6 +96,28 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Pack names provided over
+	stdin indicate which packs will remain after a `git repack`.
+	Pack names prefixed with a `-` indicate those which will be
+	removed. The contents of the cruft pack are all objects not
+	contained in the surviving packs which have not exceeded the
+	grace period (see `--cruft-expiration` below), or which have
+	exceeded the grace period, but are reachable from another
+	object which hasn't.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3fb10529ba..b12e79e4b1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				  struct packed_git *pack, off_t offset,
+				  const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = name && no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return 0;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return 0;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return 1;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4060,7 +4242,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4087,6 +4269,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4143,7 +4334,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4151,7 +4342,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
-	} else if (!use_internal_rev_list)
+	} else if (cruft)
+		read_cruft_objects();
+	else if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
 		get_object_list(rp.nr, rp.v);
diff --git a/object-file.c b/object-file.c
index c3d866a287..7ddb38b64a 100644
--- a/object-file.c
+++ b/object-file.c
@@ -956,7 +956,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index d87481f101..a79c1c91ab 100644
--- a/object-store.h
+++ b/object-store.h
@@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
new file mode 100755
index 0000000000..543a80e9bf
--- /dev/null
+++ b/t/t5327-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			rm -fr .git/logs &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			rm -fr .git/logs &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (7 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration, but it
is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller;

  - in the case of packed objects, a way to report the originating
    pack, as well as the offset within that pack at which the object
    can be found, since our future caller will want both; and

  - a way to skip over packs which are marked as kept in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in-core. This option will be exercised
implicitly by a future patch.
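
As a rough illustration of the callback shape described above, here is a
minimal, self-contained sketch. Everything prefixed with `toy_` is
hypothetical and only stands in for Git's real types and functions; the
cutoff comparison is likewise an assumption of this sketch. A loose
object is reported with a NULL pack and a zero offset, and a NULL
callback is tolerated, mirroring how the existing callers pass
`NULL, 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Git's types; not the real definitions. */
struct toy_object { int id; };
struct toy_pack { const char *name; };

typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack,
			   int64_t offset, uint64_t mtime);

struct toy_recent {
	const struct toy_object *obj;
	struct toy_pack *pack;	/* NULL for a loose object */
	int64_t offset;		/* 0 for a loose object */
	uint64_t mtime;
};

static size_t toy_packed_seen;

/* Example callback: count only the objects that came from a pack. */
static void toy_count_packed(const struct toy_object *obj,
			     struct toy_pack *pack,
			     int64_t offset, uint64_t mtime)
{
	(void)obj;
	(void)offset;
	(void)mtime;
	if (pack)
		toy_packed_seen++;
}

/*
 * Report every entry newer than the cutoff through cb (which may be
 * NULL) and return how many entries were considered recent.
 */
static size_t toy_report_recent(const struct toy_recent *entries, size_t nr,
				uint64_t cutoff, toy_report_fn *cb)
{
	size_t i, reported = 0;

	assert(entries || !nr);
	for (i = 0; i < nr; i++) {
		if (entries[i].mtime <= cutoff)
			continue;	/* too old to count as recent */
		if (cb)
			cb(entries[i].obj, entries[i].pack,
			   entries[i].offset, entries[i].mtime);
		reported++;
	}
	return reported;
}
```

A caller that only wants the traversal side effects can pass NULL for
the callback, while a caller that needs per-object data supplies a
function such as `toy_count_packed`.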

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b12e79e4b1..2c592d369a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (8 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
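
The lazy-loading behavior noted above can be sketched roughly as
follows. This is a simplified model, not the code in this patch: the
`toy_` names, the in-memory "on disk" table, and the `loads` counter are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a pack; not Git's struct packed_git. */
struct toy_pack {
	uint64_t pack_mtime;		/* mtime of the pack file itself */
	int is_cruft;			/* set when a .mtimes file exists */
	const uint32_t *mtimes_on_disk;	/* stand-in for the .mtimes contents */
	size_t nr_objects;
	const uint32_t *mtimes;		/* NULL until first loaded */
	int loads;			/* how often the table was really loaded */
};

/* Load the per-object table once; later calls return early. */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (p->mtimes)
		return 0;	/* already opened and parsed */
	if (!p->mtimes_on_disk)
		return -1;	/* nothing to load */
	p->mtimes = p->mtimes_on_disk;
	p->loads++;
	return 0;
}

/*
 * Default to the pack's own mtime, and only consult the per-object
 * table when the pack is a cruft pack.
 */
static int toy_object_mtime(struct toy_pack *p, size_t pos, uint64_t *out)
{
	assert(pos < p->nr_objects);
	*out = p->pack_mtime;
	if (p->is_cruft) {
		if (toy_load_mtimes(p) < 0)
			return -1;
		*out = p->mtimes[pos];
	}
	return 0;
}
```

Calling `toy_object_mtime()` for every object in a cruft pack pays the
loading cost only on the first call, which is the property the first
bullet point relies on.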

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (9 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-07 15:30   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a
`--cruft-expiration` value other than "never" is passed. This case is
slightly more complicated than before, because we want pack-objects to
save an unreachable object which would otherwise have been pruned
whenever some other recent (i.e., non-prunable) unreachable object
reaches it. We'll call these objects "unreachable but
reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
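
To make the three steps above concrete, here is a small sketch over a
toy object graph. It is only a model under stated assumptions: the
`toy_` names, the fixed-size arrays, and the index-based links are all
hypothetical, and the real implementation drives Git's revision walk
rather than a recursive function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_OBJECTS 16
#define TOY_MAX_LINKS 2

/* Hypothetical object: an mtime, a kept-pack flag, and outgoing links. */
struct toy_object {
	uint64_t mtime;
	int in_kept_pack;
	int links[TOY_MAX_LINKS];	/* indices of referenced objects, -1 if unused */
};

static void toy_walk(const struct toy_object *objs, int idx,
		     int *visited, int *packed)
{
	int i;

	if (idx < 0 || visited[idx])
		return;
	visited[idx] = 1;
	if (objs[idx].in_kept_pack)
		return;		/* stop: kept packs survive the repack */
	packed[idx] = 1;	/* no-op for tips, a rescue for old objects */
	for (i = 0; i < TOY_MAX_LINKS; i++)
		toy_walk(objs, objs[idx].links[i], visited, packed);
}

/* Fill packed[] with 0/1 per object and return how many were packed. */
static size_t toy_collect_cruft(const struct toy_object *objs, size_t nr,
				uint64_t cutoff, int *packed)
{
	int visited[TOY_MAX_OBJECTS] = { 0 };
	int is_tip[TOY_MAX_OBJECTS] = { 0 };
	size_t i, count = 0;

	assert(nr <= TOY_MAX_OBJECTS);
	/* Step 1: only recent objects outside kept packs seed the list. */
	for (i = 0; i < nr; i++) {
		is_tip[i] = !objs[i].in_kept_pack && objs[i].mtime > cutoff;
		packed[i] = is_tip[i];
	}
	/* Steps 2 and 3: walk from those tips, adding what we visit. */
	for (i = 0; i < nr; i++)
		if (is_tip[i])
			toy_walk(objs, (int)i, visited, packed);
	for (i = 0; i < nr; i++)
		count += (size_t)packed[i];
	return count;
}
```

In this model an old tree or blob reachable from a recent commit ends up
in `packed[]` even though its own mtime is past the cutoff, while an old
object that nothing recent reaches is left out and an object behind a
kept pack is never visited.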

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 t/t5327-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2c592d369a..a38fa34479 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static int add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return 1;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * If we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * If obj does appear in the packing list, this call is a no-op (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 543a80e9bf..31d4a561fe 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (10 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-05 20:46   ` Junio C Hamano
  2021-12-07 15:38   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                   ` (9 subsequent siblings)
  21 siblings, 2 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.
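
For readers unfamiliar with the input that `git pack-objects --cruft`
reads on stdin, the following sketch formats it in the same way the
tests in this series do: packs to keep are listed by name, and packs
whose objects may be discarded are prefixed with `-`. The function name
and the buffer-based interface are assumptions of this sketch; the patch
itself writes the lines to the child process with fprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write one "<name>.pack" line per kept pack and one "-<name>.pack"
 * line per pack to discard. Returns the number of bytes written, or
 * -1 if buf is too small.
 */
static int toy_cruft_input(char *buf, size_t len,
			   const char **kept, size_t nr_kept,
			   const char **discard, size_t nr_discard)
{
	size_t i, used = 0;
	int n;

	assert(buf && len > 0);
	buf[0] = '\0';
	for (i = 0; i < nr_kept + nr_discard; i++) {
		int is_kept = i < nr_kept;
		const char *name = is_kept ? kept[i] : discard[i - nr_kept];

		n = snprintf(buf + used, len - used, "%s%s.pack\n",
			     is_kept ? "" : "-", name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;
	}
	return (int)used;
}
```

The pack names used when calling this helper (for example `pack-aaaa`)
are placeholders; real names carry the pack's checksum.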

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 112 ++++++++++++++++-
 t/t5327-pack-objects-cruft.sh           | 153 ++++++++++++++++++++++++
 4 files changed, 272 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 7183fb498f..4f8f4b5a1f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index bb54cce1b1..b7daad2e3e 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -16,7 +16,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index acbb7b8c3b..68b4bdf06f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -298,9 +305,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -337,6 +341,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -598,6 +604,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -614,7 +681,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -630,6 +696,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT | ALL_INTO_ONE),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -681,6 +752,14 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (keep_unreachable &&
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("--keep-unreachable and -A are incompatible"));
+	if (pack_everything & PACK_CRUFT && delete_redundant) {
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("--cruft and -A are incompatible"));
+		if (keep_unreachable)
+			die(_("--cruft and -k are incompatible"));
+		if (!(pack_everything & ALL_INTO_ONE))
+			die(_("--cruft must be combined with all-into-one"));
+	}
 
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
@@ -763,7 +842,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -798,6 +878,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		return ret;
 
 	if (geometry) {
+		struct packed_git *p;
 		FILE *in = xfdopen(cmd.in, "w");
 		/*
 		 * The resulting pack should contain all objects in packs that
@@ -808,6 +889,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
 		for (i = geometry->split; i < geometry->pack_nr; i++)
 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+
+		for (p = get_all_packs(the_repository); p; p = p->next) {
+			if (!p->is_cruft)
+				continue;
+			fprintf(in, "^%s\n", pack_basename(p));
+		}
 		fclose(in);
 	}
 
@@ -825,6 +912,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 31d4a561fe..ed1a113ab6 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		rm -fr .git/logs &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		rm -fr .git/logs &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (11 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
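
As a rough illustration of the fallback (a sketch only, not the code in
this patch; `po_args_sketch`, `pick()` and `resolve_cruft_args()` are
names made up for the example), a cruft-specific value is used when it
is set, and the general pack-objects value is used otherwise:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for 'struct pack_objects_args'. */
struct po_args_sketch {
	const char *window;
	const char *depth;
};

/* Use the cruft-specific string if set, else the general one. */
static const char *pick(const char *cruft, const char *general)
{
	return cruft ? cruft : general;
}

/* Fill every unset cruft parameter from the general parameters. */
static void resolve_cruft_args(struct po_args_sketch *cruft,
			       const struct po_args_sketch *general)
{
	assert(cruft && general);
	cruft->window = pick(cruft->window, general->window);
	cruft->depth = pick(cruft->depth, general->depth);
}
```

With `repack.cruftWindow=2` and `--window=3`, this sketch leaves the
cruft window at "2"; with no cruft-specific setting, it becomes "3".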

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5327-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index 68b4bdf06f..cefa906344 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -687,6 +696,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -741,7 +751,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -920,7 +930,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index ed1a113ab6..750e9d6d6f 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -511,4 +511,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 14/17] builtin/repack.c: use named flags for existing_packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (12 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
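
A minimal sketch of the idea, assuming a simplified item type
(`item_sketch`, `set_flag()` and `has_flag()` are hypothetical names,
not part of this patch): the `util` pointer is treated as a small
bitfield, so more than one state can be recorded per pack.

```c
#include <assert.h>
#include <stdint.h>

#define DELETE_PACK 1	/* same bit value as in the patch */

/* Hypothetical stand-in for 'struct string_list_item'. */
struct item_sketch {
	const char *string;
	void *util;
};

/* OR a flag into the bits stored in the util pointer. */
static void set_flag(struct item_sketch *item, uintptr_t flag)
{
	assert(item);
	item->util = (void *)((uintptr_t)item->util | flag);
}

/* Test a single flag without disturbing the others. */
static int has_flag(const struct item_sketch *item, uintptr_t flag)
{
	return ((uintptr_t)item->util & flag) != 0;
}
```

Testing with `& DELETE_PACK` rather than checking `util` for non-NULL
is what keeps the check correct once a second bit is in use.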

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index cefa906344..cd4d789d27 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -559,7 +561,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1002,7 +1004,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1026,7 +1029,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (13 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when
doing a geometric repack. Since the "make that object reachable" step
doesn't necessarily mean that we'll create a new copy of that object in
one of the packs that will get rolled up as part of a geometric repack,
it's possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5327-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index cd4d789d27..5a201063e7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -559,6 +563,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 750e9d6d6f..857f9e8855 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -594,4 +594,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (14 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5327-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them individually as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index bcef6a4c8d..c16cef0285 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -42,6 +42,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -152,6 +153,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -331,7 +333,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -550,6 +556,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -668,6 +675,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 857f9e8855..4cd0f0cf57 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		rm -fr .git/logs &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH 17/17] sha1-file.c: don't freshen cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (15 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file
which may incur some wasted effort especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, force the freshening code to avoid an optimizing write by
writing out the object loose and letting it pick up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
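
A simplified sketch of the resulting decision (not the real
object-file.c code; `pack_sketch`, `freshen_packed_sketch()` and
`needs_loose_write()` are hypothetical, and freshening an existing
loose copy is assumed to always succeed):

```c
/* Hypothetical stand-in for the relevant bits of 'struct packed_git'. */
struct pack_sketch {
	int is_cruft;
	int freshened;
};

/* Returns 1 if the packed copy was freshened, 0 otherwise. */
static int freshen_packed_sketch(struct pack_sketch *p)
{
	if (!p)
		return 0;	/* object is not in any pack */
	if (p->is_cruft)
		return 0;	/* never freshen via a cruft pack */
	p->freshened = 1;	/* stands in for bumping the pack's mtime */
	return 1;
}

/* Returns 1 if the object has to be written out as a new loose file. */
static int needs_loose_write(struct pack_sketch *p, int has_loose)
{
	if (freshen_packed_sketch(p) || has_loose)
		return 0;
	return 1;
}
```

For an object that only lives in a cruft pack, `needs_loose_write()`
returns 1, so the object picks up a current mtime as a loose file.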

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5327-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index 7ddb38b64a..dddc1bdd2c 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1946,6 +1946,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 4cd0f0cf57..ff87701bbf 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -657,4 +657,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2021-12-02 14:33   ` Derrick Stolee
  2021-12-03 21:53     ` Taylor Blau
  2021-12-04 22:20   ` Elijah Newren
  1 sibling, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-02 14:33 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Whether cruft packs should be incremental or not.

It was not obvious from this sentence that "incremental" meant that
we could store a number of cruft packs and use the mtime of each pack
as the time for all contained objects.

> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Incremental cruft packs (i.e., where each time a repository is repacked a new
> +cruft pack is generated containing only the unreachable objects introduced since
> +the last time a cruft pack was written) are significantly more complicated to
> +construct, and so aren't pursued here. The obvious drawback to the current
> +implementation is that the entire cruft pack must be re-written from scratch.

But you seem to be pointing that direction here. The difference being
that you don't discuss how a list of cruft packs could avoid the .mtimes
file.

I think what is hidden underneath "significantly more complicated to
construct" are situations such as "this object was in an old cruft
pack, but then became reachable, but now is unreachable again". I'll
try to remember to come back to this after seeing the situations you
cover in your tests.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2021-12-02 15:06   ` Derrick Stolee
  2021-12-02 22:32     ` brian m. carlson
  2021-12-03 22:24     ` Taylor Blau
  0 siblings, 2 replies; 200+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:06 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso, brian m. carlson

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +== pack-*.mtimes files have the format:
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

I vaguely remember complaints about using a 1-byte identifier in
the commit-graph and multi-pack-index formats because the "standard"
way to refer to these hash functions was a magic number that had a
meaning in ASCII that helped human readers a bit. I cannot find an
example of such 4-byte identifiers, but perhaps brian (CC'd) could
remind us.

You are using a 4-byte identifier, but using the same values as
those 1-byte identifiers.

> +  - A table of mtimes (one per packed object, num_objects in total, each
> +    a 4-byte unsigned integer in network order), in the same order as
> +    objects appear in the index file (e.g., the first entry in the mtime
> +    table corresponds to the object with the lowest lexically-sorted
> +    oid). The mtimes count standard epoch seconds.

This paragraph seemed awkward. Here is a rephrasing that might be
less awkward:

 - A table of 4-byte unsigned integers in network order. The ith value
   is the modified time (mtime) of the ith object of the corresponding
   pack in lexicographic order. The mtime represents standard epoch
   seconds.

Storing these mtimes in 32-bits means we will hit the 2038 problem.
The commit-graph stores commit times with an extra two bits to extend
the lifetime by another hundred years or so.

Could we extend the lifetime of cruft packs by decreasing the granularity
here? Should 'mtime' store a number of _minutes_ instead of seconds? That
should be enough granularity for these purposes.
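
To put rough numbers on the trade-off, here is a small arithmetic
sketch (the helper name and the use of an average Gregorian year are
assumptions made for the example) of how many years an unsigned 32-bit
counter covers for a given unit:

```c
#include <stdint.h>

/* Average Gregorian year: 365.2425 days of 86400 seconds each. */
#define SECONDS_PER_YEAR 31556952ULL

/*
 * Whole years representable by an unsigned 32-bit counter whose unit
 * is 'unit_seconds' seconds long.
 */
static uint64_t u32_range_in_years(uint64_t unit_seconds)
{
	return ((uint64_t)UINT32_MAX * unit_seconds) / SECONDS_PER_YEAR;
}
```

Under these assumptions, `u32_range_in_years(1)` is 136 and
`u32_range_in_years(60)` is 8166, counted from the epoch.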

> +  - A trailer, containing a:
> +
> +    checksum of the corresponding packfile, and
> +
> +    a checksum of all of the above.

Could you specify the checksum as having length according to the
specified hash function?

> +All 4-byte numbers are in network order.
> +

Maybe this could be at the start of the format, since the file
version and hash function are both 4-byte numbers here and we
could remove the mention of network order from the mtime values.

> +static char *pack_mtimes_filename(struct packed_git *p)
> +{
> +	size_t len;
> +	if (!strip_suffix(p->pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> +}

I see your NEEDSWORK here and you are probably referring to this:

static char *pack_revindex_filename(struct packed_git *p)
{
	size_t len;
	if (!strip_suffix(p->pack_name, ".pack", &len))
		BUG("pack_name does not end in .pack");
	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
}

and the implementation is identical except for the new file suffix
(which exists in the exts[] array in builtin/repack.c, but could
also be pulled out into a header somewhere).

I'm happy to delay any cleanup of these code clones until later,
if at all, because doing it right might mean moving more code
than we like. Such refactorings aren't worth it most of the time.
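
For reference, a shared helper could look roughly like the sketch
below. It is written against plain libc rather than Git's
strip_suffix() and xstrfmt(), and the function name is hypothetical, so
treat it as an outline of the idea rather than a proposed patch:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: replace a trailing ".pack" in `pack_name` with
 * `ext` (for example ".rev" or ".mtimes"). Returns a malloc'd string
 * that the caller must free, or NULL if `pack_name` does not end in
 * ".pack" or the allocation fails.
 */
static char *sk_pack_filename(const char *pack_name, const char *ext)
{
	static const char suffix[] = ".pack";
	const size_t suffix_len = sizeof(suffix) - 1;
	size_t len = strlen(pack_name);
	size_t ext_len = strlen(ext);
	size_t base_len;
	char *out;

	if (len < suffix_len || strcmp(pack_name + len - suffix_len, suffix))
		return NULL;

	base_len = len - suffix_len;
	out = malloc(base_len + ext_len + 1);
	if (!out)
		return NULL;

	memcpy(out, pack_name, base_len);
	memcpy(out + base_len, ext, ext_len + 1); /* copies the NUL too */
	return out;
}
```

Both pack_revindex_filename() and pack_mtimes_filename() would then
reduce to one-line wrappers around such a helper.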

> +static int load_pack_mtimes_file(char *mtimes_file,
> +				 uint32_t num_objects,
> +				 const uint32_t **data_p, size_t *len_p)
> +{

> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);

This message could be more informative: "mtimes file %s has the wrong size"?
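
To spell out what "the wrong size" means, here is a small sketch of the
expected file size as implied by the format description: a 12-byte
header, one 4-byte entry per object, and a trailer of two checksums.
The value of MTIMES_MIN_SIZE is not quoted above, so the constants
below are assumptions drawn from that description, and the function
name is hypothetical:

```c
#include <stdint.h>

/* Assumed from the format text: signature, version and hash id. */
#define SK_MTIMES_HEADER_SIZE 12

/*
 * Expected size in bytes of a .mtimes file for `num_objects` objects
 * when each checksum is `hash_len` bytes long (20 for SHA-1, 32 for
 * SHA-256). Computed in 64 bits so the multiplication cannot overflow.
 */
static uint64_t sk_expected_mtimes_size(uint32_t num_objects,
					uint32_t hash_len)
{
	return SK_MTIMES_HEADER_SIZE +
	       (uint64_t)num_objects * sizeof(uint32_t) +
	       2 * (uint64_t)hash_len;
}
```

A reader would compare this value against the size reported by fstat()
before mapping the file.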

> +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> +		goto cleanup;
> +	}

Interesting that you defined 'struct mtimes_header' before this
method, but don't use it here (in favor of moving a uint32_t
pointer). Perhaps you are avoiding pointing the struct at the
memory map, but you could also do this:

	struct mtimes_header header;

	header.signature = ntohl(hdr[0]);
	header.version = ntohl(hdr[1]);
	header.hash_id = ntohl(hdr[2]);

And then operate on the struct for your validation.

At the very least, 'struct mtimes_header' is defined but not
used in this patch. If you decide to not use it this way, then
maybe delay its definition.

> +
> +	if (ntohl(*++hdr) != 1) {
> +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
> +			    mtimes_file, ntohl(*hdr));

Unlike the commit-graph, if we don't understand the version we
cannot simply ignore the data. error() is appropriate here.

> +int load_pack_mtimes(struct packed_git *p)
> +{
> +	char *mtimes_name = NULL;
> +	int ret = 0;
> +
> +	if (!p->is_cruft)
> +		return ret; /* not a cruft pack */

Interesting that this indicator is essentially "we have an mtimes
file for this pack", but it makes sense to include that check next
to the .keep and .promisor checks.

> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->mtimes_map)
> +		BUG("pack .mtimes file not loaded for %s", p->pack_name);
> +	if (p->num_objects <= pos)
> +		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
> +		    pos, p->num_objects);
> +
> +	return get_be32(p->mtimes_map + pos + 3);
> +}

A nice safe access method. Good.
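
As an illustration of what the "+ 3" is doing there (skipping the three
4-byte header words), here is a standalone sketch of the same lookup
over a plain byte buffer. It does not use Git's get_be32() or 'struct
packed_git'; the names are invented for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Header words: signature, version, hash function identifier. */
#define SK_MTIMES_HEADER_WORDS 3

/* Read a big-endian 32-bit value byte by byte (no alignment needed). */
static uint32_t sk_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Store the mtime of the object at index position `pos` in `*out` and
 * return 0, or return -1 if `pos` is out of bounds.
 */
static int sk_nth_mtime(const unsigned char *map, uint32_t num_objects,
			uint32_t pos, uint32_t *out)
{
	if (pos >= num_objects)
		return -1;
	*out = sk_read_be32(map + 4 * ((size_t)SK_MTIMES_HEADER_WORDS + pos));
	return 0;
}
```

Unlike the patch, this sketch reports an out-of-bounds position to the
caller instead of calling BUG().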

> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};

(Speaking of that refactoring earlier, here is a second definition of
exts[] that would be valuable to unify.)

The hunks I did not comment on look good. Nice standard file format
stuff.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2021-12-02 15:22   ` Derrick Stolee
  2021-12-03 22:40     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:22 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> There are three definitions of an identical function which converts
> `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
> copy of this function for writing both the commit-graph and
> multi-pack-index file, and another inline definition used to write the
> .rev header.
> 
> Consolidate these into a single definition in chunk-format.h. It's not
> clear that this is the best header to define this function in, but it
> should do for now.

Thanks for consolidating these!
 
> (Worth noting, the .rev caller expects a 4-byte unsigned, but the other
> two callers work with a single unsigned byte. The consolidated version
> uses the latter type, and lets the compiler widen it when required).
> 
> Another caller will be added in a subsequent patch.

>  chunk-format.c | 12 ++++++++++++
>  chunk-format.h |  3 +++
>  commit-graph.c | 18 +++---------------
>  midx.c         | 18 +++---------------
>  pack-write.c   | 15 ++-------------

I notice that you don't use this in load_pack_mtimes_file(),
in pack-mtimes.c but you could at this point.

The code you do touch looks good.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2021-12-02 15:36   ` Derrick Stolee
  2021-12-03 23:04     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:36 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> @@ -168,6 +168,9 @@ struct packing_data {
>  	/* delta islands */
>  	unsigned int *tree_depth;
>  	unsigned char *layer;
> +
> +	/* cruft packs */
> +	uint32_t *cruft_mtime;

This comment is a bit terse. Perhaps...

	/* Used when writing cruft packs. */

> +static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
> +				      struct object_entry *e)
> +{
> +	if (!pack->cruft_mtime)
> +		return 0;
> +	return pack->cruft_mtime[e - pack->objects];
> +}

When writing a pack, it appears that the cruft_mtime array
maps to objects in pack-order, not idx-order, correct? That
might be worth mentioning in the struct definition because
it differs from the .mtimes file.

> +static void write_mtimes_objects(struct hashfile *f,
> +				 struct packing_data *to_pack,
> +				 struct pack_idx_entry **objects,
> +				 uint32_t nr_objects)
> +{
> +	uint32_t i;
> +	for (i = 0; i < nr_objects; i++) {
> +		struct object_entry *e = (struct object_entry*)objects[i];
> +		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
> +	}

The name "objects" here confused me at first, thinking it
corresponded to the objects member of 'struct packing_data', but
that is being handled by the fact that 'objects' is actually a
lex-sorted list of pack_idx_entry pointers (and they happen to
also point to 'struct object_entry' values because the 'struct
pack_idx_entry' is the first member of 'struct object_entry').

So this is (very densely) handling the translation from pack-order
to lex-order through the double pointer 'objects'. I'm not sure if
there is a way to make it more clear or if every reader will need
to do the same mental gymnastics I had to do.
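
One way to see the translation is with a stripped-down model. The
structs below are simplified stand-ins for 'struct pack_idx_entry',
'struct object_entry' and 'struct packing_data' (all names are
hypothetical and most fields are omitted), so this only sketches the
pointer arithmetic involved:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for 'struct pack_idx_entry'; sort_key plays the oid's role. */
struct sk_idx_entry {
	uint32_t sort_key;
};

/* Stand-in for 'struct object_entry'; idx must stay the first member. */
struct sk_object_entry {
	struct sk_idx_entry idx;
};

/* Stand-in for 'struct packing_data'; both arrays are in pack order. */
struct sk_packing_data {
	struct sk_object_entry *objects;
	uint32_t *cruft_mtime;
};

/* Look up an entry's mtime via its position in the pack-order array. */
static uint32_t sk_cruft_mtime(const struct sk_packing_data *pack,
			       const struct sk_object_entry *e)
{
	if (!pack->cruft_mtime)
		return 0;
	return pack->cruft_mtime[e - pack->objects];
}

/*
 * Walk `sorted` (pointers in lexicographic order) and emit mtimes in
 * that order. The cast is valid because idx is the first member.
 */
static void sk_collect_mtimes(const struct sk_packing_data *pack,
			      struct sk_idx_entry **sorted, uint32_t nr,
			      uint32_t *out)
{
	uint32_t i;
	for (i = 0; i < nr; i++) {
		const struct sk_object_entry *e =
			(const struct sk_object_entry *)sorted[i];
		out[i] = sk_cruft_mtime(pack, e);
	}
}
```

The iteration order comes from the sorted pointer list, while the
lookup index comes from each entry's position in the pack-order array.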

> +}
> +
> +static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
> +{
> +	hashwrite(f, hash, the_hash_algo->rawsz);
> +}
> +
> +static const char *write_mtimes_file(const char *mtimes_name,
> +				     struct packing_data *to_pack,
> +				     struct pack_idx_entry **objects,
> +				     uint32_t nr_objects,
> +				     const unsigned char *hash)
> +{
> +	struct hashfile *f;
> +	int fd;
> +
> +	if (!to_pack)
> +		BUG("cannot call write_mtimes_file with NULL packing_data");
> +
> +	if (!mtimes_name) {
> +		struct strbuf tmp_file = STRBUF_INIT;
> +		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
> +		mtimes_name = strbuf_detach(&tmp_file, NULL);
> +	} else {
> +		unlink(mtimes_name);
> +		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +	}
> +	f = hashfd(fd, mtimes_name);
> +
> +	write_mtimes_header(f);
> +	write_mtimes_objects(f, to_pack, objects, nr_objects);
> +	write_mtimes_trailer(f, hash);
> +
> +	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
> +		die(_("failed to make %s readable"), mtimes_name);

What could cause 'mtimes_name' to be NULL here? It seems that it would
be initialized in the "if (!mtimes_name)" block above.

> +
> +	finalize_hashfile(f, NULL,
> +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
> +
> +	return mtimes_name;

Note that you return the name here...

> +	if (pack_idx_opts->flags & WRITE_MTIMES) {
> +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
> +						    nr_written,
> +						    hash);
> +		if (adjust_shared_perm(mtimes_tmp_name))
> +			die_errno("unable to make temporary mtimes file readable");

...and then adjust the perms again. I think that this adjustment is
redundant, because it already happened within the write_mtimes_file()
method.

> +	}
> +
>  	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
>  	if (rev_tmp_name)
>  		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
> +	if (mtimes_tmp_name)
> +		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");

And then it is finally renamed here, if it had a temporary name to
start.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-02 15:06   ` Derrick Stolee
@ 2021-12-02 22:32     ` brian m. carlson
  2021-12-03 22:24     ` Taylor Blau
  1 sibling, 0 replies; 200+ messages in thread
From: brian m. carlson @ 2021-12-02 22:32 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, gitster, larsxschneider, peff, tytso

On 2021-12-02 at 15:06:07, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> 
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> 
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
> 
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

The preferred value is the_hash_algo->format_id.  For SHA-1, that's
"sha1", big-endian (0x73686131) and for SHA-256 it's "s256", big-endian
(0x73323536).

There's also hash_algo_by_id to turn the format ID into an index into
the hash_algos array, but you need to check for GIT_HASH_UNKNOWN (0)
first.
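
As a quick check of those values, the sketch below packs four ASCII
characters into a big-endian 32-bit number. The function name is made
up for this example and is not part of Git:

```c
#include <stdint.h>

/*
 * Pack four ASCII characters into a 32-bit value, most significant
 * byte first, so the identifier reads as text in a hex dump.
 */
static uint32_t sk_fourcc_be(const char id[4])
{
	return ((uint32_t)(unsigned char)id[0] << 24) |
	       ((uint32_t)(unsigned char)id[1] << 16) |
	       ((uint32_t)(unsigned char)id[2] << 8) |
	       (uint32_t)(unsigned char)id[3];
}
```

With this, "sha1" maps to 0x73686131 and "s256" to 0x73323536, matching
the values above, and "MTME" maps to the 0x4d544d45 signature quoted
from the patch.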

These will be used in index v3, which I haven't sent out patches for
yet.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (16 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2021-12-03 19:51 ` Junio C Hamano
  2021-12-03 20:08   ` Taylor Blau
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2021-12-03 19:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, larsxschneider, peff, tytso

Taylor Blau <me@ttaylorr.com> writes:

> This series implements "cruft packs", a pack which stores accumulated
> unreachable objects, along with a new ".mtimes" file which tracks each
> object's last known modification time.

Let me rephrase the above to test my understanding, since I need to
write a summary for the "What's cooking" report.

 Instead of leaving unreachable objects in loose form when packing,
 or ejecting them into loose form when repacking, gather them in a
 packfile with an auxiliary file that records the last-use time of
 these objects.

That way, we do not have to waste so many inodes for loose objects
that are not likely to be used, which feels like a win.

>   - The final patch handles object freshening for objects stored in a
>     cruft pack.

I am not going to read it today, but I think this is the most
interesting part of the series.  Instead of using mtime of an
individual loose object file, we'd need to record the time of
last use for each object in a pack.

Stepping back a bit, I do not see how we can get away without doing
the same .mtimes file for non-cruft packs.  An object that is in a
non-cruft pack may be referenced immediately after the repack that
created the pack, but the ref that was referencing the object may
have gone away and now the pack is a month old.  If we were to
repack the object, we do not know when was the last time the object
was reachable from any of the refs and index entries (collectively
known as anchor points).  Of course, recording all mtimes for all
packed objects all the time would involve quite a lot of overhead.
I am guessing (I will not spend time today to figure it out myself)
that .mtimes update at runtime will happen in-place (i.e. via
seek(2)+write(2), or pwrite()), and I wonder what the safety concern
would be (which is the primary reason why we tend not to do in-place
updates but recreate-and-rename updates).

Thanks for working on such an interesting topic.

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 00/17] cruft packs
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
@ 2021-12-03 20:08   ` Taylor Blau
  2021-12-03 20:47     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 20:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This series implements "cruft packs", a pack which stores accumulated
> > unreachable objects, along with a new ".mtimes" file which tracks each
> > object's last known modification time.
>
> Let me rephrase the above to test my understanding, since I need to
> write a summary for the "What's cooking" report.
>
>  Instead of leaving unreachable objects in loose form when packing,
>  or ejecting them into loose form when repacking, gather them in a
>  packfile with an auxiliary file that records the last-use time of
>  these objects.

Exactly. Thanks for such a concise and accurate description of the
topic.

> That way, we do not have to waste so many inodes for loose objects
> that are not likely to be used, which feels like a win.

Yes. This had historically been a problem for GitHub. We don't
automatically prune unreachable objects during repacking, but sometimes
customers will ask us to do it on their behalf (if, for example, they
accidentally pushed sensitive information to us, and then force-pushed
over it).

But occasionally we'd get bitten by exploding many years of loose
objects (because we used to freshen packfiles too aggressively when
moving them around).

We've been running this series in production for the past few months,
and it's been a huge relief for the folks who typically run these pruning
GCs.

> >   - The final patch handles object freshening for objects stored in a
> >     cruft pack.
>
> I am not going to read it today, but I think this is the most
> interesting part of the series.  Instead of using mtime of an
> individual loose object file, we'd need to record the time of
> last use for each object in a pack.
>
> Stepping back a bit, I do not see how we can get away without doing
> the same .mtimes file for non-cruft packs.  An object that is in a
> non-cruft pack may be referenced immediately after the repack that
> created the pack, but the ref that was referencing the object may
> have gone away and now the pack is a month old.  If we were to
> repack the object, we do not know when was the last time the object
> was reachable from any of the refs and index entries (collectively
> known as anchor points).

In that situation, we would use the mtime of the pack which contains
that object itself as a proxy (or the mtime of a loose copy of the
object, if it is more recent).

That isn't perfect, as you note, since if the pack isn't otherwise
freshened, we'd consider that object to be a month old, even if the
reference pointing at it was deleted a mere second ago.

I can't recall if Peff and I talked about this off-list, but I have a
vague sense we probably did (and I forgot the details).

> Of course, recording all mtimes for all
> packed objects all the time would involve quite a lot of overhead.
> I am guessing (I will not spend time today to figure it out myself)
> that .mtimes update at runtime will happen in-place (i.e. via
> seek(2)+write(2), or pwrite()), and I wonder what the safety concern
> would be (which is the primary reason why we tend not to do in-place
> updates but recreate-and-rename updates).

Yeah, this series avoids doing an in-place update, and similarly avoids
recreating the entire .mtimes file before moving into place. Instead,
freshening an object stored in a cruft pack takes place by rewriting a
copy of the object loose, since we consider an object's mtime to be the
most recent of (a) what's in the .mtimes file, (b) the mtime of the
containing pack, and (c) the mtime of a loose copy (if one exists).
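
In other words, the rule is "take the newest of the three". A minimal
sketch of that rule, with hypothetical names and the made-up convention
that 0 means "this source does not exist":

```c
#include <stdint.h>

static uint64_t sk_max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Effective mtime of an object stored in a cruft pack:
 *   recorded: value from the .mtimes file,
 *   pack:     mtime of the containing pack,
 *   loose:    mtime of a loose copy, or 0 if there is none.
 */
static uint64_t sk_effective_mtime(uint64_t recorded, uint64_t pack,
				   uint64_t loose)
{
	return sk_max_u64(recorded, sk_max_u64(pack, loose));
}
```

Writing a fresh loose copy therefore raises the effective mtime without
touching the .mtimes file at all.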

It can be wasteful, but in practice "resurrecting" an object in a cruft
pack is pretty rare, so on balance it ends up costing less work to do.

> Thanks for working on such an interesting topic.

I'm glad to have piqued your interest.

Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 00/17] cruft packs
  2021-12-03 20:08   ` Taylor Blau
@ 2021-12-03 20:47     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 20:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Fri, Dec 03, 2021 at 03:08:00PM -0500, Taylor Blau wrote:
> On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> > Stepping back a bit, I do not see how we can get away without doing
> > the same .mtimes file for non-cruft packs.  An object that is in a
> > non-cruft pack may be referenced immediately after the repack that
> > created the pack, but the ref that was referencing the object may
> > have gone away and now the pack is a month old.  If we were to
> > repack the object, we do not know when was the last time the object
> > was reachable from any of the refs and index entries (collectively
> > known as anchor points).
>
> In that situation, we would use the mtime of the pack which contains
> that object itself as a proxy (or the mtime of a loose copy of the
> object, if it is more recent).
>
> That isn't perfect, as you note, since if the pack isn't otherwise
> freshened, we'd consider that object to be a month old, even if the
> reference pointing at it was deleted a mere second ago.
>
> I can't recall if Peff and I talked about this off-list, but I have a
> vague sense we probably did (and I forgot the details).

Maybe I can rephrase the problem as being orthogonal to what we're
addressing here. Modification time can be a useful-ish proxy for "last
referenced time", but they are ultimately different.

Forgetting cruft packs for a moment, our behavior today in that
situation would be to prune the object if our grace period did not cover
the time in which the pack was last modified. So if the pack was a month
old, the grace period was two weeks, but the reference pointing at some
object in that pack was deleted only a second before starting a pruning
GC, we'd prune that object before this series (just as we would do the
same thing with this series).
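
To make that example concrete, here is a tiny sketch of the grace
period check. The function name is hypothetical, and whether the
boundary comparison is strict is an assumption of this sketch, not
something taken from the code:

```c
#include <stdint.h>

/*
 * Return 1 if an object whose effective mtime is `mtime` falls outside
 * the grace period ending at `now`, and 0 otherwise. All values are in
 * epoch seconds.
 */
static int sk_is_prunable(uint64_t mtime, uint64_t now,
			  uint64_t grace_seconds)
{
	if (now < grace_seconds)
		return 0; /* the grace window reaches back past the epoch */
	return mtime < now - grace_seconds;
}
```

With a two-week grace period, an object whose only timestamp is a
month-old pack mtime is prunable, regardless of how recently the ref
pointing at it was deleted.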

Aside from pruning, what happens to the value recorded in the .mtimes
file is more interesting. For the case you're talking about, we'll err
on the side of newer mtimes (either the original timestamp is recorded,
or some future time when the containing pack was rewritten). But the
more interesting case is when an object becomes re-referenced. Since the
ref-update doesn't cause the object to be rewritten, we wouldn't change
the timestamp.

Anyway, both of these are still independent from cruft packs, so we're
not changing the status quo there, I don't think.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-12-02 14:33   ` Derrick Stolee
@ 2021-12-03 21:53     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 21:53 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 09:33:51AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > +Notable alternatives to this design include:
> > +
> > +  - The location of the per-object mtime data, and
> > +  - Whether cruft packs should be incremental or not.
>
> It was not obvious from this sentence that "incremental" meant that
> we could store a number of cruft packs and use the mtime of each pack
> as the time for all contained objects.

Yes, I think I meant "incremental" in the sense of "incremental commit-
graphs". But it's clearer to say "storing unreachable objects in
multiple cruft packs" (and then giving an example later on). Thanks!

> I think what is hidden underneath "significantly more complicated to
> construct" are situations such as "this object was in an old cruft
> pack, but then became reachable, but now is unreachable again". I'll
> try to remember to come back to this after seeing the situations you
> cover in your tests.

Yeah, I'm being deliberately vague here, since the aim of this paragraph
is to illustrate "this is much more complicated than what we implement
here, and the trade-offs are..."

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-02 15:06   ` Derrick Stolee
  2021-12-02 22:32     ` brian m. carlson
@ 2021-12-03 22:24     ` Taylor Blau
  2022-01-07 19:41       ` Taylor Blau
  1 sibling, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 22:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, gitster, larsxschneider, peff, tytso, brian m. carlson

On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
>
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
>
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

Yeah, I'm definitely borrowing from the commit-graph and multi-pack
index formats here. Though I believe we did the same thing for .rev
files, too (and checking with Documentation/technical/pack-format.txt
confirms as much).

I don't have a strong feeling about using the 4-byte identifier or not.
But making this field four bytes wide is very much intentional, since it
makes sure that all of our reads are aligned, which should yield much
better cache performance (assuming the page size is also a multiple of
four).

That said, if others feel strongly, we could write the magic
identifiers brian points out downthread here instead. (It would be
mildly inconvenient for GitHub, which has many hundreds of thousands of
these files lying around everywhere with '1' as the identifier. But
since the magic identifiers don't collide with the values proposed here,
GitHub's fork could easily be taught to accept both on the reading side,
but only write out the special identifier).
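
A reader that accepts both encodings could be sketched as follows. The
enum and function are hypothetical and not part of this series; the
numeric values are the ones mentioned in this thread:

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum sk_hash {
	SK_HASH_UNKNOWN = 0,
	SK_HASH_SHA1,
	SK_HASH_SHA256
};

/*
 * Map the on-disk hash identifier to an algorithm, accepting both the
 * small integers used in this series and the ASCII format IDs.
 */
static enum sk_hash sk_parse_hash_id(uint32_t id)
{
	switch (id) {
	case 1:
	case 0x73686131: /* "sha1" */
		return SK_HASH_SHA1;
	case 2:
	case 0x73323536: /* "s256" */
		return SK_HASH_SHA256;
	default:
		return SK_HASH_UNKNOWN;
	}
}
```

Since the two sets of values do not overlap, a single switch is enough
to disambiguate them.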

> > +  - A table of mtimes (one per packed object, num_objects in total, each
> > +    a 4-byte unsigned integer in network order), in the same order as
> > +    objects appear in the index file (e.g., the first entry in the mtime
> > +    table corresponds to the object with the lowest lexically-sorted
> > +    oid). The mtimes count standard epoch seconds.
>
> This paragraph seemed awkward. Here is a rephrasing that might be
> less awkward:
>
>  - A table of 4-byte unsigned integers in network order. The ith value
>    is the modified time (mtime) of the ith object of the corresponding
>    pack in lexicographic order. The mtime represents standard epoch
>    seconds.

Thanks, this is clearer. I went with a blend of the two:

    - A table of 4-byte unsigned integers in network order. The ith
      value is the modification time (mtime) of the ith object in the
      corresponding pack by lexicographic (index) order. The mtimes
      count standard epoch seconds.

> Storing these mtimes in 32-bits means we will hit the 2038 problem.
> The commit-graph stores commit times with an extra two bits to extend
> the lifetime by another hundred years or so.
>
> Could we extend the lifetime of cruft packs by decreasing the granularity
> here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> should be enough granularity for these purposes.

Perhaps, though it does add some complexity to the code that deals with
this format at the expense of some future-proofing. I'm open to it,
though.

>
> > +  - A trailer, containing a:
> > +
> > +    checksum of the corresponding packfile, and
> > +
> > +    a checksum of all of the above.
>
> Could you specify the checksum as having length according to the
> specified hash function?

Great suggestion, thanks.

> > +All 4-byte numbers are in network order.
> > +
>
> Maybe this could be at the start of the format, since the file
> version and hash function are both 4-byte numbers here and we
> could remove the mention of network order from the mtime values.

This is copy-and-pasted from the .rev section above, where I think I
added the "All 4-byte numbers are in network order" bit at the end in
response to a suggestion opposite yours ;).

Here I would probably rather stay consistent with the surrounding
sections.

> > +static char *pack_mtimes_filename(struct packed_git *p)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(p->pack_name, ".pack", &len))
> > +		BUG("pack_name does not end in .pack");
> > +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> > +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> > +}
>
> I see your NEEDSWORK here and you are probably referring to this:
>
> static char *pack_revindex_filename(struct packed_git *p)
> {
> 	size_t len;
> 	if (!strip_suffix(p->pack_name, ".pack", &len))
> 		BUG("pack_name does not end in .pack");
> 	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
> }
>
> and the implementation is identical except for the new file suffix
> (which exists in the exts[] array in builtin/repack.c, but could
> also be pulled out into a header somewhere).
>
> I'm happy to delay any cleanup of these code clones until later,
> if at all, because doing it right might mean moving more code
> than we like. Such refactorings aren't worth it most of the time.

Yeah, I think your thoughts matched my own when writing this. Which is
to say, I felt it prudent to call out that there is an opportunity to
DRY these two up, but I'm not convinced that such a clean up would be
worthwhile.

> > +static int load_pack_mtimes_file(char *mtimes_file,
> > +				 uint32_t num_objects,
> > +				 const uint32_t **data_p, size_t *len_p)
> > +{
>
> > +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> > +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
>
> This message could be more informative: "mtimes file %s has the wrong size"?

Copy-and-pasting here again from the corresponding code for the .rev
file, which is why I didn't opt to change the message here. Probably
many of these checks could be extracted out and shared between the two
paths, but I don't think we should attempt it here.

> > +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> > +
> > +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> > +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> > +		goto cleanup;
> > +	}
>
> Interesting that you defined 'struct mtimes_header' before this
> method, but don't use it here (in favor of moving a uint32_t
> pointer). Perhaps you are avoiding pointing the struct at the
> memory map, but you could also do this:
>
> 	struct mtimes_header header;
>
> 	header.signature = ntohl(hdr[0]);
> 	header.version = ntohl(hdr[1]);
> 	header.hash_id = ntohl(hdr[2]);
>
> And then operate on the struct for your validation.
>
> At the very least, 'struct mtimes_header' is defined but not
> used in this patch. If you decide to not use it this way, then
> maybe delay its definition.

Yeah, not reading directly out of the struct is intentional, since the
compiler is free to insert padding between these members, which would
break any subsequent reads out of the struct.

But I like your idea to assign the fields manually, thanks!
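
A standalone sketch of that field-by-field approach is below. It reads
each word from the mapped bytes instead of overlaying a struct on the
map, so padding never matters. The struct and function names are
invented for the example; the accepted version and hash identifiers are
the ones listed in the format description:

```c
#include <stdint.h>

#define SK_MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */

/* In-memory copy of the header; never overlaid on the mapped file. */
struct sk_mtimes_header {
	uint32_t signature;
	uint32_t version;
	uint32_t hash_id;
};

/* Read a big-endian 32-bit value byte by byte. */
static uint32_t sk_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill `h` from the first 12 bytes of `map`. Returns 0 if the header
 * is one we understand, -1 otherwise.
 */
static int sk_parse_mtimes_header(const unsigned char *map,
				  struct sk_mtimes_header *h)
{
	h->signature = sk_be32(map);
	h->version = sk_be32(map + 4);
	h->hash_id = sk_be32(map + 8);

	if (h->signature != SK_MTIMES_SIGNATURE)
		return -1;
	if (h->version != 1)
		return -1;
	if (h->hash_id != 1 && h->hash_id != 2)
		return -1;
	return 0;
}
```

The caller is assumed to have already checked that the map is at least
12 bytes long.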

> > +int load_pack_mtimes(struct packed_git *p)
> > +{
> > +	char *mtimes_name = NULL;
> > +	int ret = 0;
> > +
> > +	if (!p->is_cruft)
> > +		return ret; /* not a cruft pack */
>
> Interesting that this indicator is essentially "we have an mtimes
> file for this pack", but it makes sense to include that check next
> to the .keep and .promisor checks.

I think I had originally called it "mtimes" but changed it to "cruft",
since it makes sense as a prefix similar to the others (that is, "keep
pack", "promisor pack", and "cruft pack", not "mtimes pack").

> The hunks I did not comment on look good. Nice standard file format
> stuff.

Thanks for your review!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-12-02 15:22   ` Derrick Stolee
@ 2021-12-03 22:40     ` Taylor Blau
  2021-12-06 17:33       ` Derrick Stolee
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 22:40 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 10:22:05AM -0500, Derrick Stolee wrote:
> I notice that you don't use this in load_pack_mtimes_file(),
> in pack-mtimes.c but you could at this point.

Hmm, I'm confused. The extracted function converts a pointer to a struct
git_hash_algo into a uint32, but here we just care about reading the
four byte value we wrote.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-12-02 15:36   ` Derrick Stolee
@ 2021-12-03 23:04     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-12-03 23:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 10:36:16AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > @@ -168,6 +168,9 @@ struct packing_data {
> >  	/* delta islands */
> >  	unsigned int *tree_depth;
> >  	unsigned char *layer;
> > +
> > +	/* cruft packs */
> > +	uint32_t *cruft_mtime;
>
> This comment is a bit terse. Perhaps...
>
> 	/* Used when writing cruft packs. */

Sure; here I was imitating the terseness of the "delta islands" comment
a few lines above. But I don't mind changing it here.

> > +static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
> > +				      struct object_entry *e)
> > +{
> > +	if (!pack->cruft_mtime)
> > +		return 0;
> > +	return pack->cruft_mtime[e - pack->objects];
> > +}
>
> When writing a pack, it appears that the cruft_mtime array
> maps to objects in pack-order, not idx-order, correct? That
> might be worth mentioning in the struct definition because
> it differs from the .mtimes file.

Great observation and suggestion, thank you! The comment that I
ultimately settled on is:

  /*
   * Used when writing cruft packs.
   *
   * Object mtimes are stored in pack order when writing, but
   * written out in lexicographic (index) order.
   */
  uint32_t *cruft_mtime;

> > +static void write_mtimes_objects(struct hashfile *f,
> > +				 struct packing_data *to_pack,
> > +				 struct pack_idx_entry **objects,
> > +				 uint32_t nr_objects)
> > +{
> > +	uint32_t i;
> > +	for (i = 0; i < nr_objects; i++) {
> > +		struct object_entry *e = (struct object_entry*)objects[i];
> > +		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
> > +	}
>
> The name "objects" here confused me at first, thinking it
> corresponded to the objects member of 'struct packing_data', but
> that is being handled by the fact that 'objects' is actually a
> lex-sorted list of pack_idx_entry pointers (and they happen to
> also point to 'struct object_entry' values because the 'struct
> pack_idx_entry' is the first member).
>
> So this is (very densely) handling the translation from pack-order
> to lex-order through the double pointer 'objects'. I'm not sure if
> there is a way to make it more clear or if every reader will need
> to do the same mental gymnastics I had to do.

Exactly, and sorry that I didn't point this out more clearly. It's been
long enough since I wrote this code that I can sympathize with the
mental gymnastics required ;).
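
For anyone following along, the pack-order to lex-order translation
discussed above can be sketched in isolation. This is only an
illustration under stated assumptions, not the code from the patch:
`struct idx_entry`, `struct obj_entry`, and `mtimes_in_lex_order()` are
hypothetical, simplified stand-ins for 'struct pack_idx_entry', 'struct
object_entry', and the loop in write_mtimes_objects(), and the 4-byte
object ID is a toy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for 'struct pack_idx_entry' (toy 4-byte ID). */
struct idx_entry {
	unsigned char oid[4];
};

/* Hypothetical stand-in for 'struct object_entry'. */
struct obj_entry {
	struct idx_entry idx;	/* must stay the first member */
	/* ...other per-object packing state would live here... */
};

/* Compare two 'struct idx_entry *' elements by object ID. */
static int cmp_idx(const void *a, const void *b)
{
	const struct idx_entry *x = *(const struct idx_entry *const *)a;
	const struct idx_entry *y = *(const struct idx_entry *const *)b;
	return memcmp(x->oid, y->oid, sizeof(x->oid));
}

/*
 * 'objects' and 'mtimes' are parallel arrays in pack order. Fill 'out'
 * with the same mtimes in lexicographic (index) order of object ID.
 * Returns 0 on success, -1 on allocation failure.
 */
static int mtimes_in_lex_order(struct obj_entry *objects,
			       const uint32_t *mtimes,
			       uint32_t nr, uint32_t *out)
{
	struct idx_entry **sorted;
	uint32_t i;

	if (!nr)
		return 0;
	sorted = malloc(nr * sizeof(*sorted));
	if (!sorted)
		return -1;
	for (i = 0; i < nr; i++)
		sorted[i] = &objects[i].idx;
	qsort(sorted, nr, sizeof(*sorted), cmp_idx);

	for (i = 0; i < nr; i++) {
		/* Valid only because 'idx' is the first member. */
		struct obj_entry *e = (struct obj_entry *)sorted[i];
		assert(e >= objects && e < objects + nr);
		/* 'e - objects' recovers the pack-order position. */
		out[i] = mtimes[e - objects];
	}
	free(sorted);
	return 0;
}
```

The cast back to the outer struct and the `e - objects` subtraction are
the two steps that the patch performs in a single line each, which is
where the density comes from.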

> > +}
> > +
> > +static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
> > +{
> > +	hashwrite(f, hash, the_hash_algo->rawsz);
> > +}
> > +
> > +static const char *write_mtimes_file(const char *mtimes_name,
> > +				     struct packing_data *to_pack,
> > +				     struct pack_idx_entry **objects,
> > +				     uint32_t nr_objects,
> > +				     const unsigned char *hash)
> > +{
> > +	struct hashfile *f;
> > +	int fd;
> > +
> > +	if (!to_pack)
> > +		BUG("cannot call write_mtimes_file with NULL packing_data");
> > +
> > +	if (!mtimes_name) {
> > +		struct strbuf tmp_file = STRBUF_INIT;
> > +		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
> > +		mtimes_name = strbuf_detach(&tmp_file, NULL);
> > +	} else {
> > +		unlink(mtimes_name);
> > +		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> > +	}
> > +	f = hashfd(fd, mtimes_name);
> > +
> > +	write_mtimes_header(f);
> > +	write_mtimes_objects(f, to_pack, objects, nr_objects);
> > +	write_mtimes_trailer(f, hash);
> > +
> > +	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
> > +		die(_("failed to make %s readable"), mtimes_name);
>
> What could cause 'mtimes_name' to be NULL here? It seems that it would
> be initialized in the "if (!mtimes_name)" block above.

You're right, it's impossible for it to be NULL here. I'll remove the
redundant side of the &&-expression here.

> > +
> > +	finalize_hashfile(f, NULL,
> > +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
> > +
> > +	return mtimes_name;
>
> Note that you return the name here...
>
> > +	if (pack_idx_opts->flags & WRITE_MTIMES) {
> > +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
> > +						    nr_written,
> > +						    hash);
> > +		if (adjust_shared_perm(mtimes_tmp_name))
> > +			die_errno("unable to make temporary mtimes file readable");
>
> ...and then adjust the perms again. I think that this adjustment is
> redundant, because it already happened within the write_mtimes_file()
> method.

Yep, thanks. I'll clean it up here to just call adjust_shared_perm()
within write_mtimes_file().

Thanks,
Taylor

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2021-12-02 14:33   ` Derrick Stolee
@ 2021-12-04 22:20   ` Elijah Newren
  2021-12-04 23:32     ` Taylor Blau
  1 sibling, 1 reply; 200+ messages in thread
From: Elijah Newren @ 2021-12-04 22:20 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Git Mailing List, Junio C Hamano, Lars Schneider, Jeff King,
	Theodore Tso

On Mon, Nov 29, 2021 at 7:29 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/Makefile                  |  1 +
>  Documentation/technical/cruft-packs.txt | 95 +++++++++++++++++++++++++
>  2 files changed, 96 insertions(+)
>  create mode 100644 Documentation/technical/cruft-packs.txt
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index ed656db2ae..0b01c9408e 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
>  TECH_DOCS += MyFirstObjectWalk
>  TECH_DOCS += SubmittingPatches
>  TECH_DOCS += technical/bundle-format
> +TECH_DOCS += technical/cruft-packs
>  TECH_DOCS += technical/hash-function-transition
>  TECH_DOCS += technical/http-protocol
>  TECH_DOCS += technical/index-format
> diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
> new file mode 100644
> index 0000000000..bb54cce1b1
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,95 @@
> += Cruft packs
> +
> +Cruft packs offer an alternative to Git's traditional mechanism of removing
> +unreachable objects. This document provides an overview of Git's pruning
> +mechanism, and how cruft packs can be used instead to accomplish the same.
> +
> +== Background
> +
> +To remove unreachable objects from your repository, Git offers `git repack -Ad`
> +(see linkgit:git-repack[1]). Quoting from the documentation:
> +
> +[quote]
> +[...] unreachable objects in a previous pack become loose, unpacked objects,
> +instead of being left in the old pack. [...] loose unreachable objects will be
> +pruned according to normal expiry rules with the next 'git gc' invocation.
> +
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose objects and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes. If these unreachable objects were written into one big pack,
> +then either freshening that pack (because an object contained within it was
> +re-written) or creating a new pack of unreachable objects would cause the pack's
> +mtime to get updated, and the objects within it would never leave the expiration
> +window. Instead, objects are stored loose in order to keep track of the
> +individual object mtimes and avoid a situation where all cruft objects are
> +freshened at once.
> +
> +This can lead to undesirable situations when a repository contains many
> +unreachable objects which have not yet left the grace period. Having large
> +directories in the shards of `.git/objects` can lead to decreased performance in
> +the repository. But given enough unreachable objects, this can lead to inode
> +starvation and degrade the performance of the whole system. Since we
> +can never pack those objects, these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.
> +
> +== Cruft packs
> +
> +Cruft packs are designed to eliminate the need for storing unreachable objects
> +in a loose state by including the per-object mtimes in a separate file alongside
> +a single pack containing all loose objects.

I had the same question as Stolee here: why not use the cruft-pack's
mtime for all the objects in it?  Much later below, you make it clear
that a repository will generally only have one cruft pack which kind
of answers the question, but the repeated mention of "cruft packs"
throughout the document subtly made me make the opposite assumption.
It might be nice to address the almost-always-only-one-cruft-pack
earlier on, which may also help answer the question about why you need
to store individual mtimes in an additional file.

> +A cruft pack is written by `git repack --cruft` when generating a new pack.
> +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
> +is a classic all-into-one repack, meaning that everything in the resulting pack is
> +reachable, and everything else is unreachable. Once written, the `--cruft`
> +option instructs `git repack` to generate another pack containing only objects
> +not packed in the previous step (which equates to packing all unreachable
> +objects together). This progresses as follows:
> +
> +  1. Enumerate every object, marking any object which is (a) not contained in a
> +     kept-pack, and (b) whose mtime is within the grace period as a traversal
> +     tip.
> +
> +  2. Perform a reachability traversal based on the tips gathered in the previous
> +     step, adding every object along the way to the pack.
> +
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.
> +
> +This mode is invoked internally by linkgit:git-repack[1] when instructed to
> +write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
> +of packs which will not be deleted by the repack; in other words, they contain
> +all of the repository's reachable objects.
> +
> +When a repository already has a cruft pack, `git repack --cruft` typically only
> +adds objects to it. An exception to this is when `git repack` is given the
> +`--cruft-expiration` option, which allows the generated cruft pack to omit
> +expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
> +later on.
> +
> +It is linkgit:git-gc[1] that is typically responsible for removing expired
> +unreachable objects.
> +
> +== Alternatives
> +
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Whether cruft packs should be incremental or not.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Incremental cruft packs (i.e., where each time a repository is repacked a new
> +cruft pack is generated containing only the unreachable objects introduced since
> +the last time a cruft pack was written) are significantly more complicated to
> +construct, and so aren't pursued here. The obvious drawback to the current
> +implementation is that the entire cruft pack must be re-written from scratch.
> --
> 2.34.1.25.gb3157a20e6
>

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-12-04 22:20   ` Elijah Newren
@ 2021-12-04 23:32     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2021-12-04 23:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Lars Schneider, Jeff King,
	Theodore Tso

On Sat, Dec 04, 2021 at 02:20:23PM -0800, Elijah Newren wrote:
> > +== Cruft packs
> > +
> > +Cruft packs are designed to eliminate the need for storing unreachable objects
> > +in a loose state by including the per-object mtimes in a separate file alongside
> > +a single pack containing all loose objects.
>
> I had the same question as Stolee here: why not use the cruft-pack's
> mtime for all the objects in it?  Much later below, you make it clear
> that a repository will generally only have one cruft pack which kind
> of answers the question, but the repeated mention of "cruft packs"
> throughout the document subtly made me make the opposite assumption.
> It might be nice to address the almost-always-only-one-cruft-pack
> earlier on, which may also help answer the question about why you need
> to store individual mtimes in an additional file.

Responding to your suggestions out of order ;-). Throughout the
document, I wrote "cruft packs" in the sense of "the feature this series
implements", not "multiple cruft packs".

But my wording is unintentionally vague, especially because this
document does talk about why this series stores unreachable objects in a
single cruft pack. I updated my copy to make clear the difference
between the two, which should hopefully avoid any confusion here in the
future.

As far as why not use the cruft pack's timestamp as the mtime for all of
the unreachable objects contained within it, there are a few reasons:

It makes freshening objects more complicated. Not because we couldn't
freshen individual objects (we would likely do so in the same way this
series does, by rewriting it loose and using the loose copy's mtime
instead), but because it makes it complicated to repack a repository
with many cruft packs. If I have a handful of cruft packs, and freshen a
handful of objects within them, I now need to update many cruft packs,
or pay the price of storing their objects twice (if I instead don't
rewrite them and keep the loose copies around).

It also makes it impossible to share deltas between cruft objects that
don't have the same timestamp, unless the cruft packs are stored thin
(in which case it becomes much more complicated to figure out which
cruft packs can be safely pruned without storing information about which
other packs a thin pack has deltas against).

I'm sure there were others, but these are the ones that I could recall
off the top of my head. This all felt like a little too much detail for
the "alternative designs" section, but if you think some or all of this
would be interesting to memorialize not just on the mailing list, let me
know.

Thanks,
Taylor

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2021-12-05 20:46   ` Junio C Hamano
  2022-03-01  2:00     ` Taylor Blau
  2021-12-07 15:38   ` Derrick Stolee
  1 sibling, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2021-12-05 20:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, larsxschneider, peff, tytso

Various thoughts on just this part, as the hunk got my attention
while merging with other topics in 'seen'.

> +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> +			die(_("--cruft and -A are incompatible"));
> +		if (keep_unreachable)
> +			die(_("--cruft and -k are incompatible"));
> +		if (!(pack_everything & ALL_INTO_ONE))
> +			die(_("--cruft must be combined with all-into-one"));
> +	}

The "reuse similar messages for i18n" topic will encourage us to
turn this part into:

	if (pack_everything & PACK_CRUFT && delete_redundant) {
		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-A");
		if (keep_unreachable)
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-k");
		if (!(pack_everything & ALL_INTO_ONE))
			die(_("--cruft must be combined with all-into-one"));
	}

The conditionals are a bit unpleasant to read and maintain, but I
guess we cannot help it?

Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
probably not know that it is the name the code gave to the bit that
is turned on when given an option externally known under a different
name (is that "-a"?).

If "--cruft" must be used with "all into one", I wonder if it makes
sense to make it imply that?  Not in the sense that OPT_BIT()
initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
after parse_options() returns, we check PACK_CRUFT and if it is on
turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
option, the user won't break us by giving "--no-all-into-one" after
they gave us "--cruft")?  I didn't think about this part thoroughly
enough, though.
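
A minimal sketch of that "imply it after parsing" idea, to make the
ordering concrete. Everything here is illustrative: the flag values and
the helper name `finalize_pack_flags()` are assumptions of mine, not
anything from the patch, and the real change would sit right after the
parse_options() call in cmd_repack().

```c
/* Illustrative bit values; the real definitions live in builtin/repack.c. */
#define ALL_INTO_ONE       (1u << 0)
#define LOOSEN_UNREACHABLE (1u << 1)
#define PACK_CRUFT         (1u << 2)

/*
 * Hypothetical helper, meant to run once option parsing has finished.
 * If --cruft was requested, force all-into-one on, no matter in which
 * order (or with which negations) the options were given.
 */
static unsigned finalize_pack_flags(unsigned pack_everything)
{
	if (pack_everything & PACK_CRUFT)
		pack_everything |= ALL_INTO_ONE;
	return pack_everything;
}
```

Because the implication is applied after the fact, a later negated
option cannot clear ALL_INTO_ONE while PACK_CRUFT is still set, and
negating "--cruft" does not silently drop an earlier "-a".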

Thanks.

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-12-03 22:40     ` Taylor Blau
@ 2021-12-06 17:33       ` Derrick Stolee
  0 siblings, 0 replies; 200+ messages in thread
From: Derrick Stolee @ 2021-12-06 17:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, larsxschneider, peff, tytso

On 12/3/21 5:40 PM, Taylor Blau wrote:
> On Thu, Dec 02, 2021 at 10:22:05AM -0500, Derrick Stolee wrote:
>> I notice that you don't use this in load_pack_mtimes_file(),
>> in pack-mtimes.c but you could at this point.
> 
> Hmm, I'm confused. The extracted function converts a pointer to a struct
> git_hash_algo into a uint32, but here we just care about reading the
> four byte value we wrote.

Ah. I got mixed up here. Sorry.

-Stolee

* Re: [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2021-12-06 21:16   ` Derrick Stolee
  2022-02-23 22:24     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-06 21:16 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> +static int dump_mtimes(struct packed_git *p)

nit: you return an int here so you can use it as an error code...

> +{
> +	uint32_t i;
> +	if (load_pack_mtimes(p) < 0)
> +		die("could not load pack .mtimes");
> +
> +	for (i = 0; i < p->num_objects; i++) {
> +		struct object_id oid;
> +		if (nth_packed_object_id(&oid, p, i) < 0)
> +			die("could not load object id at position %"PRIu32, i);
> +
> +		printf("%s %"PRIu32"\n",
> +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
> +	}
> +
> +	return 0;

But always return 0 unless you die().

> +	return p ? dump_mtimes(p) : 1;

It makes this line concise, I suppose.

Perhaps just use "return dump_mtimes(p)" and have dump_mtimes()
return 1 if the given pack is NULL?

Thanks,
-Stolee

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2021-12-06 21:44   ` Derrick Stolee
  2022-03-01  2:48     ` Taylor Blau
  2021-12-07 15:17   ` Derrick Stolee
  1 sibling, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-06 21:44 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> Generating a non-expiring cruft pack works as follows:

I had trouble parsing the documentation changes below, so I came back
to this commit message to see if that helps.
 
>   - Callers provide a list of every pack they know about, and indicate
>     which packs are about to be removed.

This corresponds to the list over stdin.
 
>   - All packs which are going to be removed (we'll call these the
>     redundant ones) are marked as kept in-core, as well as any packs
>     that `pack-objects` found but the caller did not specify.

Ok, so as an implementation detail we mark these as keep packs.

>     These packs are presumed to have entered the repository between
>     the caller collecting packs and invoking `pack-objects`. Since we
>     do not want to include objects in these packs (because we don't know
>     which of their objects are or aren't reachable), these are also
>     marked as kept in-core.

Here, "are presumed" is doing a lot of work. Theoretically, there could
be three categories:

1. This pack was just repacked and will be removed because all of its
   objects were placed into new packs.

2. Either this pack was repacked and contains important reachable objects
   OR we did a repack of reachable objects and this pack contained some
   extra, unreachable objects.

3. This pack was added to the repository while creating those repacked
   packs from category 2, so we don't know if things are reachable or
   not.

So, the packs that we discover on-disk but are not specified over stdin
are in this third category, but these are grouped with category 1 as we
will treat them the same.

>   - Then, we enumerate all objects in the repository, and add them to
>     our packing list if they do not appear in an in-core kept pack.

Here, we are looking at all of the objects in category 2 as well as
loose objects.

> This results in a new cruft pack which contains all known objects that
> aren't included in the kept packs. When the kept pack is the result of
> `git repack -A`, the resulting pack contains all unreachable objects.

This now describes how 'git repack' will interface with this new change
to pack-objects. I'll keep an eye out for that.

> +--cruft::

Now getting to this description.

> +	Packs unreachable objects into a separate "cruft" pack, denoted
> +	by the existence of a `.mtimes` file. Pack names provided over
> +	stdin indicate which packs will remain after a `git repack`.
> +	Pack names prefixed with a `-` indicate those which will be
> +	removed. (...)

This description is too tied to 'git repack'. Can we describe the
input using terms independent of the 'git repack' operation? I need
to keep reading.

> (...) The contents of the cruft pack are all objects not
> +	contained in the surviving packs specified by `--keep-pack`)

Now you use --keep-pack, which is a way of specifying a pack as
"in-core keep" which was not in your commit message. Here, we also
don't link the packs over stdin to the concept of keep packs.

> +	which have not exceeded the grace period (see
> +	`--cruft-expiration` below), or which have exceeded the grace
> +	period, but are reachable from another object which hasn't.

And now we think about the grace period! There is so much going on
that I need to break it down to understand.

  An object is _excluded_ from the new cruft pack if

  1. It is reachable from at least one reference.
  2. It is in a pack from stdin prefixed with "-"
  3. It is in a pack specified by `--keep-pack`
  4. It is in an existing cruft pack and the .mtimes file states
     that its mtime is at least as recent as the time specified by
     the --cruft-expiration option.

Breaking it down into a list like this helps me, at least. I'm not
sure what the best way would look like.

(Needing to pause here and look at the implementation later.)

Thanks,
-Stolee

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
  2021-12-06 21:44   ` Derrick Stolee
@ 2021-12-07 15:17   ` Derrick Stolee
  2022-02-23 23:34     ` Taylor Blau
  1 sibling, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:17 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
> +				  struct packed_git *pack, off_t offset,
> +				  const char *name, uint32_t mtime)
> +{
> +	struct object_entry *entry;
> +
> +	display_progress(progress_state, ++nr_seen);

I don't love the global nr_seen here, but it is pervasive through the
file. OK.

> +	entry = packlist_find(&to_pack, oid);
> +	if (entry) {
> +		if (name) {
> +			entry->hash = pack_name_hash(name);
> +			entry->no_try_delta = name && no_try_delta(name);

This is already in an "if (name)" block, so "name &&" isn't needed.

> +		}
> +	} else {
> +		if (!want_object_in_pack(oid, 0, &pack, &offset))
> +			return 0;
> +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> +			/*
> +			 * If a traversed tree has a missing blob then we want
> +			 * to avoid adding that missing object to our pack.
> +			 *
> +			 * This only applies to missing blobs, not trees,
> +			 * because the traversal needs to parse sub-trees but
> +			 * not blobs.
> +			 *
> +			 * Note we only perform this check when we couldn't
> +			 * already find the object in a pack, so we're really
> +			 * limited to "ensure non-tip blobs which don't exist in
> +			 * packs do exist via loose objects". Confused?
> +			 */
> +			return 0;
> +		}
> +
> +		entry = create_object_entry(oid, type, pack_name_hash(name),
> +					    0, name && no_try_delta(name),
> +					    pack, offset);
> +	}
> +
> +	if (mtime > oe_cruft_mtime(&to_pack, entry))
> +		oe_set_cruft_mtime(&to_pack, entry, mtime);
> +	return 1;

I was confused at this "return 1" here, while other cases return 0.

It turns out that there are multiple methods in this file that have
different semantics: add_loose_object() and add_object_entry_from_pack()
are both called from iterators where "return 1" means "stop iterating"
so they return 0 always. add_object_entry_from_bitmap() is used to
iterate over a bitmap and "return 1" means "include this object".

However, the return code for add_cruft_object_entry() is never used,
so it should probably return void or swap the meanings to have nonzero
mean an error occurred.

> +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> +{
> +	struct string_list_item *item = NULL;
> +	for_each_string_list_item(item, packs) {
> +		struct packed_git *p = item->util;
> +		if (!p)
> +			die(_("could not find pack '%s'"), item->string);

Interesting that this is a potential issue. We are expecting the pack
to be loaded before we get here. Is this more because some packs might
not actually load, but it's fine as long as we don't mark them as kept?

> +		p->pack_keep_in_core = keep;
> +	}
> +}
...
> +static void read_cruft_objects(void)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	struct string_list discard_packs = STRING_LIST_INIT_DUP;
> +	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
> +	struct packed_git *p;
> +
> +	ignore_packed_keep_in_core = 1;

Here is a global that we are suddenly changing. Should we not be
returning it to its initial state when this method is complete?

> +static int option_parse_cruft_expiration(const struct option *opt,
> +					 const char *arg, int unset)
> +{
> +	if (unset) {
> +		cruft = 0;

This unassignment of 'cruft' when cruft-expiration is unset with
--no-cruft-expiration seems odd. I would expect

	git pack-objects --cruft --no-cruft-expiration

to still make a cruft pack, but not expire anything. It seems that
your code here makes --no-cruft-expiration disable the --cruft option.
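
To illustrate the behavior I would expect, here is a rough standalone
sketch. It is not a proposed patch: `parse_cruft_expiration()` is a
simplified, hypothetical version of the callback (no 'struct option'),
and `parse_time()` is a placeholder for approxidate() that only
understands plain integers.

```c
#include <stdlib.h>

static int cruft;
static unsigned long cruft_expiration;

/* Placeholder for approxidate(); parses a plain decimal timestamp. */
static unsigned long parse_time(const char *arg)
{
	return strtoul(arg, NULL, 10);
}

/*
 * Simplified callback: '--no-cruft-expiration' (unset != 0) clears the
 * expiration only, and leaves a previously given '--cruft' alone.
 */
static int parse_cruft_expiration(const char *arg, int unset)
{
	if (unset) {
		cruft_expiration = 0;
		return 0;
	}
	cruft = 1;
	if (arg)
		cruft_expiration = parse_time(arg);
	return 0;
}
```

With this shape, `--cruft --no-cruft-expiration` leaves `cruft` set and
`cruft_expiration` at zero, i.e. "write a cruft pack, expire nothing".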

> +		cruft_expiration = 0;
> +	} else {
> +		cruft = 1;
> +		if (arg)
> +			cruft_expiration = approxidate(arg);
> +	}
> +	return 0;
> +}
..
> +		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
> +		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
> +		  N_("expire cruft objects older than <time>"),
> +		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),

> -static int has_loose_object(const struct object_id *oid)
> +int has_loose_object(const struct object_id *oid)
>  {
>  	return check_and_freshen(oid, 0);
>  }

I'm surprised this hasn't been modified to use a repository pointer.
Adding another caller here isn't too much debt, though.

> diff --git a/object-store.h b/object-store.h
> index d87481f101..a79c1c91ab 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r,
>   */
>  int has_loose_object_nonlocal(const struct object_id *);

Of course, here is another example that is already more widely used.

> +int has_loose_object(const struct object_id *);
> +
>  void assert_oid_type(const struct object_id *oid, enum object_type expect);

...

> +	test_expect_success "unreachable packed objects are packed (expire $expire)" '
> +		git init repo &&
> +		test_when_finished "rm -fr repo" &&
> +		(
> +			cd repo &&
> +
> +			test_commit packed &&
> +			git repack -Ad &&
> +			test_commit other &&
> +
> +			git rev-list --objects --no-object-names packed.. >objects &&
> +			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
> +			other="$(git pack-objects --delta-base-offset \
> +				$packdir/pack <objects)" &&
> +			git prune-packed &&
> +
> +			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&

I am missing how this test creates _unreachable_ objects. I would expect removal of
some refs or a 'git reset --hard' somewhere. What am I missing?

> +			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
> +			$keep
> +			-pack-$other.pack
> +			EOF
> +			)" &&
> +			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
> +
> +			cut -d" " -f2 <actual.raw | sort -u >actual &&
> +
> +			test_cmp expect actual
> +		)
> +	'
> +
> +	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '

I have the same question for all of the tests, really.

> +			# remove the unreachable tree, but leave the commit
> +			# which has it as its root tree in-tact

nit: "intact" is one word.

> +			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
> +
> +			git repack -Ad &&
> +			basename $(ls $packdir/pack-*.pack) >in &&
> +			git pack-objects --cruft --cruft-expiration="$expire" \
> +				$packdir/pack <in
> +		)
> +	'

...

> +basic_cruft_pack_tests never

I look forward to seeing how this changes with additional expiration values.

Thanks,
-Stolee


* Re: [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2021-12-07 15:30   ` Derrick Stolee
  2022-02-23 23:35     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:30 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
> +{
...
> +	/*
> +	 * Re-mark only the fresh packs as kept so that objects in
> +	 * unknown packs do not halt the reachability traversal early.
> +	 */
> +	for (p = get_all_packs(the_repository); p; p = p->next)
> +		p->pack_keep_in_core = 0;
> +	mark_pack_kept_in_core(fresh_packs, 1);

Are we ever going to recover this pack_keep_in_core state? Should we
be saving it somewhere so we can return without mutating this state
permanently?
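
If it does turn out to matter, one way to do the save/restore could look
roughly like the sketch below. This is only an assumption-laden
illustration: `struct pack` is a pared-down stand-in for 'struct
packed_git', and both helper names are made up.

```c
#include <stdlib.h>

/* Hypothetical, pared-down stand-in for 'struct packed_git'. */
struct pack {
	struct pack *next;
	unsigned pack_keep_in_core : 1;
};

/*
 * Record each pack's in-core keep bit, then clear it. Returns a
 * heap-allocated array (one byte per pack, in list order) for
 * restore_keep_bits(); the caller frees it. Returns NULL, without
 * modifying anything, if the list is empty or allocation fails.
 */
static unsigned char *save_and_clear_keep_bits(struct pack *packs)
{
	size_t nr = 0, i = 0;
	struct pack *p;
	unsigned char *saved;

	for (p = packs; p; p = p->next)
		nr++;
	if (!nr)
		return NULL;
	saved = malloc(nr);
	if (!saved)
		return NULL;
	for (p = packs; p; p = p->next) {
		saved[i++] = p->pack_keep_in_core;
		p->pack_keep_in_core = 0;
	}
	return saved;
}

/* Put back the bits recorded by save_and_clear_keep_bits(). */
static void restore_keep_bits(struct pack *packs, const unsigned char *saved)
{
	size_t i = 0;
	struct pack *p;

	if (!saved)
		return;
	for (p = packs; p; p = p->next)
		p->pack_keep_in_core = saved[i++];
}
```

This assumes the pack list does not change between the save and the
restore; if packs can be added in between, the saved array would need to
be keyed by pack rather than by position.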

> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +	if (progress)
> +		progress_state = start_progress(_("Traversing cruft objects"), 0);
> +	nr_seen = 0;
> +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
> +
> +	stop_progress(&progress_state);
> +}
> +
>  static void read_cruft_objects(void)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> @@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
>  	mark_pack_kept_in_core(&discard_packs, 0);
>  
>  	if (cruft_expiration)
> -		die("--cruft-expiration not yet implemented");
> +		enumerate_and_traverse_cruft_objects(&fresh_packs);
>  	else
>  		enumerate_cruft_objects();

>  basic_cruft_pack_tests never
> +basic_cruft_pack_tests 2.weeks.ago

I'm surprised these tests didn't require any changes to adapt to the
new expiration date. But I suppose none of the mtimes were older than
two weeks ago?

I continue to miss something in these tests, because I don't see how
things are becoming unreachable.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
  2021-12-05 20:46   ` Junio C Hamano
@ 2021-12-07 15:38   ` Derrick Stolee
  2022-02-23 23:37     ` Taylor Blau
  1 sibling, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:38 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +static int write_cruft_pack(const struct pack_objects_args *args,
> +			    const char *pack_prefix,
> +			    struct string_list *names,
> +			    struct string_list *existing_packs,
> +			    struct string_list *existing_kept_packs)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf line = STRBUF_INIT;
> +	struct string_list_item *item;
> +	FILE *in, *out;
> +	int ret;
> +
> +	prepare_pack_objects(&cmd, args);
> +
> +	strvec_push(&cmd.args, "--cruft");
> +	if (cruft_expiration)
> +		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
> +			     cruft_expiration);
> +
> +	strvec_push(&cmd.args, "--honor-pack-keep");
> +	strvec_push(&cmd.args, "--non-empty");
> +	strvec_push(&cmd.args, "--max-pack-size=0");

This --max-pack-size is meaningless, right? The config that would change
this is already ignored by 'git pack-objects'.

> +		OPT_BIT(0, "cruft", &pack_everything,
> +				N_("same as -a, pack unreachable cruft objects separately"),
> +				   PACK_CRUFT | ALL_INTO_ONE),

I can understand the use of OPT_BIT here. Keep in mind that --no-cruft would
remove the '-a' option, if it already existed. Perhaps we should just use
OPT_BOOL and update to add the ALL_INTO_ONE if PACK_CRUFT exists?

> +		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
> +				N_("with -C, expire objects older than this")),

Here, --no-cruft-expiration will set cruft_expiration to NULL and not overwrite
the --cruft option, as expected. Just pointing out that this is different than
the option in 'git pack-objects'.

> --- a/t/t5327-pack-objects-cruft.sh
> +++ b/t/t5327-pack-objects-cruft.sh
> @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
>  	)
>  '
>  
> +test_expect_success 'repack --cruft generates a cruft pack' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit reachable &&
> +		git branch -M main &&
> +		git checkout --orphan other &&

Here is a way to make objects unreachable!

> +		test_commit unreachable &&
> +
> +		git checkout main &&
> +		git branch -D other &&
> +		git tag -d unreachable &&

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-03 22:24     ` Taylor Blau
@ 2022-01-07 19:41       ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-01-07 19:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, gitster, larsxschneider, peff, tytso, brian m. carlson

On Fri, Dec 03, 2021 at 05:24:03PM -0500, Taylor Blau wrote:
> On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
>     - A table of 4-byte unsigned integers in network order. The ith
>       value is the modification time (mtime) of the ith object in the
>       corresponding pack by lexicographic (index) order. The mtimes
>       count standard epoch seconds.
>
> > Storing these mtimes in 32-bits means we will hit the 2038 problem.
> > The commit-graph stores commit times with an extra two bits to extend
> > the lifetime by another hundred years or so.
> >
> > Could we extend the lifetime of cruft packs by decreasing the granularity
> > here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> > should be enough granularity for these purposes.
>
> Perhaps, though it does add some complexity to the code that deals with
> this format at the expense of some future-proofing. I'm open to it,
> though.

I still have quite a bit of review from this topic sitting in my inbox.

But this had been lingering on my mind, and I realized I said something
incorrect. 32-bit mtimes won't cause us to run into the "2038" problem,
since these aren't signed values. So storing epoch seconds in a uint32_t
should get us into the year 2106.

If anybody is still using cruft packs by then, I'll call this project a
wild success ;-). So in the meantime, I don't think it makes sense to
reduce the granularity and/or use extra bits to store the timestamps.
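
To make the arithmetic concrete, here is a minimal standalone sketch (not
code from this series). The helper names decode_mtime() and
year_of_epoch_seconds() are made up for illustration; the only things taken
from the format description are that each table entry is a 4-byte unsigned
integer in network order counting epoch seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 4-byte, network-order (big-endian) table entry. */
static uint32_t decode_mtime(const unsigned char *buf)
{
	assert(buf);
	return ((uint32_t)buf[0] << 24) |
	       ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) |
	       (uint32_t)buf[3];
}

static int is_leap_year(int year)
{
	return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/*
 * Calendar year (UTC) that an epoch-seconds value falls in. This uses
 * plain 64-bit arithmetic so the result does not depend on the width
 * of time_t on the host.
 */
static int year_of_epoch_seconds(uint64_t secs)
{
	uint64_t days = secs / 86400;
	int year = 1970;

	for (;;) {
		uint64_t len = is_leap_year(year) ? 366 : 365;
		if (days < len)
			return year;
		days -= len;
		year++;
	}
}
```

With these helpers, year_of_epoch_seconds(UINT32_MAX) is 2106, while the
signed limit INT32_MAX lands in 2038.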

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-12-06 21:16   ` Derrick Stolee
@ 2022-02-23 22:24     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-02-23 22:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, tytso

On Mon, Dec 06, 2021 at 04:16:04PM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > +static int dump_mtimes(struct packed_git *p)
>
> nit: you return an int here so you can use it as an error code...
>
> > +{
> > +	uint32_t i;
> > +	if (load_pack_mtimes(p) < 0)
> > +		die("could not load pack .mtimes");
> > +
> > +	for (i = 0; i < p->num_objects; i++) {
> > +		struct object_id oid;
> > +		if (nth_packed_object_id(&oid, p, i) < 0)
> > +			die("could not load object id at position %"PRIu32, i);
> > +
> > +		printf("%s %"PRIu32"\n",
> > +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
> > +	}
> > +
> > +	return 0;
>
> But always return 0 unless you die().
>
> > +	return p ? dump_mtimes(p) : 1;
>
> It makes this line concise, I suppose.
>
> Perhaps just use "return dump_mtimes(p)" and have dump_mtimes()
> return 1 if the given pack is NULL?

I think just dying in the case we have a NULL pack is fine, and it
should be OK to lump it in the same case as "could not load pack .mtimes".

But we may want to catch the case a little earlier while we still have
the pack name handy. Perhaps something like this on top:

--- 8< ---
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
index b143f62520..f7b79daf4c 100644
--- a/t/helper/test-pack-mtimes.c
+++ b/t/helper/test-pack-mtimes.c
@@ -5,7 +5,7 @@
 #include "packfile.h"
 #include "pack-mtimes.h"

-static int dump_mtimes(struct packed_git *p)
+static void dump_mtimes(struct packed_git *p)
 {
 	uint32_t i;
 	if (load_pack_mtimes(p) < 0)
@@ -19,8 +19,6 @@ static int dump_mtimes(struct packed_git *p)
 		printf("%s %"PRIu32"\n",
 		       oid_to_hex(&oid), nth_packed_mtime(p, i));
 	}
-
-	return 0;
 }

 static const char *pack_mtimes_usage = "\n"
@@ -49,5 +47,10 @@ int cmd__pack_mtimes(int argc, const char **argv)

 	strbuf_release(&buf);

-	return p ? dump_mtimes(p) : 1;
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
 }
--- >8 ---

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-12-07 15:17   ` Derrick Stolee
@ 2022-02-23 23:34     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-02-23 23:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Tue, Dec 07, 2021 at 10:17:28AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
> > +				  struct packed_git *pack, off_t offset,
> > +				  const char *name, uint32_t mtime)
> > +{
> > +	struct object_entry *entry;
> > +
> > +	display_progress(progress_state, ++nr_seen);
>
> I don't love the global nr_seen here, but it is pervasive through the
> file. OK.

Yeah; this is how all of the existing progress code works in
pack-objects.

> > +	entry = packlist_find(&to_pack, oid);
> > +	if (entry) {
> > +		if (name) {
> > +			entry->hash = pack_name_hash(name);
> > +			entry->no_try_delta = name && no_try_delta(name);
>
> This is already in an "if (name)" block, so "name &&" isn't needed.

Thanks; this is a copy-and-paste from add_object_entry(), where we
aren't in a conditional on "name". We could also fold the conditional on
whether or not name is NULL into no_try_delta itself, since all existing
calls look like "name && no_try_delta(name)".

So adding something like:

    if (!name)
      return 0;

to the beginning of no_try_delta()'s implementation would allow us to
get rid of the handful of "name &&"s. But I'm trying to avoid touching
other parts of pack-objects as much as I can, so I'll hold off for now.

> > +		}
> > +	} else {
> > +		if (!want_object_in_pack(oid, 0, &pack, &offset))
> > +			return 0;
> > +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> > +			/*
> > +			 * If a traversed tree has a missing blob then we want
> > +			 * to avoid adding that missing object to our pack.
> > +			 *
> > +			 * This only applies to missing blobs, not trees,
> > +			 * because the traversal needs to parse sub-trees but
> > +			 * not blobs.
> > +			 *
> > +			 * Note we only perform this check when we couldn't
> > +			 * already find the object in a pack, so we're really
> > +			 * limited to "ensure non-tip blobs which don't exist in
> > +			 * packs do exist via loose objects". Confused?
> > +			 */
> > +			return 0;
> > +		}
> > +
> > +		entry = create_object_entry(oid, type, pack_name_hash(name),
> > +					    0, name && no_try_delta(name),
> > +					    pack, offset);
> > +	}
> > +
> > +	if (mtime > oe_cruft_mtime(&to_pack, entry))
> > +		oe_set_cruft_mtime(&to_pack, entry, mtime);
> > +	return 1;
>
> I was confused at this "return 1" here, while other cases return 0.
>
> It turns out that there are multiple methods in this file that have
> different semantics: add_loose_object() and add_object_entry_from_pack()
> are both called from iterators where "return 1" means "stop iterating"
> so they return 0 always. add_object_entry_from_bitmap() is used to
> iterate over a bitmap and "return 1" means "include this object".
>
> However, the return code for add_cruft_object_entry() is never used,
> so it should probably return void or swap the meanings to have nonzero
> mean an error occurred.

Yes, exactly. And thanks for tracing out both of the different
meanings/interpretations of these add_xyz_entry() functions. As you can
imagine, this implementation is copy-and-pasted from add_object_entry(),
which was specialized for this use here. At the time, I gave some effort
towards trying to share more code with add_object_entry() for this
special case, but it ended up being pretty awkward, hence the separate
implementation.

Ironically, add_object_entry()'s return code is also unused, so we could
probably clean that up, too. But like the above, I'll avoid it for now
in an effort to touch as little of pack-objects in this patch as I can.

> > +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> > +{
> > +	struct string_list_item *item = NULL;
> > +	for_each_string_list_item(item, packs) {
> > +		struct packed_git *p = item->util;
> > +		if (!p)
> > +			die(_("could not find pack '%s'"), item->string);
>
> Interesting that this is a potential issue. We are expecting the pack
> to be loaded before we get here. Is this more because some packs might
> not actually load, but it's fine as long as we don't mark them as kept?

Not quite "loaded" (though any pack structures that we look at by this
point will be fully "loaded"). Instead, we're making sure that all of
the pack names we read from stdin could be matched to packs that we
found in the repository (i.e., that we produce an appropriate error
message if we found "pack-does-not-exist.pack" on stdin).

This is all because we process input from stdin in two phases:

  - First, read all of the input into two string_lists, one for the
    packs we're about to discard (anything that starts with '-'), and
    another for all of the "fresh" packs (i.e., anything that we're not
    going to discard).

  - Then, loop through all of the packed_git structs we have, querying
    both of the aforementioned string lists for input that matches each
    pack's `pack_name` field, and setting the `->util` pointer of the
    matching string_list_item appropriately.

Following those two steps, any list entries that have a NULL util
pointer correspond with bogus input, so we want to call die() there.
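
As a rough standalone illustration of those two phases (a sketch, not the
code from the patch): 'struct known_pack' and 'struct list_item' below are
simplified stand-ins for packed_git and string_list_item, and all three
function names are made up. The sketch returns the offending name where the
real code would call die().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parts of 'struct packed_git' used here. */
struct known_pack {
	const char *pack_name;
	int pack_keep_in_core;
};

/* Hypothetical stand-in for 'struct string_list_item'. */
struct list_item {
	const char *string;
	struct known_pack *util;
};

/*
 * Phase 1: classify one line of input. Returns 1 for a pack that is
 * about to be discarded ('-' prefix), 0 for a "fresh" pack, and stores
 * the bare pack name in *name.
 */
static int is_discard_line(const char *line, const char **name)
{
	assert(line && name);
	if (*line == '-') {
		*name = line + 1;
		return 1;
	}
	*name = line;
	return 0;
}

/* Phase 2: set each item's util pointer to the pack with the same name. */
static void match_items_to_packs(struct list_item *items, size_t nr_items,
				 struct known_pack *packs, size_t nr_packs)
{
	size_t i, j;

	for (i = 0; i < nr_packs; i++)
		for (j = 0; j < nr_items; j++)
			if (!strcmp(packs[i].pack_name, items[j].string))
				items[j].util = &packs[i];
}

/*
 * Returns the first name that matched no pack (bogus input), or NULL
 * if every item was matched.
 */
static const char *first_unmatched(const struct list_item *items,
				   size_t nr_items)
{
	size_t j;

	for (j = 0; j < nr_items; j++)
		if (!items[j].util)
			return items[j].string;
	return NULL;
}
```

Running both lists through match_items_to_packs() and then checking
first_unmatched() is the point at which "pack-does-not-exist.pack" would be
reported.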

> > +		p->pack_keep_in_core = keep;
> > +	}
> > +}
> ...
> > +static void read_cruft_objects(void)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	struct string_list discard_packs = STRING_LIST_INIT_DUP;
> > +	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
> > +	struct packed_git *p;
> > +
> > +	ignore_packed_keep_in_core = 1;
>
> Here is a global that we are suddenly changing. Should we not be
> returning it to its initial state when this method is complete?

We could, although it won't matter in practice, because we'll want to
keep that setting around for our traversal, after which point
pack-objects will exit.

> > +static int option_parse_cruft_expiration(const struct option *opt,
> > +					 const char *arg, int unset)
> > +{
> > +	if (unset) {
> > +		cruft = 0;
>
> This unassignment of 'cruft' when cruft-expiration is unset with
> --no-cruft-expiration seems odd. I would expect
>
> 	git pack-objects --cruft --no-cruft-expiration
>
> to still make a cruft pack, but not expire anything. It seems that
> your code here makes --no-cruft-expiration disable the --cruft option.

Hmm. I could see compelling reasoning that goes both ways. On the one
hand, `--no-cruft-expiration` (to me, at least) seems to imply "set
`--cruft-expiration` to 'never'". On the other hand, it also matches our
convention of `--no`-prefixed options to unset some value. This
implementation takes the latter approach, though we could easily change
it to set the cruft expiration to "never".

I don't have a strong opinion about which is better, so I'm happy to do
either if you have a better sense about which has more expected
behavior.

> > +		cruft_expiration = 0;
> > +	} else {
> > +		cruft = 1;
> > +		if (arg)
> > +			cruft_expiration = approxidate(arg);
> > +	}
> > +	return 0;
> > +}
> ..
> > +		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
> > +		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
> > +		  N_("expire cruft objects older than <time>"),
> > +		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
>
> > -static int has_loose_object(const struct object_id *oid)
> > +int has_loose_object(const struct object_id *oid)
> >  {
> >  	return check_and_freshen(oid, 0);
> >  }
>
> I'm surprised this hasn't been modified to use a repository pointer.
> Adding another caller here isn't too much debt, though.

Yeah, check_and_freshen() doesn't have a variant that takes a
repository pointer. Good #leftoverbits, I guess!

> > +int has_loose_object(const struct object_id *);
> > +
> >  void assert_oid_type(const struct object_id *oid, enum object_type expect);
>
> ...
>
> > +	test_expect_success "unreachable packed objects are packed (expire $expire)" '
> > +		git init repo &&
> > +		test_when_finished "rm -fr repo" &&
> > +		(
> > +			cd repo &&
> > +
> > +			test_commit packed &&
> > +			git repack -Ad &&
> > +			test_commit other &&
> > +
> > +			git rev-list --objects --no-object-names packed.. >objects &&
> > +			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
> > +			other="$(git pack-objects --delta-base-offset \
> > +				$packdir/pack <objects)" &&
> > +			git prune-packed &&
> > +
> > +			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
>
> I am missing how this test creates _unreachable_ objects. I would expect removal of
> some refs or a 'git reset --hard' somewhere. What am I missing?

For this and the other tests the so-called "unreachable" objects are
technically reachable, but we can treat them as unreachable by putting
them in the "discard" packs list (or by not mentioning them at all to
`git pack-objects --cruft`).

> > +			# remove the unreachable tree, but leave the commit
> > +			# which has it as its root tree in-tact
>
> nit: "intact" is one word.

Thanks; fixed here and in the other test which was added by this commit.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-12-07 15:30   ` Derrick Stolee
@ 2022-02-23 23:35     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-02-23 23:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Tue, Dec 07, 2021 at 10:30:52AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
>
> > +static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
> > +{
> ...
> > +	/*
> > +	 * Re-mark only the fresh packs as kept so that objects in
> > +	 * unknown packs do not halt the reachability traversal early.
> > +	 */
> > +	for (p = get_all_packs(the_repository); p; p = p->next)
> > +		p->pack_keep_in_core = 0;
> > +	mark_pack_kept_in_core(fresh_packs, 1);
>
> Are we ever going to recover this pack_keep_in_core state? Should we
> be saving it somewhere so we can return without mutating this state
> permanently?

In the same sense that we are free to modify the global
ignore_packed_keep_in_core variable (because we only stop caring about
the modified state right before the program is about to exit) we can
freely mutate these variables, too.

> > +	if (prepare_revision_walk(&revs))
> > +		die(_("revision walk setup failed"));
> > +	if (progress)
> > +		progress_state = start_progress(_("Traversing cruft objects"), 0);
> > +	nr_seen = 0;
> > +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
> > +
> > +	stop_progress(&progress_state);
> > +}
> > +
> >  static void read_cruft_objects(void)
> >  {
> >  	struct strbuf buf = STRBUF_INIT;
> > @@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
> >  	mark_pack_kept_in_core(&discard_packs, 0);
> >
> >  	if (cruft_expiration)
> > -		die("--cruft-expiration not yet implemented");
> > +		enumerate_and_traverse_cruft_objects(&fresh_packs);
> >  	else
> >  		enumerate_cruft_objects();
>
> >  basic_cruft_pack_tests never
> > +basic_cruft_pack_tests 2.weeks.ago
>
> I'm surprised these tests didn't require any changes to adapt to the
> new expiration date. But I suppose none of the mtimes were older than
> two weeks ago?

Exactly.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-12-07 15:38   ` Derrick Stolee
@ 2022-02-23 23:37     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-02-23 23:37 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

(Jumping forward a little bit while responding to your review to finish
my train of thought before I log off for today...)

On Tue, Dec 07, 2021 at 10:38:05AM -0500, Derrick Stolee wrote:
> > --- a/t/t5327-pack-objects-cruft.sh
> > +++ b/t/t5327-pack-objects-cruft.sh
> > @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
> >  	)
> >  '
> >
> > +test_expect_success 'repack --cruft generates a cruft pack' '
> > +	git init repo &&
> > +	test_when_finished "rm -fr repo" &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit reachable &&
> > +		git branch -M main &&
> > +		git checkout --orphan other &&
>
> Here is a way to make objects unreachable!

Yes, indeed. And this is the first spot where we *need* to care about
object reachability, because the set of packs that `git repack` passes
over stdin to `git pack-objects --cruft` depends on which objects are
and aren't reachable.

In the tests that exercise `pack-objects --cruft` directly, we can
pretend that certain packs contain only unreachable objects by marking
them as "discarded".

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-12-05 20:46   ` Junio C Hamano
@ 2022-03-01  2:00     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-01  2:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Sun, Dec 05, 2021 at 12:46:19PM -0800, Junio C Hamano wrote:
> Various thoughts on just this part, as the hunk got my attention
> while merging with other topics in 'seen'.
>
> > +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> > +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> > +			die(_("--cruft and -A are incompatible"));
> > +		if (keep_unreachable)
> > +			die(_("--cruft and -k are incompatible"));
> > +		if (!(pack_everything & ALL_INTO_ONE))
> > +			die(_("--cruft must be combined with all-into-one"));
> > +	}
>
> The "reuse similar messages for i18n" topic will encourage us to
> turn this part into:
>
> 	if (pack_everything & PACK_CRUFT && delete_redundant) {
> 		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-A");
> 		if (keep_unreachable)
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-k");
> 		if (!(pack_everything & ALL_INTO_ONE))
> 			die(_("--cruft must be combined with all-into-one"));
> 	}

Thanks, done.

> The conditionals are a bit unpleasant to read and maintain, but I
> guess we cannot help it?

I don't know that I find them unpleasant to read, but perhaps they are a
hassle to maintain (as we add new, mutually-exclusive options). But I
can't seem to think of a better alternative...

> Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
> probably not know that it is the name the code gave to the bit that
> is turned on when given an option externally known under a different
> name (is that "-a"?).
>
> If "--cruft" must be used with "all into one", I wonder if it makes
> sense to make it imply that?  Not in the sense that OPT_BIT()
> initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
> after parse_options() returns, we check PACK_CRUFT and if it is on
> turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
> option, the user won't break us by giving "--no-all-into-one" after
> they gave us "--cruft")?  I didn't think about this part thoroughly
> enough, though.

Yes, `--cruft` must be used with an option that sets ALL_INTO_ONE. Since
we don't have any automatic '--no-' versions of single character
options, I think that this conditional is currently redundant, but I
agree that this code would break if we (a) removed the conditional
you're talking about and (b) allowed passing something like
`--no-all-into-one` which unsets the ALL_INTO_ONE bit.

So setting ALL_INTO_ONE ourselves _after_ option parsing is done makes
sense to me, thanks.
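
Roughly, the shape would be something like the sketch below. This is a
standalone illustration, not the actual change: the bit values are
arbitrary, finalize_pack_everything() is a made-up name, and
`--no-all-into-one` is the hypothetical option from the discussion above.

```c
#include <assert.h>

/* Bit values are illustrative only; they need not match builtin/repack.c. */
#define ALL_INTO_ONE (1u << 0)
#define PACK_CRUFT   (1u << 1)

/*
 * To be called once option parsing has finished: a cruft repack always
 * implies all-into-one, no matter what order the options came in.
 */
static unsigned finalize_pack_everything(unsigned pack_everything)
{
	if (pack_everything & PACK_CRUFT)
		pack_everything |= ALL_INTO_ONE;
	return pack_everything;
}

/* Simulates "--cruft" followed by a hypothetical "--no-all-into-one". */
static void check_cruft_implies_all_into_one(void)
{
	unsigned flags = 0;

	flags |= PACK_CRUFT | ALL_INTO_ONE;	/* "--cruft" via OPT_BIT() */
	flags &= ~ALL_INTO_ONE;			/* "--no-all-into-one" */

	flags = finalize_pack_everything(flags);
	assert(flags == (PACK_CRUFT | ALL_INTO_ONE));
}
```

Because the fixup runs after parsing, a later option that clears the bit
cannot leave us with PACK_CRUFT set and ALL_INTO_ONE unset.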

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-12-06 21:44   ` Derrick Stolee
@ 2022-03-01  2:48     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-01  2:48 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Mon, Dec 06, 2021 at 04:44:31PM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > Generating a non-expiring cruft packs works as follows:
>
> I had trouble parsing the documentation changes below, so I came back
> to this commit message to see if that helps.
>
> >   - Callers provide a list of every pack they know about, and indicate
> >     which packs are about to be removed.
>
> This corresponds to the list over stdin.
>
> >   - All packs which are going to be removed (we'll call these the
> >     redundant ones) are marked as kept in-core, as well as any packs
> >     that `pack-objects` found but the caller did not specify.
>
> Ok, so as an implementation detail we mark these as keep packs.


> >     These packs are presumed to have entered the repository between
> >     the caller collecting packs and invoking `pack-objects`. Since we
> >     do not want to include objects in these packs (because we don't know
> >     which of their objects are or aren't reachable), these are also
> >     marked as kept in-core.
>
> Here, "are presumed" is doing a lot of work. Theoretically, there could
> be three categories:
>
> 1. This pack was just repacked and will be removed because all of its
>    objects were placed into new objects.
>
> 2. Either this pack was repacked and contains important reachable objects
>    OR we did a repack of reachable objects and this pack contained some
>    extra, unreachable objects.
>
> 3. This pack was added to the repository while creating those repacked
>    packs from category 2, so we don't know if things are reachable or
>    not.
>
> So, the packs that we discover on-disk but are not specified over stdin
> are in this third category, but these are grouped with category 1 as we
> will treat them the same.

Ah, I think I caused some unintentional confusion by attaching "are
presumed" to "these packs", when it wasn't clear that "these packs"
meant "ones that aren't listed over stdin".

Since the caller is supposed to provide a complete picture of the
repository as they see it, any packs known to the pack-objects process
that aren't mentioned over stdin are assumed to have entered the
repository after the caller was spun up.

I'll clarify this section of the commit message, since I agree it is
unnecessarily confusing.

> >   - Then, we enumerate all objects in the repository, and add them to
> >     our packing list if they do not appear in an in-core kept pack.
>
> Here, we are looking at all of the objects in category 2 as well as
> loose objects.

We're enumerating any objects that aren't in packs which are marked as
kept in-core (along with loose objects which don't appear in packs that
are marked as kept in-core).

The in-core kept packs are ones that the caller (and I find it's helpful
to read "the caller" as "git repack") has marked as "will delete". So
the non in-core pack(s) that we're looking at here contain all reachable
objects (e.g., like you would get with `git repack -A`).

> > +	Packs unreachable objects into a separate "cruft" pack, denoted
> > +	by the existence of a `.mtimes` file. Pack names provided over
> > +	stdin indicate which packs will remain after a `git repack`.
> > +	Pack names prefixed with a `-` indicate those which will be
> > +	removed. (...)
>
> This description is too tied to 'git repack'. Can we describe the
> input using terms independent of the 'git repack' operation? I need
> to keep reading.
>
> > (...) The contents of the cruft pack are all objects not
> > +	contained in the surviving packs specified by `--keep-pack`)
>
> Now you use --keep-pack, which is a way of specifying a pack as
> "in-core keep" which was not in your commit message. Here, we also
> don't link the packs over stdin to the concept of keep packs.

The mention of `--keep-pack` is a mistake left over from a previous
version; thanks for spotting. Here's a version of the first paragraph
from this piece of documentation which is less tied to `git repack` and
hopefully a little clearer:

    --cruft::
            Packs unreachable objects into a separate "cruft" pack, denoted
            by the existence of a `.mtimes` file. Typically used by `git
            repack --cruft`. Callers provide a list of pack names and
            indicate which packs will remain in the repository, along with
            which packs will be deleted (indicated by the `-` prefix). The
            contents of the cruft pack are all objects not contained in the
            surviving packs which have not exceeded the grace period (see
            `--cruft-expiration` below), or which have exceeded the grace
            period, but are reachable from another object which hasn't.

> > +	which have not exceeded the grace period (see
> > +	`--cruft-expiration` below), or which have exceeded the grace
> > +	period, but are reachable from an other object which hasn't.
>
> And now we think about the grace period! There is so much going on
> that I need to break it down to understand.
>
>   An object is _excluded_ from the new cruft pack if
>
>   1. It is reachable from at least one reference.
>   2. It is in a pack from stdin prefixed with "-"
>   3. It is in a pack specified by `--keep-pack`
>   4. It is in an existing cruft pack and the .mtimes file states
>      that its mtime is at least as recent as the time specified by
>      the --cruft-expiration option.
>
> Breaking it down into a list like this helps me, at least. I'm not
> sure what the best way would look like.

Given some expiration T, cruft packs contain all unreachable objects
which are newer than T, along with any cruft objects (i.e., those not
directly reachable from any ref) which are older than T, but reachable
from another cruft object newer than T.
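
If it helps, that rule can be modeled with a toy sketch like the one below.
It is not how pack-objects implements it: 'struct cruft_object',
mark_keep() and select_cruft() are made-up names, every object in the array
is assumed to already be unreachable from any ref, and "newer than T" is
taken to mean a strictly greater mtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LINKS 4

/* Toy model of an object that is already known to be unreachable. */
struct cruft_object {
	uint32_t mtime;
	size_t links[MAX_LINKS];	/* indexes of objects it refers to */
	size_t nr_links;
	int keep;			/* set if it belongs in the cruft pack */
};

/* Keep object i and, recursively, everything reachable from it. */
static void mark_keep(struct cruft_object *objs, size_t nr, size_t i)
{
	size_t k;

	assert(i < nr);
	if (objs[i].keep)
		return;
	objs[i].keep = 1;
	for (k = 0; k < objs[i].nr_links; k++)
		mark_keep(objs, nr, objs[i].links[k]);
}

/*
 * Objects newer than the expiration are kept outright. Older objects
 * are kept only if some newer cruft object reaches them.
 */
static void select_cruft(struct cruft_object *objs, size_t nr,
			 uint32_t expiration)
{
	size_t i;

	for (i = 0; i < nr; i++)
		objs[i].keep = 0;
	for (i = 0; i < nr; i++)
		if (objs[i].mtime > expiration)
			mark_keep(objs, nr, i);
}
```

So an old tree survives if a recent unreachable commit still points at it,
while an equally old object that nothing recent refers to is left out.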

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (17 preceding siblings ...)
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
@ 2022-03-02  0:57 ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (17 more replies)
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                   ` (2 subsequent siblings)
  21 siblings, 18 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:57 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Here is a reroll of my series to implement "cruft packs", a pack which
stores accumulated unreachable objects, along with a new ".mtimes" file
which tracks each object's last known modification time.

This was on the list towards the end of 2021[1], and I have been
accumulating small changes to it locally for a couple of months now.
Major changes since last time include:

  - Clearer documentation and commit message(s) to better illustrate how
    the feature works and is supposed to be used.

  - Some minor documentation updates to pack-format.txt, which make some
    ambiguous details more explicit.

  - Minor code movement / tweaks to make things easier to read, ensure
    that functions aren't introduced in patches before they are used /
    etc.

  - Moved the new test script to t5328 (instead of t5327, which happens
    to be taken up by a new MIDX bitmap-related test), and purged it of
    all "rm -fr .git/logs" (replacing them with "git reflog expire
    --all --expire=all" instead).

  - A new test which fixes a bug where loose objects which have copies
    that appear in a cruft pack would not get accumulated when doing a
    `--geometric` repack.

For convenience, a range-diff is below. Thanks in advance for taking
another look!

[1]: https://lore.kernel.org/git/cover.1638224692.git.me@ttaylorr.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  97 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 183 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 129 +++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5328-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1810 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5328-pack-objects-cruft.sh

Range-diff against v1:
 1:  a9f7c738e0 !  1:  784ee7e0ee Documentation/technical: add cruft-packs.txt
    @@ Documentation/technical/cruft-packs.txt (new)
     @@
     += Cruft packs
     +
    -+Cruft packs offer an alternative to Git's traditional mechanism of removing
    -+unreachable objects. This document provides an overview of Git's pruning
    -+mechanism, and how cruft packs can be used instead to accomplish the same.
    ++The cruft packs feature offers an alternative to Git's traditional mechanism of
    ++removing unreachable objects. This document provides an overview of Git's
    ++pruning mechanism, and how a cruft pack can be used instead to accomplish the
    ++same.
     +
     +== Background
     +
    @@ Documentation/technical/cruft-packs.txt (new)
     +
     +== Cruft packs
     +
    -+Cruft packs are designed to eliminate the need for storing unreachable objects
    -+in a loose state by including the per-object mtimes in a separate file alongside
    -+a single pack containing all loose objects.
    ++A cruft pack eliminates the need for storing unreachable objects in a loose
    ++state by including the per-object mtimes in a separate file alongside a single
    ++pack containing all loose objects.
     +
     +A cruft pack is written by `git repack --cruft` when generating a new pack.
     +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
    @@ Documentation/technical/cruft-packs.txt (new)
     +Notable alternatives to this design include:
     +
     +  - The location of the per-object mtime data, and
    -+  - Whether cruft packs should be incremental or not.
    ++  - Storing unreachable objects in multiple cruft packs.
     +
     +On the location of mtime data, a new auxiliary file tied to the pack was chosen
     +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
     +support for optional chunks of data, it may make sense to consolidate the
     +`.mtimes` format into the `.idx` itself.
     +
    -+Incremental cruft packs (i.e., where each time a repository is repacked a new
    -+cruft pack is generated containing only the unreachable objects introduced since
    -+the last time a cruft pack was written) are significantly more complicated to
    -+construct, and so aren't pursued here. The obvious drawback to the current
    -+implementation is that the entire cruft pack must be re-written from scratch.
    ++Storing unreachable objects among multiple cruft packs (e.g., creating a new
    ++cruft pack during each repacking operation including only unreachable objects
    ++which aren't already stored in an earlier cruft pack) is significantly more
    ++complicated to construct, and so isn't pursued here. The obvious drawback to
    ++the current implementation is that the entire cruft pack must be re-written from
    ++scratch.
 2:  7d4ae7bd3e !  2:  101b34660c pack-mtimes: support reading .mtimes files
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
     +
     +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of mtimes (one per packed object, num_objects in total, each
    -+    a 4-byte unsigned integer in network order), in the same order as
    -+    objects appear in the index file (e.g., the first entry in the mtime
    -+    table corresponds to the object with the lowest lexically-sorted
    -+    oid). The mtimes count standard epoch seconds.
    ++  - A table of 4-byte unsigned integers in network order. The ith
    ++    value is the modification time (mtime) of the ith object in the
    ++    corresponding pack by lexicographic (index) order. The mtimes
    ++    count standard epoch seconds.
     +
    -+  - A trailer, containing a:
    -+
    -+    checksum of the corresponding packfile, and
    -+
    -+    a checksum of all of the above.
    ++  - A trailer, containing a checksum of the corresponding packfile,
    ++    and a checksum of all of the above (each having length according
    ++    to the specified hash function).
     +
     +All 4-byte numbers are in network order.
     +
    @@ pack-mtimes.c (new)
     +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
     +}
     +
    -+int pack_has_mtimes(struct packed_git *p)
    -+{
    -+	struct stat st;
    -+	char *fname = pack_mtimes_filename(p);
    -+
    -+	if (stat(fname, &st) < 0) {
    -+		if (errno == ENOENT)
    -+			return 0;
    -+		die_errno(_("could not stat %s"), fname);
    -+	}
    -+
    -+	free(fname);
    -+	return 1;
    -+}
    -+
     +#define MTIMES_HEADER_SIZE (12)
     +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
     +
    @@ pack-mtimes.c (new)
     +	struct stat st;
     +	void *data = NULL;
     +	size_t mtimes_size;
    ++	struct mtimes_header header;
     +	uint32_t *hdr;
     +
     +	fd = git_open(mtimes_file);
    @@ pack-mtimes.c (new)
     +
     +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
     +
    -+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
    ++	header.signature = ntohl(hdr[0]);
    ++	header.version = ntohl(hdr[1]);
    ++	header.hash_id = ntohl(hdr[2]);
    ++
    ++	if (header.signature != MTIMES_SIGNATURE) {
     +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
     +		goto cleanup;
     +	}
     +
    -+	if (ntohl(*++hdr) != 1) {
    ++	if (header.version != 1) {
     +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.version);
     +		goto cleanup;
     +	}
    -+	hdr++;
    -+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
    ++
    ++	if (!(header.hash_id == 1 || header.hash_id == 2)) {
     +		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.hash_id);
     +		goto cleanup;
     +	}
     +
    @@ pack-mtimes.h (new)
     +
     +struct packed_git;
     +
    -+int pack_has_mtimes(struct packed_git *p);
     +int load_pack_mtimes(struct packed_git *p);
     +
     +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
    @@ pack-mtimes.h (new)
     +#endif
     
      ## packfile.c ##
    -@@ packfile.c: void close_pack_revindex(struct packed_git *p) {
    +@@ packfile.c: static void close_pack_revindex(struct packed_git *p)
      	p->revindex_data = NULL;
      }
      
    -+void close_pack_mtimes(struct packed_git *p) {
    ++static void close_pack_mtimes(struct packed_git *p)
    ++{
     +	if (!p->mtimes_map)
     +		return;
     +
    @@ packfile.c: static void prepare_pack(const char *full_name, size_t full_name_len
      		string_list_append(data->garbage, full_name);
      	else
      		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
    -
    - ## packfile.h ##
    -@@ packfile.h: uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
    - unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
    - void close_pack_windows(struct packed_git *);
    - void close_pack_revindex(struct packed_git *);
    -+void close_pack_mtimes(struct packed_git *p);
    - void close_pack(struct packed_git *);
    - void close_object_store(struct raw_object_store *o);
    - void unuse_pack(struct pack_window **);
 3:  7f4612e859 =  3:  a94d7dfeb3 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 4:  ea245b7216 =  4:  1e0ed363ae chunk-format.h: extract oid_version()
 5:  deece9eb70 !  5:  5236490688 pack-mtimes: support writing pack .mtimes files
    @@ pack-objects.h: struct packing_data {
      	unsigned int *tree_depth;
      	unsigned char *layer;
     +
    -+	/* cruft packs */
    ++	/*
    ++	 * Used when writing cruft packs.
    ++	 *
    ++	 * Object mtimes are stored in pack order when writing, but
    ++	 * written out in lexicographic (index) order.
    ++	 */
     +	uint32_t *cruft_mtime;
      };
      
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	hashwrite_be32(f, oid_version(the_hash_algo));
     +}
     +
    ++/*
    ++ * Writes the object mtimes of "objects" for use in a .mtimes file.
    ++ * Note that objects must be in lexicographic (index) order, which is
    ++ * the expected ordering of these values in the .mtimes file.
    ++ */
     +static void write_mtimes_objects(struct hashfile *f,
     +				 struct packing_data *to_pack,
     +				 struct pack_idx_entry **objects,
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	write_mtimes_objects(f, to_pack, objects, nr_objects);
     +	write_mtimes_trailer(f, hash);
     +
    -+	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
    ++	if (adjust_shared_perm(mtimes_name) < 0)
     +		die(_("failed to make %s readable"), mtimes_name);
     +
     +	finalize_hashfile(f, NULL,
    @@ pack-write.c: void stage_tmp_packfiles(struct strbuf *name_buffer,
     +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
     +						    nr_written,
     +						    hash);
    -+		if (adjust_shared_perm(mtimes_tmp_name))
    -+			die_errno("unable to make temporary mtimes file readable");
     +	}
     +
      	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 6:  e0a7b3b310 !  6:  78313bc441 t/helper: add 'pack-mtimes' test-tool
    @@ t/helper/test-pack-mtimes.c (new)
     +#include "packfile.h"
     +#include "pack-mtimes.h"
     +
    -+static int dump_mtimes(struct packed_git *p)
    ++static void dump_mtimes(struct packed_git *p)
     +{
     +	uint32_t i;
     +	if (load_pack_mtimes(p) < 0)
    @@ t/helper/test-pack-mtimes.c (new)
     +		printf("%s %"PRIu32"\n",
     +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
     +	}
    -+
    -+	return 0;
     +}
     +
     +static const char *pack_mtimes_usage = "\n"
    @@ t/helper/test-pack-mtimes.c (new)
     +
     +	strbuf_release(&buf);
     +
    -+	return p ? dump_mtimes(p) : 1;
    ++	if (!p)
    ++		die("could not find pack '%s'", argv[1]);
    ++
    ++	dump_mtimes(p);
    ++
    ++	return 0;
     +}
     
      ## t/helper/test-tool.c ##
 7:  5710933127 =  7:  142098668d builtin/pack-objects.c: return from create_object_entry()
 8:  66165917a4 !  8:  2517a6be3d builtin/pack-objects.c: --cruft without expiration
    @@ Commit message
             which packs are about to be removed.
     
           - All packs which are going to be removed (we'll call these the
    -        redundant ones) are marked as kept in-core, as well as any packs
    -        that `pack-objects` found but the caller did not specify.
    +        redundant ones) are marked as kept in-core.
     
    -        These packs are presumed to have entered the repository between
    -        the caller collecting packs and invoking `pack-objects`. Since we
    -        do not want to include objects in these packs (because we don't know
    -        which of their objects are or aren't reachable), these are also
    -        marked as kept in-core.
    +        Any packs the caller did not mention (but are known to the
    +        `pack-objects` process) are also marked as kept in-core. Packs not
    +        mentioned by the caller are assumed to be unknown to them, i.e.,
    +        they entered the repository after the caller decided which packs
    +        should be kept and which should be discarded.
    +
    +        Since we do not want to include objects in these "unknown" packs
    +        (because we don't know which of their objects are or aren't
    +        reachable), these are also marked as kept in-core.
     
           - Then, we enumerate all objects in the repository, and add them to
             our packing list if they do not appear in an in-core kept pack.
    @@ Documentation/git-pack-objects.txt: SYNOPSIS
      	[--local] [--incremental] [--window=<n>] [--depth=<n>]
      	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
     +	[--cruft] [--cruft-expiration=<time>]
    - 	[--stdout [--filter=<filter-spec>] | base-name]
    - 	[--shallow] [--keep-true-parents] [--[no-]sparse] < object-list
    + 	[--stdout [--filter=<filter-spec>] | <base-name>]
    + 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
      
     @@ Documentation/git-pack-objects.txt: base-name::
      Incompatible with `--revs`, or options that imply `--revs` (such as
    @@ Documentation/git-pack-objects.txt: base-name::
      
     +--cruft::
     +	Packs unreachable objects into a separate "cruft" pack, denoted
    -+	by the existence of a `.mtimes` file. Pack names provided over
    -+	stdin indicate which packs will remain after a `git repack`.
    -+	Pack names prefixed with a `-` indicate those which will be
    -+	removed. The contents of the cruft pack are all objects not
    -+	contained in the surviving packs specified by `--keep-pack`)
    -+	which have not exceeded the grace period (see
    ++	by the existence of a `.mtimes` file. Typically used by `git
    ++	repack --cruft`. Callers provide a list of pack names and
    ++	indicate which packs will remain in the repository, along with
    ++	which packs will be deleted (indicated by the `-` prefix). The
    ++	contents of the cruft pack are all objects not contained in the
    ++	surviving packs which have not exceeded the grace period (see
     +	`--cruft-expiration` below), or which have exceeded the grace
     +	period, but are reachable from another object which hasn't.
     ++
    ++When the input lists a pack containing all reachable objects (and lists
    ++all other packs as pending deletion), the corresponding cruft pack will
    ++contain all unreachable objects (with mtime newer than the
    ++`--cruft-expiration`) along with any unreachable objects whose mtime is
    ++older than the `--cruft-expiration`, but are reachable from an
    ++unreachable object whose mtime is newer than the `--cruft-expiration`.
    +++
     +Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
     +`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
     +options which imply `--revs`. Also incompatible with `--max-pack-size`;
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
      	string_list_clear(&exclude_packs, 0);
      }
      
    -+static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
    -+				  struct packed_git *pack, off_t offset,
    -+				  const char *name, uint32_t mtime)
    ++static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
    ++				   struct packed_git *pack, off_t offset,
    ++				   const char *name, uint32_t mtime)
     +{
     +	struct object_entry *entry;
     +
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +	if (entry) {
     +		if (name) {
     +			entry->hash = pack_name_hash(name);
    -+			entry->no_try_delta = name && no_try_delta(name);
    ++			entry->no_try_delta = no_try_delta(name);
     +		}
     +	} else {
     +		if (!want_object_in_pack(oid, 0, &pack, &offset))
    -+			return 0;
    ++			return;
     +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
     +			/*
     +			 * If a traversed tree has a missing blob then we want
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +			 * limited to "ensure non-tip blobs which don't exist in
     +			 * packs do exist via loose objects". Confused?
     +			 */
    -+			return 0;
    ++			return;
     +		}
     +
     +		entry = create_object_entry(oid, type, pack_name_hash(name),
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +
     +	if (mtime > oe_cruft_mtime(&to_pack, entry))
     +		oe_set_cruft_mtime(&to_pack, entry, mtime);
    -+	return 1;
    ++	return;
     +}
     +
     +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
      		read_packs_list_from_stdin();
      		if (rev_list_unpacked)
      			add_unreachable_loose_objects();
    --	} else if (!use_internal_rev_list)
    -+	} else if (cruft)
    ++	} else if (cruft) {
     +		read_cruft_objects();
    -+	else if (!use_internal_rev_list)
    + 	} else if (!use_internal_rev_list) {
      		read_object_list_from_stdin();
    - 	else {
    - 		get_object_list(rp.nr, rp.v);
    + 	} else {
     
      ## object-file.c ##
     @@ object-file.c: int has_loose_object_nonlocal(const struct object_id *oid)
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
      /*
     
    - ## t/t5327-pack-objects-cruft.sh (new) ##
    + ## t/t5328-pack-objects-cruft.sh (new) ##
     @@
     +#!/bin/sh
     +
    @@ t/t5327-pack-objects-cruft.sh (new)
     +
     +			git reset --hard reachable &&
     +			git tag -d cruft &&
    -+			rm -fr .git/logs &&
    ++			git reflog expire --all --expire=all &&
     +
     +			# remove the unreachable tree, but leave the commit
    -+			# which has it as its root tree in-tact
    ++			# which has it as its root tree intact
     +			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
     +
     +			git repack -Ad &&
    @@ t/t5327-pack-objects-cruft.sh (new)
     +
     +			git reset --hard reachable &&
     +			git tag -d cruft &&
    -+			rm -fr .git/logs &&
    ++			git reflog expire --all --expire=all &&
     +
     +			# remove the unreachable blob, but leave the commit (and
    -+			# the root tree of that commit) in-tact
    ++			# the root tree of that commit) intact
     +			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
     +
     +			git repack -Ad &&
 9:  02f7fce788 =  9:  6f0e84273f reachable: add options to add_unseen_recent_objects_to_traversal
10:  52e9ac5710 = 10:  a8bde361f9 reachable: report precise timestamps from objects in cruft packs
11:  37fda94785 ! 11:  d68ce28132 builtin/pack-objects.c: --cruft with expiration
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## builtin/pack-objects.c ##
    -@@ builtin/pack-objects.c: static int add_cruft_object_entry(const struct object_id *oid, enum object_type
    - 	return 1;
    +@@ builtin/pack-objects.c: static void add_cruft_object_entry(const struct object_id *oid, enum object_type
    + 	return;
      }
      
     +static void show_cruft_object(struct object *obj, const char *name, void *data)
    @@ builtin/pack-objects.c: static void read_cruft_objects(void)
      		enumerate_cruft_objects();
      
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: basic_cruft_pack_tests () {
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: basic_cruft_pack_tests () {
      }
      
      basic_cruft_pack_tests never
12:  a05675ab83 ! 12:  e5317cd472 builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c: static int write_midx_included_packs(struct string_list *inclu
      {
      	struct child_process cmd = CHILD_PROCESS_INIT;
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 	int show_progress = isatty(2);
    + 	int show_progress;
      
      	/* variables to be filled by option parsing */
     -	int pack_everything = 0;
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
     +		OPT_BIT(0, "cruft", &pack_everything,
     +				N_("same as -a, pack unreachable cruft objects separately"),
    -+				   PACK_CRUFT | ALL_INTO_ONE),
    ++				   PACK_CRUFT),
     +		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
     +				N_("with -C, expire objects older than this")),
      		OPT_BOOL('d', NULL, &delete_redundant,
      				N_("remove redundant packs, and run git-prune-packed")),
      		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 	if (keep_unreachable &&
      	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
    - 		die(_("--keep-unreachable and -A are incompatible"));
    -+	if (pack_everything & PACK_CRUFT && delete_redundant) {
    + 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
    + 
    ++	if (pack_everything & PACK_CRUFT) {
    ++		pack_everything |= ALL_INTO_ONE;
    ++
     +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
    -+			die(_("--cruft and -A are incompatible"));
    ++			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
     +		if (keep_unreachable)
    -+			die(_("--cruft and -k are incompatible"));
    -+		if (!(pack_everything & ALL_INTO_ONE))
    -+			die(_("--cruft must be combined with all-into-one"));
    ++			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
     +	}
    - 
    ++
      	if (write_bitmaps < 0) {
      		if (!write_midx &&
    + 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      	if (pack_everything & ALL_INTO_ONE) {
      		repack_promisor_objects(&po_args, &names);
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      			for_each_string_list_item(item, &names) {
      				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
      					     packtmp_name, item->string);
    -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 		return ret;
    - 
    - 	if (geometry) {
    -+		struct packed_git *p;
    - 		FILE *in = xfdopen(cmd.in, "w");
    - 		/*
    - 		 * The resulting pack should contain all objects in packs that
    -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
    - 		for (i = geometry->split; i < geometry->pack_nr; i++)
    - 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
    -+
    -+		for (p = get_all_packs(the_repository); p; p = p->next) {
    -+			if (!p->is_cruft)
    -+				continue;
    -+			fprintf(in, "^%s\n", pack_basename(p));
    -+		}
    - 		fclose(in);
    - 	}
    - 
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      	if (!names.nr && !po_args.quiet)
      		printf_ln(_("Nothing new to pack."));
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
      	}
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git branch -D other &&
     +		git tag -d unreachable &&
     +		# objects are not cruft if they are contained in the reflogs
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git rev-list --objects --all --no-object-names >reachable.raw &&
     +		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git checkout main &&
     +		git branch -D other &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft -d &&
     +
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git checkout main &&
     +		git branch -D other &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft &&
     +
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		test_cmp before after
     +	)
     +'
    ++
    ++test_expect_success 'repack --geometric collects once-cruft objects' '
    ++	git init repo &&
    ++	test_when_finished "rm -fr repo" &&
    ++	(
    ++		cd repo &&
    ++
    ++		test_commit reachable &&
    ++		git repack -Ad &&
    ++		git branch -M main &&
    ++
    ++		git checkout --orphan other &&
    ++		git rm -rf . &&
    ++		test_commit --no-tag cruft &&
    ++		cruft="$(git rev-parse HEAD)" &&
    ++
    ++		git checkout main &&
    ++		git branch -D other &&
    ++		git reflog expire --all --expire=all &&
    ++
    ++		# Pack the objects created in the previous step into a cruft
    ++		# pack. Intentionally leave loose copies of those objects
    ++		# around so we can pick them up in a subsequent --geometric
    ++		# repack.
    ++		git repack --cruft &&
    ++
    ++		# Now make those objects reachable, and ensure that they are
    ++		# packed into the new pack created via a --geometric repack.
    ++		git update-ref refs/heads/other $cruft &&
    ++
    ++		# Without this object, the set of unpacked objects is exactly
    ++		# the set of objects already in the cruft pack. Tweak that set
    ++		# to ensure we do not overwrite the cruft pack entirely.
    ++		test_commit reachable2 &&
    ++
    ++		find $packdir -name "pack-*.idx" | sort >before &&
    ++		git repack --geometric=2 -d &&
    ++		find $packdir -name "pack-*.idx" | sort >after &&
    ++
    ++		{
    ++			git rev-list --objects --no-object-names $cruft &&
    ++			git rev-list --objects --no-object-names reachable..reachable2
    ++		} >want.raw &&
    ++		sort want.raw >want &&
    ++
    ++		pack=$(comm -13 before after) &&
    ++		git show-index <$pack >objects.raw &&
    ++
    ++		cut -d" " -f2 objects.raw | sort >got &&
    ++
    ++		test_cmp want got
    ++	)
    ++'
    ++
     +test_expect_success 'cruft repack with no reachable objects' '
     +	git init repo &&
     +	test_when_finished "rm -fr repo" &&
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +
     +		git for-each-ref --format="delete %(refname)" >in &&
     +		git update-ref --stdin <in &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +		rm -fr .git/index &&
     +
     +		git repack --cruft -d &&
13:  0d2dfaa062 ! 13:  b548dbbf80 builtin/repack.c: allow configuring cruft pack generation
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				       &existing_kept_packs);
      		if (ret)
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
      	)
      '
      
14:  fd50c39657 = 14:  e6eee7f15c builtin/repack.c: use named flags for existing_packs
15:  b2937ceda7 ! 15:  b09dbc9fe5 builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      		for_each_string_list_item(item, existing_nonkept_packs) {
      			if ((uintptr_t)item->util & DELETE_PACK)
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreacha
     +
     +		git reset --hard $unreachable^ &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft -d &&
     +
16:  394de0199f ! 16:  7a21ae1494 builtin/gc.c: conditionally avoid pruning objects via loose
    @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      			if (quiet)
      				strvec_push(&prune, "--no-progress");
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert
     +		git branch -D other &&
     +		git tag -d unreachable &&
     +		# objects are not cruft if they are contained in the reflogs
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git rev-list --objects --all --no-object-names >reachable.raw &&
     +		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
17:  99aace8e16 ! 17:  b729b80963 sha1-file.c: don't freshen cruft packs
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
      		return 1;
      	if (!freshen_file(e.p->pack_name))
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
      	)
      '
      
-- 
2.35.1.73.gccc5557600


* [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
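Not for the commit message: a minimal sketch of the tip selection that
step 1 of the new document's enumeration describes. The names
`sketch_object`, `sketch_is_tip` and `sketch_count_tips` are hypothetical
and appear nowhere in this series, and counting an mtime equal to the
cutoff as still inside the grace period is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one enumerated object. */
struct sketch_object {
	int in_kept_pack;	/* nonzero if some kept pack contains it */
	uint32_t mtime;		/* last known mtime, in epoch seconds */
};

/*
 * An object seeds the traversal when it lives outside every kept pack
 * and has not yet aged past the expiration cutoff.
 * Assumption: mtime == cutoff still counts as inside the grace period.
 */
static int sketch_is_tip(const struct sketch_object *obj, uint32_t cutoff)
{
	assert(obj);
	return !obj->in_kept_pack && obj->mtime >= cutoff;
}

/* Count how many of the nr objects would become traversal tips. */
static size_t sketch_count_tips(const struct sketch_object *objs, size_t nr,
				uint32_t cutoff)
{
	size_t i, tips = 0;

	assert(objs || !nr);
	for (i = 0; i < nr; i++)
		if (sketch_is_tip(&objs[i], cutoff))
			tips++;
	return tips;
}
```

Objects reached from these tips in step 2 are packed regardless of their
own mtimes, which is why only the tips need the grace period check.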
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..2c3c5d93f8
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,97 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, via
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking as a traversal tip any object which (a) is
+     not contained in a kept-pack, and (b) has an mtime within the grace
+     period.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.35.1.73.gccc5557600



* [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:22     ` Derrick Stolee
  2022-03-02  0:58   ` [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (15 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
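Not for the commit message: a self-contained sketch of how a reader might
validate and index the `.mtimes` layout documented in this patch, working
on an in-memory buffer instead of an mmap. Every `sketch_`-prefixed name
is hypothetical, and having the caller pass the raw hash length is an
assumption made to keep the example independent of `the_hash_algo`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_HEADER_SIZE 12 /* signature, version, hash id */

/* Decode a 4-byte network-order (big-endian) integer. */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Check the size and header of an in-memory .mtimes image.
 * hash_len is the raw hash size (assumption: the caller passes 20 for
 * SHA-1 or 32 for SHA-256).
 * Returns 0 if the image looks valid, -1 otherwise.
 */
static int sketch_check_mtimes(const unsigned char *buf, size_t len,
			       uint32_t num_objects, size_t hash_len)
{
	size_t min_size = SKETCH_HEADER_SIZE + 2 * hash_len;
	uint32_t hash_id;

	assert(buf);
	if (len < min_size)
		return -1; /* too small for header plus trailer */
	if (len - min_size != (size_t)num_objects * sizeof(uint32_t))
		return -1; /* table must hold one entry per object */
	if (sketch_get_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (sketch_get_be32(buf + 4) != 1)
		return -1; /* only version 1 is defined */
	hash_id = sketch_get_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;
	return 0;
}

/* Return the mtime of the object at index position pos. */
static uint32_t sketch_nth_mtime(const unsigned char *buf, uint32_t pos)
{
	return sketch_get_be32(buf + SKETCH_HEADER_SIZE + 4 * (size_t)pos);
}
```

The offset arithmetic mirrors the `+ 3` in nth_packed_mtime(): the table
starts three 4-byte words into the file, right after the header.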
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 129 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 186 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 6f0b4b775f..1b186f4fd7 100644
--- a/Makefile
+++ b/Makefile
@@ -959,6 +959,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a75..f908f7d5dd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 6f89482df0..9b227661f2 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..50caa34381
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,129 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		/* nothing was opened or mapped, so skip the cleanup path */
+		return -1;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+	if (ret)
+		goto cleanup;
+
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 178e611f09..385970cb7b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v2 04/17] chunk-format.h: extract oid_version()
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 265c010122..f678d2c4a1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1911,7 +1899,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 865170bad0..65e670c5e2 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.35.1.73.gccc5557600



* [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing a `struct packing_data` whose
`cruft_mtime` array holds the timestamp of each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
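Not for the commit message: a rough sketch of the bytes that
write_mtimes_header() and write_mtimes_objects() produce, written against
a plain buffer instead of a hashfile. The `sketch_`-prefixed names are
hypothetical, and the two trailing checksums are deliberately left out,
since in the patch they come from the pack hash and from the hashfile
machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v as a 4-byte network-order (big-endian) integer. */
static void sketch_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Serialize the header and the mtime table into out.
 * mtimes[] must already be in index (lexicographic) order.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t sketch_write_mtimes(unsigned char *out, size_t out_len,
				  uint32_t hash_id,
				  const uint32_t *mtimes, uint32_t nr)
{
	size_t need = 12 + (size_t)nr * 4;
	uint32_t i;

	assert(out && (mtimes || !nr));
	if (out_len < need)
		return 0;

	sketch_put_be32(out, 0x4d544d45u);	/* "MTME" */
	sketch_put_be32(out + 4, 1);		/* version */
	sketch_put_be32(out + 8, hash_id);	/* 1 = SHA-1, 2 = SHA-256 */
	for (i = 0; i < nr; i++)
		sketch_put_be32(out + 12 + 4 * (size_t)i, mtimes[i]);
	return need;
}
```

The caller supplies the table in index order because the patch stores
mtimes in pack order while packing but iterates over written_list, which
is sorted by object name by the time the `.mtimes` file is written.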
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..270280c4df 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +546,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +559,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.35.1.73.gccc5557600


* [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In an upcoming patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.
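
As a rough illustration of the output shape only, the sketch below
formats one "<object name> <mtime>" line the way dump_mtimes() in this
patch prints it. The helper name format_mtime_line() is hypothetical and
is not part of git; the real tool reads the values from a loaded pack.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of git): write one line of the form
 * "<hex object name> <mtime>\n" into 'out'. Returns what snprintf()
 * returns, i.e. the number of characters the full line requires.
 */
static int format_mtime_line(char *out, size_t len,
			     const char *hex_oid, uint32_t mtime)
{
	assert(out && hex_oid);
	return snprintf(out, len, "%s %" PRIu32 "\n", hex_oid, mtime);
}
```

Using PRIu32 matches the uint32_t timestamps that the test-tool prints,
which keeps the format string portable across platforms.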

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index 1b186f4fd7..5c0ed1ade7 100644
--- a/Makefile
+++ b/Makefile
@@ -727,6 +727,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index e6ec69cf32..7d472b31fd 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -47,6 +47,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 20756eefdd..0ac4f32955 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.35.1.73.gccc5557600


* [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Rather than
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 385970cb7b..3f08a3c63a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.35.1.73.gccc5557600


* [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
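
The "noting and updating their mtimes" step can be sketched as follows.
This is a simplified stand-in, not the code in this patch: the types
cruft_entry and cruft_list and the function record_copy() are
hypothetical, and a fixed-size array replaces the real packing list. It
only shows the rule that add_cruft_object_entry() applies, namely that
the newest mtime seen for an object wins.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_ENTRIES 16

/* Hypothetical stand-ins for the packing list; not git's data structures. */
struct cruft_entry {
	char oid[65];
	uint32_t mtime;
};

struct cruft_list {
	struct cruft_entry entries[SKETCH_MAX_ENTRIES];
	size_t nr;
};

/*
 * Record one copy of an object. If the object was already seen, keep
 * whichever mtime is newer. Returns 0 on success, -1 if the list is full.
 */
static int record_copy(struct cruft_list *list, const char *oid,
		       uint32_t mtime)
{
	size_t i;

	for (i = 0; i < list->nr; i++) {
		struct cruft_entry *e = &list->entries[i];
		if (!strcmp(e->oid, oid)) {
			if (mtime > e->mtime)
				e->mtime = mtime;
			return 0;
		}
	}

	if (list->nr == SKETCH_MAX_ENTRIES)
		return -1;

	snprintf(list->entries[list->nr].oid,
		 sizeof(list->entries[list->nr].oid), "%s", oid);
	list->entries[list->nr].mtime = mtime;
	list->nr++;
	return 0;
}
```

Feeding the same object three times with different mtimes leaves a
single entry carrying the largest of them, which is the behavior the
"multiple cruft packs" test in this patch checks for.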

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are not going to be removed (the surviving packs)
    are marked as kept in-core. The packs which are going to be removed
    (we'll call these the redundant ones) are left unmarked, so their
    objects remain eligible for the cruft pack.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded. We do not want to
    include objects from these "unknown" packs, because we don't know
    which of their objects are or aren't reachable.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
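
The pack classification can be sketched roughly as below, following what
read_cruft_objects() in this patch does with its input: lines prefixed
with `-` name packs to discard, other lines name packs that survive, and
anything not listed is treated as unknown. The names pack_role,
classify_pack() and kept_in_core() are hypothetical, and the sketch uses
a linear scan instead of sorted string lists. It does not handle a pack
that is listed both ways.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a pack relative to the caller's input. */
enum pack_role {
	PACK_FRESH,	/* listed without a prefix: survives the repack */
	PACK_DISCARD,	/* listed with a '-' prefix: about to be removed */
	PACK_UNKNOWN	/* not listed: the caller did not know about it */
};

/* Find how (or whether) the input lines mention 'pack_name'. */
static enum pack_role classify_pack(const char *pack_name,
				    const char **input, size_t nr_input)
{
	size_t i;

	for (i = 0; i < nr_input; i++) {
		const char *line = input[i];
		if (line[0] == '-') {
			if (!strcmp(line + 1, pack_name))
				return PACK_DISCARD;
		} else if (!strcmp(line, pack_name)) {
			return PACK_FRESH;
		}
	}
	return PACK_UNKNOWN;
}

/*
 * Only packs slated for removal contribute objects to the cruft pack;
 * surviving and unknown packs are both treated as kept in-core.
 */
static int kept_in_core(enum pack_role role)
{
	return role != PACK_DISCARD;
}
```

Treating unknown packs as kept is the conservative choice: their objects
stay where they are instead of being swept into the cruft pack with a
reachability status nobody has checked.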

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5328-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5328-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3f08a3c63a..5ba4fc9c2c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4062,7 +4244,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4089,6 +4271,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4145,7 +4336,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4153,6 +4344,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else {
diff --git a/object-file.c b/object-file.c
index 8be57f48de..e80da1368d 100644
--- a/object-file.c
+++ b/object-file.c
@@ -996,7 +996,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 9b227661f2..6b025dc670 100644
--- a/object-store.h
+++ b/object-store.h
@@ -334,6 +334,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5328-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.35.1.73.gccc5557600


* [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:19     ` Derrick Stolee
  2022-03-02  0:58   ` [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (8 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

add_unseen_recent_objects_to_traversal() behaves very similarly to what
we will need in pack-objects in order to implement cruft packs with
expiration. But it is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
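
The shape of such a callback interface can be sketched as below. This is
a standalone approximation under stated assumptions, not the code in
this patch: the sketch_* types and functions are hypothetical, the extra
`void *data` argument is added only to make the example testable, and
the "recent" cutoff and kept-pack check are simplified (here a kept pack
only hides the copies stored in that pack).

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical stand-in for 'struct packed_git'. */
struct sketch_pack {
	const char *name;
	int kept_in_core;
};

/* One object copy; 'pack' is NULL for a loose object. */
struct sketch_object {
	const char *oid;
	struct sketch_pack *pack;
	off_t offset;
	uint32_t mtime;
};

/* Callback reporting an object's time plus its (pack, offset) pair. */
typedef void sketch_report_fn(const char *oid, struct sketch_pack *pack,
			      off_t offset, uint32_t mtime, void *data);

/*
 * Report every object newer than 'cutoff', optionally skipping objects
 * stored in packs marked as kept in-core. Returns the number reported.
 */
static size_t sketch_report_recent(const struct sketch_object *objs,
				   size_t nr, uint32_t cutoff,
				   int ignore_in_core_kept_packs,
				   sketch_report_fn *cb, void *data)
{
	size_t i, reported = 0;

	for (i = 0; i < nr; i++) {
		const struct sketch_object *o = &objs[i];

		if (ignore_in_core_kept_packs && o->pack &&
		    o->pack->kept_in_core)
			continue;
		if (o->mtime <= cutoff)
			continue;
		if (cb)
			cb(o->oid, o->pack, o->offset, o->mtime, data);
		reported++;
	}
	return reported;
}

/* Example callback: count how many reported objects came from a pack. */
static void sketch_count_packed(const char *oid, struct sketch_pack *pack,
				off_t offset, uint32_t mtime, void *data)
{
	(void)oid;
	(void)offset;
	(void)mtime;
	if (pack)
		(*(size_t *)data)++;
}
```

Passing a NULL callback and a zero flag reproduces the old behavior,
which is how the existing callers in this patch are updated.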

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ba4fc9c2c..1ef333717d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.35.1.73.gccc5557600



* [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
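
The lookup described above can be modelled in isolation. What follows is
only a minimal sketch: 'struct toy_pack', toy_load_mtimes() and
toy_object_mtime() are hypothetical stand-ins invented for illustration,
not git's real structures or functions. It shows the two properties the
message relies on: the per-object table is loaded lazily and only once,
and non-cruft packs keep reporting the mtime of the pack itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct packed_git'; not git's real layout. */
struct toy_pack {
	uint32_t pack_mtime;    /* mtime of the pack file itself */
	int is_cruft;           /* nonzero when a .mtimes file accompanies the pack */
	const uint32_t *mtimes; /* per-object table, NULL until first use */
	size_t nr_objects;      /* number of entries in 'mtimes' once loaded */
	int loads;              /* counts how often the slow path ran */
};

/* Pretend on-disk .mtimes contents, indexed by object position. */
static const uint32_t toy_mtimes_on_disk[] = { 100, 250, 900 };

/* Models the early exit: a second call does no work. */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (p->mtimes)
		return 0;
	p->mtimes = toy_mtimes_on_disk;
	p->nr_objects = sizeof(toy_mtimes_on_disk) / sizeof(toy_mtimes_on_disk[0]);
	p->loads++;
	return 0;
}

/* Per-object mtime for cruft packs, pack mtime for everything else. */
static uint32_t toy_object_mtime(struct toy_pack *p, size_t pos)
{
	uint32_t mtime = p->pack_mtime;

	if (p->is_cruft) {
		if (toy_load_mtimes(p) < 0)
			return mtime;
		assert(pos < p->nr_objects);
		mtime = p->mtimes[pos];
	}
	return mtime;
}
```

Calling toy_object_mtime() repeatedly on the same cruft pack bumps
'loads' only on the first call, mirroring the "only the first call is
slow" note above.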

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.35.1.73.gccc5557600



* [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  7:42     ` Junio C Hamano
  2022-03-02  0:58   ` [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (6 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever
some other recent (i.e., non-prunable) unreachable object reaches them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
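
The selection rules above can be sketched on a toy object graph. This is
an illustration under stated assumptions, not the code in this patch:
'struct toy_object', toy_walk() and toy_collect_cruft() are hypothetical
names, each object has at most two outgoing links, and "recent" is taken
to mean an mtime strictly newer than the expiration value.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 2
#define TOY_NONE (-1)

/* Hypothetical object record; git's real structures differ. */
struct toy_object {
	long mtime;               /* last known modification time */
	int in_kept_pack;         /* nonzero if stored in a kept pack */
	int links[TOY_MAX_LINKS]; /* indices of referenced objects, or TOY_NONE */
	int packed;               /* output: nonzero if it goes into the cruft pack */
};

/* Add 'idx' and everything it reaches, halting at kept packs. */
static void toy_walk(struct toy_object *objs, size_t nr, int idx)
{
	int i;

	if (idx == TOY_NONE)
		return;
	assert(idx >= 0 && (size_t)idx < nr);
	if (objs[idx].in_kept_pack || objs[idx].packed)
		return;
	objs[idx].packed = 1;
	for (i = 0; i < TOY_MAX_LINKS; i++)
		toy_walk(objs, nr, objs[idx].links[i]);
}

/* Returns how many objects end up in the cruft pack. */
static size_t toy_collect_cruft(struct toy_object *objs, size_t nr,
				long expiration)
{
	size_t i, count = 0;

	assert(objs != NULL);
	for (i = 0; i < nr; i++)
		objs[i].packed = 0;

	/* Tips are the objects outside kept packs that are still recent. */
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > expiration)
			toy_walk(objs, nr, (int)i);

	for (i = 0; i < nr; i++)
		count += (size_t)objs[i].packed;
	return count;
}
```

Note that an old object reached from a recent tip is packed without its
'mtime' being touched, which matches the observation above that a later
repack will rediscover it through the same traversal.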

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1ef333717d..fcac0b5c91 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.
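
For readers following write_cruft_pack() in the diff below, the three
kinds of lines it feeds to `git pack-objects --cruft` on stdin can be
sketched on their own. The helper and enum names here are hypothetical
and exist only for illustration; the format strings are the ones used in
the patch. As far as I can tell from read_cruft_objects(), a leading "-"
marks a pack to be discarded and every other line names a pack that is
treated as kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical classification of the packs that repack knows about. */
enum toy_pack_kind {
	TOY_NEW,      /* just written by this repack */
	TOY_EXISTING, /* existing non-kept pack, listed with a leading "-" */
	TOY_KEPT      /* existing kept pack */
};

/*
 * Format one stdin line into 'out'. 'pack_prefix' is only used for
 * TOY_NEW. Returns the line length, or -1 if it does not fit.
 */
static int toy_format_line(char *out, size_t cap, enum toy_pack_kind kind,
			   const char *pack_prefix, const char *name)
{
	int n;

	assert(out != NULL && name != NULL);
	switch (kind) {
	case TOY_NEW:
		assert(pack_prefix != NULL);
		n = snprintf(out, cap, "%s-%s.pack\n", pack_prefix, name);
		break;
	case TOY_EXISTING:
		n = snprintf(out, cap, "-%s.pack\n", name);
		break;
	case TOY_KEPT:
		n = snprintf(out, cap, "%s.pack\n", name);
		break;
	default:
		return -1;
	}
	if (n < 0 || (size_t)n >= cap)
		return -1;
	return n;
}
```

With a made-up prefix ".tmp-1-pack" and name "abc", a new pack is
written as ".tmp-1-pack-abc.pack", while an existing pack named
"pack-456" becomes "-pack-456.pack".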

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 106 +++++++++++-
 t/t5328-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 320 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index 2c3c5d93f8..f80e975a47 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index f908f7d5dd..f7fb88bcf1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -300,9 +307,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -339,6 +343,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -600,6 +606,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -616,7 +683,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -632,6 +698,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -684,6 +755,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -767,7 +847,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -829,6 +910,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 939cdc297a..06c550c958 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5328-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index f7fb88bcf1..d61c78e94e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -689,6 +698,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -743,7 +753,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -918,7 +928,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 06c550c958..e4744e4465 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
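
As a rough illustration of treating the `util` pointer as a small bit
field, consider the sketch below. `item_sketch`, `set_flag()` and
`has_flag()` are hypothetical helpers for this example only; the patch
itself open-codes the casts at each call site.

```c
#include <stdint.h>

#define DELETE_PACK 1

/* Hypothetical stand-in for a string_list_item; only `util` matters here. */
struct item_sketch {
	const char *string;
	void *util;
};

/* OR a flag into the pointer-sized field, preserving bits already set. */
static void set_flag(struct item_sketch *item, uintptr_t flag)
{
	item->util = (void *)((uintptr_t)item->util | flag);
}

/* Test a single bit instead of comparing the whole pointer against NULL. */
static int has_flag(const struct item_sketch *item, uintptr_t flag)
{
	return ((uintptr_t)item->util & flag) != 0;
}
```

Testing individual bits rather than the truthiness of `util` is what
lets a second, independent state share the field without being
mistaken for "delete this pack".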

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index d61c78e94e..afa4d51a22 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -561,7 +563,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1000,7 +1002,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1024,7 +1027,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.35.1.73.gccc5557600


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when doing a
geometric repack. Since the "make that object reachable" step doesn't
necessarily mean that we'll create a new copy of that object in one of
the packs that will get rolled up as part of a geometric repack, it's
possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
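
The selection added for the geometric case can be modeled roughly as
follows. This is a simplified sketch under stated assumptions:
`pack_sketch` and `cruft_packs_for_midx()` are made-up names, flags are
kept in a plain integer rather than in a `util` pointer, and the packs
retained by the geometric repack itself are assumed to be handled
elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/* Hypothetical record for an existing non-kept pack. */
struct pack_sketch {
	const char *name;
	uintptr_t flags;
};

/*
 * Geometric case only: collect the names of cruft packs so they can be
 * added to the MIDX next to the packs the geometric repack retains.
 * Returns the number of names written to `out`, which must have room
 * for `nr` entries.
 */
static size_t cruft_packs_for_midx(const struct pack_sketch *packs, size_t nr,
				   const char **out)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		if (!(packs[i].flags & CRUFT_PACK))
			continue; /* non-cruft packs are not added by this loop */
		out[n++] = packs[i].name;
	}
	return n;
}
```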

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5328-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index afa4d51a22..59b60cd309 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -561,6 +565,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index e4744e4465..13158e4ab7 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
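
How `git gc` picks the repack mode can be sketched as below. This is a
simplified model of the decision only: `repack_all_options()` is a
hypothetical helper that formats the chosen options into one string,
whereas the real code pushes them onto an argument vector and also
deals with kept packs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write the repack options implied by `prune_expire` (may be NULL) and
 * the cruft setting into `buf`.
 */
static void repack_all_options(char *buf, size_t len,
			       const char *prune_expire, int cruft_packs)
{
	if (prune_expire && !strcmp(prune_expire, "now")) {
		/* nothing is kept around, so there is nothing to store as cruft */
		snprintf(buf, len, "-a");
	} else if (cruft_packs) {
		if (prune_expire)
			snprintf(buf, len, "--cruft --cruft-expiration=%s",
				 prune_expire);
		else
			snprintf(buf, len, "--cruft");
	} else {
		if (prune_expire)
			snprintf(buf, len, "-A --unpack-unreachable=%s",
				 prune_expire);
		else
			snprintf(buf, len, "-A");
	}
}
```

Note that an expiration of "now" takes the `-a` branch even when cruft
packs are requested, which fits the new test expecting the `.mtimes`
file to be gone after `git gc --cruft --prune=now`.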

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5328-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them individually as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index ffaf0daf5d..11f5150234 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -153,6 +154,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -332,7 +334,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -552,6 +558,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -671,6 +678,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 13158e4ab7..3910e186ef 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.35.1.73.gccc5557600


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file
which may incur some wasted effort, especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, force the freshening code to avoid an optimizing write by
writing out the object loose and letting it pick up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
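
The resulting decision can be modeled with the sketch below.
`cruft_pack_sketch` and `freshen_packed_sketch()` are hypothetical
names, a NULL argument stands for "object not found in any pack", and
the step that touches the pack file is assumed to succeed. A return of
0 means the caller goes on to write a loose copy, which then carries a
current mtime.

```c
#include <stddef.h>

/* Hypothetical stand-in for the pack fields the freshening check reads. */
struct cruft_pack_sketch {
	int is_cruft;
	int freshened;
};

/*
 * Return 1 if the packed copy counts as fresh and no loose write is
 * needed, or 0 if the caller should write the object out loose.
 */
static int freshen_packed_sketch(struct cruft_pack_sketch *p)
{
	if (!p)
		return 0; /* not packed at all */
	if (p->is_cruft)
		return 0; /* never freshen via a cruft pack; go loose instead */
	if (p->freshened)
		return 1; /* pack already touched during this process */
	/* assumption: updating the pack's mtime succeeds here */
	p->freshened = 1;
	return 1;
}
```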

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5328-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index e80da1368d..65b8df7fb6 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1989,6 +1989,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 3910e186ef..4681558612 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-02  7:42     ` Junio C Hamano
  2022-03-02 15:54       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2022-03-02  7:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

>  builtin/pack-objects.c        |  84 +++++++++++++++++++-
>  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
>  2 files changed, 226 insertions(+), 1 deletion(-)

I'd renumber this to 5329, as the latest iteration of generation
number v2 series took 5328, while queuing.



^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  7:42     ` Junio C Hamano
@ 2022-03-02 15:54       ` Taylor Blau
  2022-03-02 19:57         ` Derrick Stolee
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-02 15:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, tytso, derrickstolee, larsxschneider

On Tue, Mar 01, 2022 at 11:42:57PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >  builtin/pack-objects.c        |  84 +++++++++++++++++++-
> >  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 226 insertions(+), 1 deletion(-)
>
> I'd renumber this to 5329, as the latest iteration of generation
> number v2 series took 5328, while queuing.

Oops. I had scanned that series, but glossed over the new test number.

Thanks for renaming (I'll do the same, in case we end up accumulating
more reroll-able bits).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02 15:54       ` Taylor Blau
@ 2022-03-02 19:57         ` Derrick Stolee
  0 siblings, 0 replies; 200+ messages in thread
From: Derrick Stolee @ 2022-03-02 19:57 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano; +Cc: git, tytso, larsxschneider

On 3/2/2022 10:54 AM, Taylor Blau wrote:
> On Tue, Mar 01, 2022 at 11:42:57PM -0800, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>>>  builtin/pack-objects.c        |  84 +++++++++++++++++++-
>>>  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
>>>  2 files changed, 226 insertions(+), 1 deletion(-)
>>
>> I'd renumber this to 5329, as the latest iteration of generation
>> number v2 series took 5328, while queuing.
> 
> Oops. I had scanned that series, but glossed over the new test number.
> 
> Thanks for renaming (I'll do the same, in case we end up accumulating
> more reroll-able bits).

Sorry for the collision! Had I realized this was already used here,
I would have changed the number myself.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-02 20:19     ` Derrick Stolee
  2022-03-02 21:28       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:19 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:58 PM, Taylor Blau wrote:

> diff --git a/reachable.h b/reachable.h
> index 5df932ad8f..b776761baa 100644
> --- a/reachable.h
> +++ b/reachable.h
> @@ -1,11 +1,18 @@
>  #ifndef REACHEABLE_H
>  #define REACHEABLE_H
>  
> +#include "object.h"
> +

Nit: just realized this include could be replaced by a struct
declaration:

>  struct progress;
>  struct rev_info;

Like these. 'struct object;' should be enough for the typedef.
>  
> +typedef void report_recent_object_fn(const struct object *, struct packed_git *,
> +				     off_t, time_t);
> +
>  int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
> -					   timestamp_t timestamp);
> +					   timestamp_t timestamp,
> +					   report_recent_object_fn cb,
> +					   int ignore_in_core_kept_packs);

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-02 20:22     ` Derrick Stolee
  2022-03-02 21:33       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:22 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:58 PM, Taylor Blau wrote:
> To store the individual mtimes of objects in a cruft pack, introduce a
> new `.mtimes` format that can optionally accompany a single pack in the
> repository.
> 
> The format is defined in Documentation/technical/pack-format.txt, and
> stores a 4-byte network order timestamp for each object in name (index)
> order.
> 
> This patch prepares for cruft packs by defining the `.mtimes` format,
> and introducing a basic API that callers can use to read out individual
> mtimes.
...
> +int load_pack_mtimes(struct packed_git *p)
> +{
> +	char *mtimes_name = NULL;
> +	int ret = 0;
> +
> +	if (!p->is_cruft)
> +		return ret; /* not a cruft pack */
> +	if (p->mtimes_map)
> +		return ret; /* already loaded */
> +
> +	ret = open_pack_index(p);
> +	if (ret < 0)
> +		goto cleanup;
> +
> +	mtimes_name = pack_mtimes_filename(p);
> +	ret = load_pack_mtimes_file(mtimes_name,
> +				    p->num_objects,
> +				    &p->mtimes_map,
> +				    &p->mtimes_size);
> +	if (ret)
> +		goto cleanup;

This looked odd to me, so I supposed that you had some code
that would be inserted between this 'goto cleanup' and the
'cleanup:' label, but I did not find such an insertion in
the remaining patches. This 'if' can be deleted.

> +cleanup:
> +	free(mtimes_name);
> +	return ret;
> +}

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 00/17] cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-03-02 20:23   ` Derrick Stolee
  2022-03-02 21:36     ` Taylor Blau
  17 siblings, 1 reply; 200+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:23 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:57 PM, Taylor Blau wrote:
> Here is a reroll of my series to implement "cruft packs", a pack which
> stores accumulated unreachable objects, along with a new ".mtimes" file
> which tracks each object's last known modification time.
> 
> This was on the list towards the end of 2021[1], and I have been
> accumulating small changes to it locally for a couple of months now.
> Major changes since last time include:
> 
>   - Clearer documentation and commit message(s) to better illustrate how
>     the feature works and is supposed to be used.
> 
>   - Some minor documentation updates to pack-format.txt, which make some
>     ambiguous details more explicit.
> 
>   - Minor code movement / tweaks to make things easier to read, ensure
>     that functions aren't introduced in patches before they are used /
>     etc.
> 
>   - Moved the new test script to t5328 (instead of t5327, which happens
>     to be taken up by a new MIDX bitmap-related test), and purged it of
>     all "rm -fr .git/logs" (replacing them with "git reflog expire
>     --all --expire=all" instead).
> 
>   - A new test which fixes a bug where loose objects which have copies
>     that appear in a cruft pack would not get accumulated when doing a
>     `--geometric` repack.
> 
> For convenience, a range-diff is below. Thanks in advance for taking
> another look!

It had been a while since my last read, so I read the patches
in full one more time. I found a couple nitpicks, but otherwise
everything is looking good.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 200+ messages in thread

* Re: [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02 20:19     ` Derrick Stolee
@ 2022-03-02 21:28       ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02 21:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:19:57PM -0500, Derrick Stolee wrote:
> Nit: just realized this include could be replaced by a struct
> declaration:
>
> >  struct progress;
> >  struct rev_info;
>
> Like these. 'struct object;' should be enough for the typedef.

Good catch. We would need one for the packed_git struct, too. I don't
have a strong opinion about including object.h or not, though needing
two stubs pushes me slightly in the direction of leaving the include
alone.
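
For anyone following along, the trade-off looks roughly like this: a header
that only mentions a struct through pointers can forward-declare it instead
of pulling in the defining header. Below is a minimal standalone sketch, not
Git code; `widget`, `crate`, `report_fn`, and `count_reports` are names made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations: enough for pointers, no full definition needed. */
struct widget;
struct crate;

/* A callback type that only deals in pointers to the incomplete types. */
typedef void report_fn(const struct widget *, struct crate *, long when);

static int reports_seen;

/* A trivial callback that just counts how often it was called. */
static void count_report(const struct widget *w, struct crate *c, long when)
{
	(void)w;
	(void)c;
	(void)when;
	reports_seen++;
}

/* Invoke the callback `n` times and return how many reports were seen. */
static int count_reports(report_fn *fn, int n)
{
	int i;

	assert(fn != NULL);
	reports_seen = 0;
	for (i = 0; i < n; i++)
		fn(NULL, NULL, (long)i);
	return reports_seen;
}
```

Each incomplete type used this way costs one stub line, which is why needing
two of them makes keeping the single include a reasonable choice as well.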

Thanks,
Taylor


* Re: [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02 20:22     ` Derrick Stolee
@ 2022-03-02 21:33       ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02 21:33 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:22:18PM -0500, Derrick Stolee wrote:
> > +	ret = load_pack_mtimes_file(mtimes_name,
> > +				    p->num_objects,
> > +				    &p->mtimes_map,
> > +				    &p->mtimes_size);
> > +	if (ret)
> > +		goto cleanup;
>
> This looked odd to me, so I supposed that you had some code
> that would be inserted between this 'goto cleanup' and the
> 'cleanup:' label, but I did not find such an insertion in
> the remaining patches. This 'if' can be deleted.

Thanks for spotting. My gut was that there must be something in the
range-diff between this and the previous round, but there isn't. So this
code has always been there.

It likely comes from load_pack_revindex_from_disk(), which assigns the
`revindex_data` member of `struct packed_git` after calling
load_revindex_from_disk(), but only if it returned zero.

We don't have to assign mtimes_data here (it doesn't exist), because all
of our reads into mtimes_map are offset by 3 to adjust for the width of
the header.
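
To make the "offset by 3" concrete: the header is three 4-byte words
(signature, version, hash id), so table entry `pos` sits at word index
`pos + 3` of the mapping. Here is a self-contained sketch that assumes the
whole file is already in memory; `read_be32` and `nth_mtime` are illustrative
stand-ins, not the functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_WORDS 3 /* signature, version, hash id: 12 bytes in total */

/* Decode a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the mtime of the object at index position `pos`. */
static uint32_t nth_mtime(const unsigned char *map, uint32_t num_objects,
			  uint32_t pos)
{
	assert(map != NULL);
	assert(pos < num_objects);
	/* Skip the header words, then step to the requested entry. */
	return read_be32(map + 4 * ((size_t)pos + HEADER_WORDS));
}
```

Because every read skips the header this way, there is no need for a second
pointer that points past it.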

Anyway, we don't need this if statement here, so I'll drop it.

Thanks,
Taylor


* Re: [PATCH v2 00/17] cruft packs
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
@ 2022-03-02 21:36     ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-02 21:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:23:05PM -0500, Derrick Stolee wrote:
> > For convenience, a range-diff is below. Thanks in advance for taking
> > another look!
>
> It had been a while since my last read, so I read the patches
> in full one more time. I found a couple nitpicks, but otherwise
> everything is looking good.

Thanks for reading! I took both of your suggestions (along with Junio's
to rename the test script to t5329 to avoid a clash with your series)
and will re-submit a tiny reroll shortly.

Thanks,
Taylor


* [PATCH v3 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (18 preceding siblings ...)
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
@ 2022-03-03  0:20 ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (17 more replies)
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  21 siblings, 18 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Here is a small reroll of my series to implement "cruft packs", based on
Stolee's review.

The changes here are minor, and mostly are limited to removing a
redundant "if" statement, avoiding an unnecessary header include, and
moving the tests (again!) to t5329's territory.

As always, a range-diff is below. Thanks in advance for taking another
look!

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  97 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 183 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1807 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v2:
 -:  ---------- >  1:  784ee7e0ee Documentation/technical: add cruft-packs.txt
 1:  101b34660c !  2:  1ec754ad1b pack-mtimes: support reading .mtimes files
    @@ pack-mtimes.c (new)
     +				    p->num_objects,
     +				    &p->mtimes_map,
     +				    &p->mtimes_size);
    -+	if (ret)
    -+		goto cleanup;
    -+
     +cleanup:
     +	free(mtimes_name);
     +	return ret;
 2:  a94d7dfeb3 =  3:  0f5d6d6492 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 3:  1e0ed363ae =  4:  135a07276b chunk-format.h: extract oid_version()
 4:  5236490688 =  5:  0600503856 pack-mtimes: support writing pack .mtimes files
 5:  78313bc441 =  6:  4780c8437b t/helper: add 'pack-mtimes' test-tool
 6:  142098668d =  7:  33862a07c9 builtin/pack-objects.c: return from create_object_entry()
 7:  2517a6be3d !  8:  22705e4887 builtin/pack-objects.c: --cruft without expiration
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
      /*
     
    - ## t/t5328-pack-objects-cruft.sh (new) ##
    + ## t/t5329-pack-objects-cruft.sh (new) ##
     @@
     +#!/bin/sh
     +
 8:  6f0e84273f =  9:  cebb30b667 reachable: add options to add_unseen_recent_objects_to_traversal
 9:  a8bde361f9 = 10:  fa4de8859d reachable: report precise timestamps from objects in cruft packs
10:  d68ce28132 ! 11:  92318f8700 builtin/pack-objects.c: --cruft with expiration
    @@ builtin/pack-objects.c: static void read_cruft_objects(void)
      		enumerate_cruft_objects();
      
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: basic_cruft_pack_tests () {
    + ## reachable.h ##
    +@@
    + #ifndef REACHEABLE_H
    + #define REACHEABLE_H
    + 
    +-#include "object.h"
    +-
    + struct progress;
    + struct rev_info;
    ++struct object;
    ++struct packed_git;
    + 
    + typedef void report_recent_object_fn(const struct object *, struct packed_git *,
    + 				     off_t, time_t);
    +
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: basic_cruft_pack_tests () {
      }
      
      basic_cruft_pack_tests never
11:  e5317cd472 ! 12:  1e94b33cb4 builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
      	}
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
      	)
      '
      
12:  b548dbbf80 ! 13:  9cfcd123bd builtin/repack.c: allow configuring cruft pack generation
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				       &existing_kept_packs);
      		if (ret)
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
      	)
      '
      
13:  e6eee7f15c = 14:  1a58807df0 builtin/repack.c: use named flags for existing_packs
14:  b09dbc9fe5 ! 15:  ed05cf536b builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      		for_each_string_list_item(item, existing_nonkept_packs) {
      			if ((uintptr_t)item->util & DELETE_PACK)
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
      	)
      '
      
15:  7a21ae1494 ! 16:  1d5f334138 builtin/gc.c: conditionally avoid pruning objects via loose
    @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      			if (quiet)
      				strvec_push(&prune, "--no-progress");
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
      	)
      '
      
16:  b729b80963 ! 17:  f74b425872 sha1-file.c: don't freshen cruft packs
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
      		return 1;
      	if (!freshen_file(e.p->pack_name))
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
      	)
      '
      
-- 
2.35.1.73.gccc5557600


* [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-07 18:03     ` Jonathan Nieder
  2022-03-03  0:20   ` [PATCH v3 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (16 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..2c3c5d93f8
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,97 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, using
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking as a traversal tip any object which (a) is
+     not contained in a kept-pack, and (b) has an mtime within the grace
+     period.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.35.1.73.gccc5557600



* [PATCH v3 02/17] pack-mtimes: support reading .mtimes files
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.
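
As a rough illustration of the layout (a sketch only, not the code in this
patch): a well-formed `.mtimes` file is a 12-byte header, one 4-byte entry
per object, and a trailer of two checksums. The helper names below
(`hash_rawsz`, `expected_mtimes_size`) are hypothetical, and the raw hash
lengths are the usual 20 bytes for SHA-1 and 32 bytes for SHA-256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTIMES_HEADER_SIZE 12 /* magic + version + hash id */

/* Raw hash length in bytes for the hash ids used by the format. */
static size_t hash_rawsz(uint32_t hash_id)
{
	assert(hash_id == 1 || hash_id == 2);
	return hash_id == 1 ? 20 : 32; /* SHA-1 : SHA-256 */
}

/* Size of a well-formed .mtimes file covering `num_objects` objects. */
static size_t expected_mtimes_size(uint32_t num_objects, uint32_t hash_id)
{
	return MTIMES_HEADER_SIZE +
	       sizeof(uint32_t) * (size_t)num_objects +
	       2 * hash_rawsz(hash_id); /* pack checksum + file checksum */
}
```

A reader can compare this figure against the on-disk size before trusting
any entry in the table, which is the spirit of the "corrupt" check in the
loader.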

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 6f0b4b775f..1b186f4fd7 100644
--- a/Makefile
+++ b/Makefile
@@ -959,6 +959,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a75..f908f7d5dd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 6f89482df0..9b227661f2 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..46ad584af1
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,126 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 178e611f09..385970cb7b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (13 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is
one copy of this function for writing each of the commit-graph and
multi-pack-index files, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).
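
As a minimal, standalone sketch of that widening (the `sketch_*` names
and the enum are hypothetical stand-ins, not git's own identifiers, and
the unknown case returns 0 here where the real function calls die()):

```c
#include <stdint.h>

/* Hypothetical stand-in for git's hash algorithm identifiers. */
enum sketch_hash_algo {
	SKETCH_HASH_UNKNOWN,
	SKETCH_HASH_SHA1,
	SKETCH_HASH_SHA256
};

/* Returns 1 for SHA-1, 2 for SHA-256, and 0 where the real code dies. */
uint8_t sketch_oid_version(enum sketch_hash_algo algo)
{
	switch (algo) {
	case SKETCH_HASH_SHA1:
		return 1;
	case SKETCH_HASH_SHA256:
		return 2;
	default:
		return 0;
	}
}

/*
 * A caller that needs a 4-byte field, like the .rev header writer.
 * The uint8_t result converts implicitly to uint32_t without a cast.
 */
uint32_t sketch_rev_header_field(enum sketch_hash_algo algo)
{
	uint32_t field = sketch_oid_version(algo);
	return field;
}
```

Since every valid version fits in a single byte, the conversion to the
wider type cannot lose information, which is why returning the narrower
type is safe for all three callers.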

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 265c010122..f678d2c4a1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1911,7 +1899,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 865170bad0..65e670c5e2 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.35.1.73.gccc5557600


* [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
  2022-03-03  0:20   ` [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (12 subsequent siblings)
  17 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag (`WRITE_MTIMES`) and passing a `struct
packing_data` whose `cruft_mtime` array holds the timestamp
corresponding to each object in the pack.
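
To make the byte layout concrete, here is a rough sketch that
serializes the same three sections (header, per-object mtimes, trailer)
into a memory buffer instead of a hashfile. The `sketch_*` names are
hypothetical, the signature and version values are assumptions (the
real constants come from pack-mtimes.h), and the sketch omits the
file's own trailing checksum that finalize_hashfile() appends:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; the real constants are defined in pack-mtimes.h. */
#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u
#define SKETCH_MTIMES_VERSION 1u

/* Store a 32-bit value in network (big-endian) byte order. */
static size_t sketch_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (v >> 24) & 0xff;
	out[1] = (v >> 16) & 0xff;
	out[2] = (v >> 8) & 0xff;
	out[3] = v & 0xff;
	return 4;
}

/*
 * Writes header, mtimes (already in index order), and the pack
 * checksum. Returns the number of bytes written, or 0 if "out" is
 * too small to hold them.
 */
size_t sketch_write_mtimes(unsigned char *out, size_t cap,
			   uint32_t hash_id,
			   const uint32_t *mtimes, uint32_t nr,
			   const unsigned char *pack_hash, size_t hash_len)
{
	size_t need = 12 + 4 * (size_t)nr + hash_len;
	size_t pos = 0;
	uint32_t i;

	if (cap < need)
		return 0;

	/* Header: three 4-byte big-endian fields. */
	pos += sketch_put_be32(out + pos, SKETCH_MTIMES_SIGNATURE);
	pos += sketch_put_be32(out + pos, SKETCH_MTIMES_VERSION);
	pos += sketch_put_be32(out + pos, hash_id);

	/* One 4-byte mtime per object. */
	for (i = 0; i < nr; i++)
		pos += sketch_put_be32(out + pos, mtimes[i]);

	/* Trailer: checksum of the pack this file describes. */
	memcpy(out + pos, pack_hash, hash_len);
	return pos + hash_len;
}
```

With this layout, the mtime of the object at index position `i` sits at
byte offset `12 + 4 * i`, so a reader needs nothing beyond the pack
index to find it.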

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..270280c4df 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +546,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +559,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.35.1.73.gccc5557600


* [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.
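
The helper finds the right pack by turning each pack's name into the
`.mtimes` name it would have and comparing that against its argument.
A standalone sketch of that name mapping, using plain C strings rather
than strbufs (`sketch_mtimes_name` is a hypothetical name, and unlike
the helper it rejects names that lack a `.pack` suffix):

```c
#include <stddef.h>
#include <string.h>

/*
 * Writes "<basename without .pack>.mtimes" into "out".
 * Returns 0 on success, or -1 if "pack_path" does not end in ".pack"
 * or the result does not fit in "cap" bytes.
 */
int sketch_mtimes_name(const char *pack_path, char *out, size_t cap)
{
	const char *base = strrchr(pack_path, '/');
	size_t len;

	/* Keep only the last path component. */
	base = base ? base + 1 : pack_path;
	len = strlen(base);

	if (len < 5 || strcmp(base + len - 5, ".pack"))
		return -1;
	len -= 5;

	if (len + strlen(".mtimes") + 1 > cap)
		return -1;

	memcpy(out, base, len);
	strcpy(out + len, ".mtimes");
	return 0;
}
```

For example, `objects/pack/pack-1234.pack` maps to `pack-1234.mtimes`,
which is the form the test-tool expects on its command line.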

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index 1b186f4fd7..5c0ed1ade7 100644
--- a/Makefile
+++ b/Makefile
@@ -727,6 +727,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index e6ec69cf32..7d472b31fd 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -47,6 +47,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 20756eefdd..0ac4f32955 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.35.1.73.gccc5557600


* [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 385970cb7b..3f08a3c63a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.35.1.73.gccc5557600


* [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
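
The same object may be seen more than once while accumulating (for
example, loose and in a pack about to be removed), and each copy
carries its own mtime: the `.mtimes` entry for an existing cruft pack,
the pack's own mtime for an ordinary pack, or the file's mtime for a
loose object. A minimal sketch of how those are reconciled, assuming
(as add_cruft_object_entry() in the diff below does) that the newest
value wins; the `sketch_*` names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical stand-in for one entry in the packing list. */
struct sketch_entry {
	uint32_t mtime; /* 0 until the first copy is recorded */
};

/* Record one more sighting of the object, keeping the newest mtime. */
void sketch_note_mtime(struct sketch_entry *e, uint32_t mtime)
{
	if (mtime > e->mtime)
		e->mtime = mtime;
}
```

Keeping the maximum means an object is treated as being as fresh as
its most recently touched copy, which is the conservative choice once
expiration enters the picture.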

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are not going to be removed are marked as kept
    in-core. The packs which are going to be removed (we'll call these
    the redundant ones) are not.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
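
As a rough sketch of that selection rule, following
read_cruft_objects() in the diff below: input lines with a leading `-`
name packs to discard, other lines name surviving packs, and packs the
caller never mentioned are treated like surviving ones. The `sketch_*`
names are hypothetical, and the sketch does not handle a pack listed
both ways:

```c
#include <stddef.h>
#include <string.h>

enum sketch_pack_state {
	SKETCH_PACK_KEPT,   /* listed without a prefix, or not listed */
	SKETCH_PACK_DISCARD /* listed with a leading '-' */
};

/* Decide how a pack known to the process is treated. */
enum sketch_pack_state sketch_classify_pack(const char *pack_name,
					    const char **lines,
					    size_t nr_lines)
{
	size_t i;

	for (i = 0; i < nr_lines; i++) {
		const char *line = lines[i];

		if (!*line)
			continue; /* blank input lines are skipped */
		if (*line == '-') {
			if (!strcmp(line + 1, pack_name))
				return SKETCH_PACK_DISCARD;
		} else if (!strcmp(line, pack_name)) {
			return SKETCH_PACK_KEPT;
		}
	}

	/* Unknown to the caller: keep it out of the cruft pack. */
	return SKETCH_PACK_KEPT;
}

/*
 * An object goes into the cruft pack only if none of the packs that
 * contain it is kept. A loose-only object (nr == 0) always qualifies.
 */
int sketch_wants_object(const enum sketch_pack_state *containing,
			size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (containing[i] == SKETCH_PACK_KEPT)
			return 0;
	return 1;
}
```

Note that an object present in both a discarded pack and a kept pack
is left out: the kept copy already preserves it.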

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5329-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5329-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3f08a3c63a..5ba4fc9c2c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4062,7 +4244,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4089,6 +4271,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4145,7 +4336,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4153,6 +4344,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else {
diff --git a/object-file.c b/object-file.c
index 8be57f48de..e80da1368d 100644
--- a/object-file.c
+++ b/object-file.c
@@ -996,7 +996,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 9b227661f2..6b025dc670 100644
--- a/object-store.h
+++ b/object-store.h
@@ -334,6 +334,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5329-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in-core. This option will be exercised
implicitly by a future patch.
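
As a rough illustration of the callback shape described above, here is a
minimal standalone sketch. It is not the code from this patch: every
toy_* name is hypothetical, 'long' stands in for off_t, and the kept-pack
check is simplified to the object's own pack (the real check asks whether
the object appears in any in-core kept pack). The sketch also assumes that
"recent" means strictly newer than the cutoff.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for git's structures; not the real definitions. */
struct toy_pack {
	const char *name;
	int kept_in_core;
};

struct toy_object {
	const char *id;
	struct toy_pack *pack;	/* NULL for a loose object */
	long offset;		/* only meaningful when pack is non-NULL */
	time_t mtime;
};

/* Same argument order as report_recent_object_fn in the patch. */
typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack, long offset, time_t mtime);

/* Example callback: count how many reported objects came from a pack. */
size_t toy_nr_packed;

void toy_count_packed(const struct toy_object *obj, struct toy_pack *pack,
		      long offset, time_t mtime)
{
	(void)obj;
	(void)offset;
	(void)mtime;
	if (pack)
		toy_nr_packed++;
}

/*
 * Report every object newer than 'cutoff' through 'cb' (if non-NULL).
 * Loose objects are reported with a NULL pack and offset 0. When
 * 'ignore_kept' is set, objects living in a kept pack are skipped.
 * Returns the number of objects that passed both filters.
 */
size_t toy_add_recent(const struct toy_object *objs, size_t nr,
		      time_t cutoff, toy_report_fn *cb, int ignore_kept)
{
	size_t i, visited = 0;

	assert(objs || !nr);

	for (i = 0; i < nr; i++) {
		const struct toy_object *o = &objs[i];

		if (ignore_kept && o->pack && o->pack->kept_in_core)
			continue;
		if (o->mtime <= cutoff)
			continue;
		if (cb)
			cb(o, o->pack, o->pack ? o->offset : 0, o->mtime);
		visited++;
	}
	return visited;
}
```

Passing a NULL callback and a zero 'ignore_kept' in this sketch
corresponds to how the existing callers in this patch pass "NULL, 0" to
keep their current behavior.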

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ba4fc9c2c..1ef333717d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.35.1.73.gccc5557600



* [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
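
The lazy-loading behavior can be pictured with a small standalone sketch.
This is only an illustration under assumptions: the toy_* names and the
in-memory "on disk" table are hypothetical, and the real
load_pack_mtimes() parses a file rather than copying a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of a pack that matter here. */
struct toy_cruft_pack {
	int is_cruft;
	uint32_t pack_mtime;		/* mtime of the pack file itself */
	const uint32_t *mtimes;		/* NULL until loaded */
	const uint32_t *mtimes_on_disk;	/* pretend .mtimes contents */
	size_t nr;			/* number of objects in the pack */
	int loads;			/* how many times we did a real load */
};

/* Load the per-object table once; later calls return immediately. */
int toy_load_mtimes(struct toy_cruft_pack *p)
{
	if (p->mtimes)
		return 0;	/* already opened and parsed */
	if (!p->mtimes_on_disk)
		return -1;	/* no table available */
	p->mtimes = p->mtimes_on_disk;
	p->loads++;
	return 0;
}

/*
 * Store the mtime to use for the object at 'pos' in '*out'. Objects in
 * an ordinary pack inherit the pack's mtime; objects in a cruft pack use
 * their own entry from the table. Returns 0 on success, -1 on failure.
 */
int toy_object_mtime(struct toy_cruft_pack *p, size_t pos, uint32_t *out)
{
	assert(p && out && pos < p->nr);

	*out = p->pack_mtime;
	if (p->is_cruft) {
		if (toy_load_mtimes(p) < 0)
			return -1;
		*out = p->mtimes[pos];
	}
	return 0;
}
```

In this sketch only the first lookup against a cruft pack pays for the
load, which mirrors the "only the first call is slow" point above.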

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.35.1.73.gccc5557600



* [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever some
recent (i.e., non-prunable) unreachable object reaches them. We'll call
these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
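
The selection steps above can be sketched on a toy object graph. This is
a simplified model under assumptions, not the implementation in this
patch: the toy_* names are hypothetical, objects are array indices, and
the walk is a plain recursive traversal rather than a revision walk.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 4

/* Hypothetical object record; links are indices into the same array. */
struct toy_node {
	long mtime;
	int in_kept_pack;
	int nr_links;
	int links[TOY_MAX_LINKS];	/* objects this one references */
	int seen;			/* scratch: visited by the walk */
	int in_cruft;			/* output: goes into the cruft pack */
};

/* Add node 'i' and everything it reaches, halting at kept objects. */
void toy_walk(struct toy_node *g, int i)
{
	int k;

	if (g[i].in_kept_pack || g[i].seen)
		return;
	g[i].seen = 1;
	g[i].in_cruft = 1;	/* mtime is left alone, even if it is old */
	for (k = 0; k < g[i].nr_links; k++)
		toy_walk(g, g[i].links[k]);
}

/*
 * Mark which objects belong in the cruft pack: the tips are the objects
 * outside kept packs that are newer than 'cutoff', and anything they
 * reach is rescued. Returns the number of objects selected.
 */
int toy_select_cruft(struct toy_node *g, int nr, long cutoff)
{
	int i, count = 0;

	assert(g || !nr);

	for (i = 0; i < nr; i++) {
		g[i].seen = 0;
		g[i].in_cruft = 0;
	}
	for (i = 0; i < nr; i++)
		if (!g[i].in_kept_pack && g[i].mtime > cutoff)
			toy_walk(g, i);
	for (i = 0; i < nr; i++)
		count += g[i].in_cruft;
	return count;
}
```

Because a rescued object keeps its old mtime in this model, running the
selection again with the same cutoff rescues it again only while
something recent still reaches it, which is the property the paragraphs
above rely on.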

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 reachable.h                   |   4 +-
 t/t5329-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1ef333717d..fcac0b5c91 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/reachable.h b/reachable.h
index b776761baa..020a887b99 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,10 +1,10 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-#include "object.h"
-
 struct progress;
 struct rev_info;
+struct object;
+struct packed_git;
 
 typedef void report_recent_object_fn(const struct object *, struct packed_git *,
 				     off_t, time_t);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 106 +++++++++++-
 t/t5329-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 320 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index 2c3c5d93f8..f80e975a47 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index f908f7d5dd..f7fb88bcf1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -300,9 +307,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -339,6 +343,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -600,6 +606,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -616,7 +683,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -632,6 +698,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -684,6 +755,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -767,7 +847,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -829,6 +910,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 939cdc297a..06c550c958 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
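
The fallback order described above (a `repack.cruft*` setting wins,
otherwise the general pack-objects arguments apply) can be sketched as
follows. This is a minimal illustration, not the patch itself: the
struct is a trimmed copy of the one in builtin/repack.c, and
`inherit_po_args` is a hypothetical helper name (the patch performs
these assignments inline in cmd_repack()).

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-in for `struct pack_objects_args` in builtin/repack.c. */
struct pack_objects_args {
	const char *window;
	const char *window_memory;
	const char *depth;
	const char *threads;
	int quiet;
	int local;
};

/*
 * Hypothetical helper: any cruft parameter left unset by the
 * `repack.cruft*` configuration falls back to the general value.
 * `quiet` and `local` are always copied from the general arguments.
 */
void inherit_po_args(struct pack_objects_args *cruft,
		     const struct pack_objects_args *po)
{
	assert(cruft && po);

	if (!cruft->window)
		cruft->window = po->window;
	if (!cruft->window_memory)
		cruft->window_memory = po->window_memory;
	if (!cruft->depth)
		cruft->depth = po->depth;
	if (!cruft->threads)
		cruft->threads = po->threads;

	cruft->local = po->local;
	cruft->quiet = po->quiet;
}
```

With `repack.cruftWindow=2` and `--window=3`, the cruft pack is written
with a window of 2, which is what the 'cruft repack respects
repack.cruftWindow' test in this patch checks.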

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5329-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index f7fb88bcf1..d61c78e94e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -689,6 +698,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -743,7 +753,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -918,7 +928,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 06c550c958..e4744e4465 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
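
As a rough illustration of the idea (not code from the patch), bit
flags can be kept in a `void *` by round-tripping through `uintptr_t`.
The struct below is a minimal stand-in for git's
`struct string_list_item`, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DELETE_PACK 1

/* Minimal stand-in for git's `struct string_list_item`. */
struct string_list_item {
	char *string;
	void *util;
};

/* Hypothetical helper: set DELETE_PACK without disturbing other bits. */
void mark_for_deletion(struct string_list_item *item)
{
	assert(item);
	item->util = (void *)((uintptr_t)item->util | DELETE_PACK);
}

/* Hypothetical helper: test only the DELETE_PACK bit. */
int marked_for_deletion(const struct string_list_item *item)
{
	assert(item);
	return ((uintptr_t)item->util & DELETE_PACK) != 0;
}
```

Testing the specific bit, rather than the truthiness of `util`, is what
lets a second flag share the field later without being mistaken for
"delete this pack".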

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index d61c78e94e..afa4d51a22 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -561,7 +563,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1000,7 +1002,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((uintptr_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1024,7 +1027,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.35.1.73.gccc5557600



* [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

This happens because we don't include cruft packs in the MIDX when
doing a geometric repack. Since making that object reachable doesn't
necessarily mean that we'll create a new copy of that object in one of
the packs that will get rolled up as part of a geometric repack, it's
possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
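
The extra selection step can be sketched like this. It covers only the
new "also include cruft packs" loop, not the packs chosen by the
geometric progression itself. `struct existing_pack` and
`collect_cruft_packs` are hypothetical names for illustration; the
patch works on a `string_list` whose `util` pointer carries the flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/* Hypothetical stand-in for an entry of existing_nonkept_packs. */
struct existing_pack {
	const char *name;
	uintptr_t flags;
};

/*
 * Copy the names of all packs flagged CRUFT_PACK into `include`
 * (which must have room for `nr` entries) and return how many were
 * copied. DELETE_PACK is deliberately not consulted, mirroring the
 * patch's comment that a geometric repack is not ALL_INTO_ONE.
 */
size_t collect_cruft_packs(const struct existing_pack *packs, size_t nr,
			   const char **include)
{
	size_t i, n = 0;

	assert(packs && include);
	for (i = 0; i < nr; i++) {
		if (!(packs[i].flags & CRUFT_PACK))
			continue;
		include[n++] = packs[i].name;
	}
	return n;
}
```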

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5329-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index afa4d51a22..59b60cd309 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -561,6 +565,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index e4744e4465..13158e4ab7 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
  2022-03-03  1:29   ` [PATCH v3 00/17] " Derrick Stolee
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
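
The way `git gc` picks its repack mode after this patch can be
approximated as below. This is a simplified sketch of the branch order
in add_repack_all_option(), assuming nothing else influences the
choice; `repack_all_args` is a hypothetical helper that formats the
arguments into a buffer instead of pushing them onto a strvec.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the repack arguments that would be chosen
 * for the given expiration and cruft setting into `out`, separated by
 * spaces. Returns the value of snprintf().
 */
int repack_all_args(const char *prune_expire, int cruft_packs,
		    char *out, size_t outlen)
{
	assert(out && outlen);

	/* Pruning "now" keeps nothing unreachable, so plain -a suffices. */
	if (prune_expire && !strcmp(prune_expire, "now"))
		return snprintf(out, outlen, "-a");

	if (cruft_packs) {
		if (prune_expire)
			return snprintf(out, outlen,
					"--cruft --cruft-expiration=%s",
					prune_expire);
		return snprintf(out, outlen, "--cruft");
	}

	if (prune_expire)
		return snprintf(out, outlen,
				"-A --unpack-unreachable=%s", prune_expire);
	return snprintf(out, outlen, "-A");
}
```

Note that `--prune=now` takes precedence over `--cruft`: in that case
no cruft pack is requested at all, which is consistent with the test
below expecting the `.mtimes` file to be gone after
`git gc --cruft --prune=now`.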

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5329-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index ffaf0daf5d..11f5150234 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -153,6 +154,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -332,7 +334,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -552,6 +558,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -671,6 +678,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 13158e4ab7..3910e186ef 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.35.1.73.gccc5557600



* [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  1:29   ` [PATCH v3 00/17] " Derrick Stolee
  17 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file,
which may incur some wasted effort, especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, make the freshening code treat objects in a cruft pack as not
freshened, so that the object is written out loose and picks up a
current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
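
As a rough illustration only (this is not the code in this patch), the
decision can be modeled in isolation. The `struct pack_model` type and
both function names are hypothetical; only the `is_cruft` and
`freshened` fields mirror what the hunk below touches, and the model
ignores the case where a loose copy already exists:

```c
#include <stddef.h>

/* Hypothetical stand-in for the two pack fields used by the hunk below. */
struct pack_model {
	int is_cruft;   /* pack has an accompanying .mtimes file */
	int freshened;  /* the pack's own mtime was already bumped */
};

/*
 * Returns 1 if the packed copy was freshened, so no write is needed.
 * Returns 0 if the caller has to freshen the object some other way.
 * 'p' is NULL when the object is not found in any pack.
 */
static int freshen_packed_model(struct pack_model *p)
{
	if (!p)
		return 0;
	if (p->is_cruft)
		return 0; /* never bump a cruft pack; rely on a loose copy */
	p->freshened = 1; /* models a successful mtime update of the pack */
	return 1;
}

/* Returns 1 if the object should be written out loose. */
static int must_write_loose_model(struct pack_model *p)
{
	return !freshen_packed_model(p);
}
```

In this model an object found only in a cruft pack always results in a
loose write, which is what the new test checks for with
`test_path_is_file "$loose"`.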

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5329-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index e80da1368d..65b8df7fb6 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1989,6 +1989,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 3910e186ef..4681558612 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600


* Re: [PATCH v3 00/17] cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-03-03  1:29   ` Derrick Stolee
  17 siblings, 0 replies; 200+ messages in thread
From: Derrick Stolee @ 2022-03-03  1:29 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/2/2022 7:20 PM, Taylor Blau wrote:
> Here is a small reroll of my series to implement "cruft packs", based on
> Stolee's review.
> 
> The changes here are minor, and mostly are limited to removing a
> redundant "if" statement, avoiding an unnecessary header include, and
> moving the tests (again!) to t5329's territory.
> 
> As always, a range-diff is below. Thanks in advance for taking another
> look!

This range-diff satisfies my comments. Thanks!
-Stolee


* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
  2022-03-03 23:32       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-03 16:30 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Wed, Mar 02 2022, Taylor Blau wrote:

> Consolidate these into a single definition in chunk-format.h. It's not
> clear that this is the best header to define this function in, but it
> should do for now.
> [...]
> +
> +uint8_t oid_version(const struct git_hash_algo *algop)
> +{
> +	switch (hash_algo_by_ptr(algop)) {
> +	case GIT_HASH_SHA1:
> +		return 1;
> +	case GIT_HASH_SHA256:
> +		return 2;

Not a new issue, but I wonder why these don't return hash_algo_by_ptr
aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
straightforward & obvious code that avoids re-hardcoding the magic
constants:

	const int algo = hash_algo_by_ptr(algop);

	switch (algo) {
	case GIT_HASH_SHA1:
	case GIT_HASH_SHA256:
		return algo;
	default:
		[...]
	}

Probably best left as a later cleanup. FWIW I came up with this on top
of my designated init series:

diff --git a/hash.h b/hash.h
index 5d40368f18a..fd710ec6ae8 100644
--- a/hash.h
+++ b/hash.h
@@ -86,14 +86,18 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
  * field for being non-zero.  Use the name field for user-visible situations and
  * the format_id field for fixed-length fields on disk.
  */
-/* An unknown hash function. */
-#define GIT_HASH_UNKNOWN 0
-/* SHA-1 */
-#define GIT_HASH_SHA1 1
-/* SHA-256  */
-#define GIT_HASH_SHA256 2
-/* Number of algorithms supported (including unknown). */
-#define GIT_HASH_NALGOS (GIT_HASH_SHA256 + 1)
+enum git_hash_algo_name {
+	/* An unknown hash function. */
+	GIT_HASH_UNKNOWN,
+	/* SHA-1 */
+	GIT_HASH_SHA1,
+	GIT_HASH_SHA256,
+	/*
+	 * Number of algorithms supported (including unknown). This
+	 * must be kept last!
+	 */
+	GIT_HASH_NALGOS,
+};
 
 /* "sha1", big-endian */
 #define GIT_SHA1_FORMAT_ID 0x73686131
diff --git a/object-file.c b/object-file.c
index 5074471b471..f2d54a86969 100644
--- a/object-file.c
+++ b/object-file.c
@@ -166,7 +166,7 @@ static void git_hash_unknown_final_oid(struct object_id *oid, git_hash_ctx *ctx)
 }
 
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
-	{
+	[GIT_HASH_UNKNOWN] = {
 		.name = NULL,
 		.format_id = 0x00000000,
 		.rawsz = 0,
@@ -181,7 +181,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.empty_blob = NULL,
 		.null_oid = NULL,
 	},
-	{
+	[GIT_HASH_SHA1] = {
 		.name = "sha1",
 		.format_id = GIT_SHA1_FORMAT_ID,
 		.rawsz = GIT_SHA1_RAWSZ,
@@ -196,7 +196,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
 	},
-	{
+	[GIT_HASH_SHA256] = {
 		.name = "sha256",
 		.format_id = GIT_SHA256_FORMAT_ID,
 		.rawsz = GIT_SHA256_RAWSZ,


* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
  2022-03-03 23:35       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-03 16:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Wed, Mar 02 2022, Taylor Blau wrote:

> Now that the `.mtimes` format is defined, supplement the pack-write API
> to be able to conditionally write an `.mtimes` file along with a pack by
> setting an additional flag and passing an oidmap that contains the
> timestamps corresponding to each object in the pack.
> [...]
>  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
> diff --git a/pack.h b/pack.h
> index fd27cfdfd7..01d385903a 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -44,6 +44,7 @@ struct pack_idx_option {
>  #define WRITE_IDX_STRICT 02
>  #define WRITE_REV 04
>  #define WRITE_REV_VERIFY 010
> +#define WRITE_MTIMES 020
>  
>  	uint32_t version;
>  	uint32_t off32_limit;

Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
as 8|2, but there's no 8 there; ditto this new 020 that's the same as
1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.


* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
@ 2022-03-03 23:32       ` Taylor Blau
  2022-03-04  0:16         ` Junio C Hamano
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-03 23:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Thu, Mar 03, 2022 at 05:30:44PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Mar 02 2022, Taylor Blau wrote:
>
> > Consolidate these into a single definition in chunk-format.h. It's not
> > clear that this is the best header to define this function in, but it
> > should do for now.
> > [...]
> > +
> > +uint8_t oid_version(const struct git_hash_algo *algop)
> > +{
> > +	switch (hash_algo_by_ptr(algop)) {
> > +	case GIT_HASH_SHA1:
> > +		return 1;
> > +	case GIT_HASH_SHA256:
> > +		return 2;
>
> Not a new issue, but I wonder why these don't return hash_algo_by_ptr
> aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
> straightforward & obvious code that avoids re-hardcoding the magic
> constants:

Hmm. Certainly the value returned by hash_algo_by_ptr() works for SHA-1
and SHA-256, but writers may want to use a different value for future
hashes. Not that this couldn't be changed then, but my feeling is that
the existing code is clearer since it avoids the reader having to jump
to hash_algo_by_ptr()'s implementation to figure out what it returns.

Thanks,
Taylor


* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
@ 2022-03-03 23:35       ` Taylor Blau
  2022-03-04 10:40         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-03 23:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Thu, Mar 03, 2022 at 05:45:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Mar 02 2022, Taylor Blau wrote:
>
> > Now that the `.mtimes` format is defined, supplement the pack-write API
> > to be able to conditionally write an `.mtimes` file along with a pack by
> > setting an additional flag and passing an oidmap that contains the
> > timestamps corresponding to each object in the pack.
> > [...]
> >  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
> > diff --git a/pack.h b/pack.h
> > index fd27cfdfd7..01d385903a 100644
> > --- a/pack.h
> > +++ b/pack.h
> > @@ -44,6 +44,7 @@ struct pack_idx_option {
> >  #define WRITE_IDX_STRICT 02
> >  #define WRITE_REV 04
> >  #define WRITE_REV_VERIFY 010
> > +#define WRITE_MTIMES 020
> >
> >  	uint32_t version;
> >  	uint32_t off32_limit;
>
> Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
> prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
> as 8|2, but there's no 8 there., ditto this new 020 that's the same as
> 1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.

I'm not sure I understand. These are octals, so octal "20" (or decimal
16) just gives us bit 5 -- the next available -- by itself.
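
For what it's worth, a standalone check of that arithmetic (not part of
the patch) could look like the sketch below. The flag values are copied
from the hunk quoted above; the two helper functions are made up for
illustration:

```c
/* Flag values as they appear in the quoted pack.h hunk (octal literals). */
#define WRITE_IDX_STRICT 02
#define WRITE_REV 04
#define WRITE_REV_VERIFY 010
#define WRITE_MTIMES 020

/* Hypothetical helper: 1 if exactly one bit is set in 'v', else 0. */
static int is_single_bit(unsigned int v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical helper: zero-based position of the lowest set bit, or -1. */
static int lowest_bit_index(unsigned int v)
{
	int n = 0;

	if (!v)
		return -1;
	while (!(v & 1u)) {
		v >>= 1;
		n++;
	}
	return n;
}
```

Here `WRITE_MTIMES` is decimal 16, i.e. `1 << 4`: a single bit (the
fifth, counting from one) that does not overlap with any of the other
flags shown.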

Thanks,
Taylor


* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03 23:32       ` Taylor Blau
@ 2022-03-04  0:16         ` Junio C Hamano
  0 siblings, 0 replies; 200+ messages in thread
From: Junio C Hamano @ 2022-03-04  0:16 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, tytso,
	derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> On Thu, Mar 03, 2022 at 05:30:44PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Mar 02 2022, Taylor Blau wrote:
>>
>> > Consolidate these into a single definition in chunk-format.h. It's not
>> > clear that this is the best header to define this function in, but it
>> > should do for now.
>> > [...]
>> > +
>> > +uint8_t oid_version(const struct git_hash_algo *algop)
>> > +{
>> > +	switch (hash_algo_by_ptr(algop)) {
>> > +	case GIT_HASH_SHA1:
>> > +		return 1;
>> > +	case GIT_HASH_SHA256:
>> > +		return 2;
>>
>> Not a new issue, but I wonder why these don't return hash_algo_by_ptr
>> aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
>> straightforward & obvious code that avoids re-hardcoding the magic
>> constants:
>
> Hmm. Certainly the value returned by hash_algo_by_ptr() works for SHA-1
> and SHA-256, but writers may want to use a different value for future
> hashes. Not that this couldn't be changed then, but my feeling is that
> the existing code is clearer since it avoids the reader having to jump
> to hash_algo_by_ptr()'s implementation to figure out what it returns.

If we promise that everywhere in file formats where we identify what
hash is used, we write "1" for SHA1 and "2" for SHA256, it would be
natural to define GIT_HASH_SHA1 to "1" and GIT_HASH_SHA256 to "2".

And readers do not have to "figure out" anything if there is a clearly
written guideline on how to represent the hash used in file formats.  As
written, a reader who -assumes- such a guideline is there must
figure out from hash.h that GIT_HASH_SHA1 is 1 and GIT_HASH_SHA256
is 2 to be convinced that the above code is correct.

Now, hash.h says GIT_HASH_SHA1 is 1 and GIT_HASH_SHA256 is 2.  So

	int oidv = hash_algo_by_ptr(algop);
	switch (oidv) {
	case GIT_HASH_SHA1:
	case GIT_HASH_SHA256:
		return oidv;
	default:
		die();
	}

should work already.  To put it differently, if this didn't work, we
should renumber GIT_HASH_SHA1 and GIT_HASH_SHA256 to make it work, I
would think.  If not, we have a huge mess on our hands, as constants
used in on-disk file formats are hard (almost impossible) to change.

An overly generic function name oid_version() cannot be justified
unless the same constants are used everywhere.  I see hits from 'git
grep oid_version' in

    chunk-format.c (obviously)
    commit-graph.c
    midx.c
    pack-write.c

so presumably these types of files are using the "canonical"
numbering.

And when we introduce GIT_HASH_SHA3 or whatever, we should give it a
number that this function can return (i.e. from the range 3..255).
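
To make the comparison concrete, here is a minimal standalone sketch of
the two styles discussed in this subthread. It is not code from the
series: the enum is a stand-in for the hash.h constants quoted earlier
(0, 1 and 2), both function names are hypothetical, and returning 0
stands in for dying on an unknown algorithm:

```c
#include <stdint.h>

/* Stand-ins for the hash.h constants quoted earlier in the thread. */
enum {
	HASH_UNKNOWN = 0,
	HASH_SHA1 = 1,
	HASH_SHA256 = 2
};

/* Explicit mapping, in the style of the patch. */
static uint8_t oid_version_explicit(int algo)
{
	switch (algo) {
	case HASH_SHA1:
		return 1;
	case HASH_SHA256:
		return 2;
	default:
		return 0; /* stand-in for die() */
	}
}

/* Pass-through mapping, in the style suggested in the review. */
static uint8_t oid_version_passthrough(int algo)
{
	switch (algo) {
	case HASH_SHA1:
	case HASH_SHA256:
		return (uint8_t)algo;
	default:
		return 0; /* stand-in for die() */
	}
}
```

The two agree only for as long as the in-memory constants and the
on-disk numbering are kept identical, which is the promise being
discussed above.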


* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03 23:35       ` Taylor Blau
@ 2022-03-04 10:40         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 200+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-04 10:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Thu, Mar 03 2022, Taylor Blau wrote:

> On Thu, Mar 03, 2022 at 05:45:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Mar 02 2022, Taylor Blau wrote:
>>
>> > Now that the `.mtimes` format is defined, supplement the pack-write API
>> > to be able to conditionally write an `.mtimes` file along with a pack by
>> > setting an additional flag and passing an oidmap that contains the
>> > timestamps corresponding to each object in the pack.
>> > [...]
>> >  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
>> > diff --git a/pack.h b/pack.h
>> > index fd27cfdfd7..01d385903a 100644
>> > --- a/pack.h
>> > +++ b/pack.h
>> > @@ -44,6 +44,7 @@ struct pack_idx_option {
>> >  #define WRITE_IDX_STRICT 02
>> >  #define WRITE_REV 04
>> >  #define WRITE_REV_VERIFY 010
>> > +#define WRITE_MTIMES 020
>> >
>> >  	uint32_t version;
>> >  	uint32_t off32_limit;
>>
>> Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
>> prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
>> as 8|2, but there's no 8 there., ditto this new 020 that's the same as
>> 1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.
>
> I'm not sure I understand. These are octals, so octal "20" (or decimal
> 16) just gives us bit 5 -- the next available -- by itself.

Urgh, tired/rushed eyes yesterday. I managed to read these as decimals,
sorry.

I see from:

    git grep 'define[^0-9]*(\b020\b|\b16\b|1.*<<.*\b4\b)[^0-9]*$'

That I managed to patch what seems to be one of two other places in the
codebase using it recently (that goes >=020) in 245b9488150 (cat-file:
use GET_OID_ONLY_TO_DIE in --(textconv|filters), 2021-12-28).

Anyway, I think nothing needs to be done here. If you ever feel like
some churn here I think converting it to the almost ubiquitous "1 << N"
style we use almost everywhere else would be an improvement :)

Sorry!


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-07 18:03     ` Jonathan Nieder
  2022-03-22  1:16       ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Jonathan Nieder @ 2022-03-07 18:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:

> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.

Sorry for the very slow review!  I've mentioned a few times that this
overlaps in interesting ways with the gc mechanism described in
hash-function-transition.txt, so I'd like to compare and see how they
interact.

[...]
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,97 @@
[...]
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose objects and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes.

It's worth noting that this behavior is already racy.  That is because
when an unreachable object becomes newly reachable, we do not update
its mtime and the mtimes of every object reachable from it, so if it
then becomes transiently unreachable again then it can be wrongly
collected.

[...]
>                                these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.

Yes!  I'm happy we're making progress on this.

> +
> +== Cruft packs
> +
> +A cruft pack eliminates the need for storing unreachable objects in a loose
> +state by including the per-object mtimes in a separate file alongside a single
> +pack containing all loose objects.

Can this doc say a little about how "git prune" handles these files?
In particular, does a non cruft pack aware copy of Git (or JGit,
libgit2, etc) do the right thing or does it fight with this mechanism?
If the latter, do we have a repository extension (extensions.*) to
prevent that?

[...]
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.

As a point of comparison, the design in hash-function-transition uses
a single timestamp for the whole pack.  During read operations, objects
in a cruft pack are considered present; during writes, they are
considered _not present_ so that if we want to make a cruft object
newly present then we put a copy of it in a new pack.

Advantage of the mtimes file approach:
- less duplication of storage: a revived object is only stored once,
  in a cruft pack, and then the next gc can "graduate" it out of the
  cruft pack and shrink the cruft pack
- less effect on non-gc Git code: writes don't need to know that any
  cruft objects referenced need to be copied into a new pack

Advantages of the mtime per cruft pack approach:
- easy expiration: once a cruft pack has reached its expiration date,
  it can be deleted as a whole
- less I/O churn: a cruft pack stays as-is until combined into another
  cruft pack or deleted.  There is no frequently-modified mtimes file
  associated to it
- informs the storage layer about what is likely to be accessed: cruft
  packs can get filesystem attributes to put them in less-optimized
  storage since they are likely to be less frequently read

[...]
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Storing unreachable objects in multiple cruft packs.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Storing unreachable objects among multiple cruft packs (e.g., creating a new
> +cruft pack during each repacking operation including only unreachable objects
> +which aren't already stored in an earlier cruft pack) is significantly more
> +complicated to construct, and so isn't pursued here. The obvious drawback to
> +the current implementation is that the entire cruft pack must be re-written from
> +scratch.

This doesn't mention the approach described in
hash-function-transition.txt (and that's already implemented and has
been in use for many years in JGit's DfsRepository).  Does that mean
you aren't aware of it?

Thanks,
Jonathan


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-07 18:03     ` Jonathan Nieder
@ 2022-03-22  1:16       ` Taylor Blau
  2022-03-22 21:45         ` Jonathan Nieder
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-22  1:16 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
> Sorry for the very slow review!  I've mentioned a few times that this
> overlaps in interesting ways with the gc mechanism described in
> hash-function-transition.txt, so I'd like to compare and see how they
> interact.

Sorry for my equally-slow reply ;). I was on vacation last week and
wasn't following the list closely.

> > +Unreachable objects aren't removed immediately, since doing so could race with
> > +an incoming push which may reference an object which is about to be deleted.
> > +Instead, those unreachable objects are stored as loose objects and stay that way
> > +until they are older than the expiration window, at which point they are removed
> > +by linkgit:git-prune[1].
> > +
> > +Git must store these unreachable objects loose in order to keep track of their
> > +per-object mtimes.
>
> It's worth noting that this behavior is already racy.  That is because
> when an unreachable object becomes newly reachable, we do not update
> its mtime and the mtimes of every object reachable from it, so if it
> then becomes transiently unreachable again then it can be wrongly
> collected.

Just to be clear, the race here only happens if the object in question
becomes reachable _after_ a pruning GC determines its mtime. If that's
the case, then the object will indeed be wrongly collected. This is
consistent with the existing behavior (which is racy in the exact same
way).

(After re-reading what you wrote and my response, I think we are saying
the exact same thing, but it doesn't hurt to think aloud).

> > +
> > +== Cruft packs
> > +
> > +A cruft pack eliminates the need for storing unreachable objects in a loose
> > +state by including the per-object mtimes in a separate file alongside a single
> > +pack containing all loose objects.
>
> Can this doc say a little about how "git prune" handles these files?
> In particular, does a non cruft pack aware copy of Git (or JGit,
> libgit2, etc) do the right thing or does it fight with this mechanism?
> If the latter, do we have a repository extension (extensions.*) to
> prevent that?

I mentioned this in much more detail in [1], but the answer is that the
cruft pack looks like any other pack, it just happens to have another
metadata file (the .mtimes one) attached to it. So other implementations
of Git should treat it as they would any other pack. Like I mentioned in
[1], cruft packs were designed with the explicit goal of not requiring a
repository extension.

> > +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> > +     timestamps.
>
> As a point of comparison, the design in hash-function-transition uses
> a single timestamp for the whole pack.  During read operations, objects
> in a cruft pack are considered present; during writes, they are
> considered _not present_ so that if we want to make a cruft object
> newly present then we put a copy of it in a new pack.
>
> Advantage of the mtimes file approach:
> - less duplication of storage: a revived object is only stored once,
>   in a cruft pack, and then the next gc can "graduate" it out of the
>   cruft pack and shrink the cruft pack
> - less effect on non-gc Git code: writes don't need to know that any
>   cruft objects referenced need to be copied into a new pack
>
> Advantages of the mtime per cruft pack approach:
> - easy expiration: once a cruft pack has reached its expiration date,
>   it can be deleted as a whole
> - less I/O churn: a cruft pack stays as-is until combined into another
>   cruft pack or deleted.  There is no frequently-modified mtimes file
>   associated to it
> - informs the storage layer about what is likely to be accessed: cruft
>   packs can get filesystem attributes to put them in less-optimized
>   storage since they are likely to be less frequently read
>
> [...]

The key advantage of cruft packs is that you can expire unreachable
objects piecemeal while still retaining the benefit of being able to
de-duplicate cruft objects and store them packed against each other.
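
A toy model of that difference, purely as a sketch: both function names
are hypothetical, timestamps are modeled as `uint32_t` seconds, and
"expired" is assumed to mean strictly older than the cutoff. None of
this is taken from the series itself:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Per-object mtimes: count the objects that may be dropped, i.e. those
 * whose own timestamp is strictly older than 'cutoff'.
 */
static size_t count_expired_per_object(const uint32_t *mtimes, size_t n,
				       uint32_t cutoff)
{
	size_t i, expired = 0;

	for (i = 0; i < n; i++)
		if (mtimes[i] < cutoff)
			expired++;
	return expired;
}

/*
 * One mtime for the whole pack: expiration is all-or-nothing, so either
 * all 'n' objects may be dropped or none of them.
 */
static size_t count_expired_whole_pack(uint32_t pack_mtime, size_t n,
				       uint32_t cutoff)
{
	return pack_mtime < cutoff ? n : 0;
}
```

With object mtimes of 100, 200 and 300 and a cutoff of 250, the
per-object model lets the two oldest objects go, while a single pack
carrying the newest timestamp (300) keeps all three until the cutoff
passes it.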

> > +Notable alternatives to this design include:
>
> This doesn't mention the approach described in
> hash-function-transition.txt (and that's already implemented and has
> been in use for many years in JGit's DfsRepository).  Does that mean
> you aren't aware of it?

Implementing the UNREACHABLE_GARBAGE concept from
hash-function-transition.txt in cruft-pack terms would be equivalent to
not writing the mtimes file at all. This follows from the fact that a
pre-cruft-packs implementation of Git considers a packed object's mtime
to be the same as that of the pack it's contained in. (I'm deliberately
avoiding any details from the h-f-t document regarding re-writing
objects contained in a garbage pack here, since this is separate from
the pack structure itself and could easily be implemented on top of
cruft packs.)

So I'm not sure what the alternative we'd list would be, since it
removes the key feature of the design of cruft packs.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YiZMhuI%2FDdpvQ%2FED@nand.local/


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22  1:16       ` Taylor Blau
@ 2022-03-22 21:45         ` Jonathan Nieder
  2022-03-22 22:02           ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Jonathan Nieder @ 2022-03-22 21:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:
> On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:

>> Sorry for the very slow review!  I've mentioned a few times that this
>> overlaps in interesting ways with the gc mechanism described in
>> hash-function-transition.txt, so I'd like to compare and see how they
>> interact.
>
> Sorry for my equally-slow reply ;). I was on vacation last week and
> wasn't following the list closely.

No problem --- thanks for getting back to me.

[...]
> (After re-reading what you wrote and my response, I think we are saying
> the exact same thing, but it doesn't hurt to think aloud).

Great.  Can the doc cover this?  I think it would be helpful to make
that easy to find for others with similar questions.

If it's a matter of finding enough time to write some text, let me
know and I can try to find some time to help.

[...]
>> Can this doc say a little about how "git prune" handles these files?
>> In particular, does a non cruft pack aware copy of Git (or JGit,
>> libgit2, etc) do the right thing or does it fight with this mechanism?
>> If the latter, do we have a repository extension (extensions.*) to
>> prevent that?
>
> I mentioned this in much more detail in [1], but the answer is that the
> cruft pack looks like any other pack, it just happens to have another
> metadata file (the .mtimes one) attached to it. So other implementations
> of Git should treat it as they would any other pack. Like I mentioned in
> [1], cruft packs were designed with the explicit goal of not requiring a
> repository extension.

Sorry, the above seems like it's answering a different question than I
asked.  The doc in Documentation/technical/ seems like a natural place
to describe what semantics the new .mtimes file has, and I didn't find
that there.  Is there a different piece of documentation I should have
been looking at?

Can you tell me a little more about why we would want _not_ to have a
repository format extension?  To me, it seems like a fairly simple
addition that would drastically reduce the cognitive overload for
people considering making use of this feature.

[...]
> The key advantage of cruft packs is that you can expire unreachable
> objects piecemeal while still retaining the benefit of being able to
> de-duplicate cruft objects and store them packed against each other.

Can you say a little more about this?  My experience with the similar
feature in JGit is that it has been helpful to be able to expire a
cruft pack altogether; since objects that became reachable around the
same time get packed at the same time, it's not obvious to me what
benefit this extra piecemeal capability brings.

That doesn't mean the benefit doesn't exist, just that it seems like
there's a piece of context I'm still missing.

>>> +Notable alternatives to this design include:
>>
>> This doesn't mention the approach described in
>> hash-function-transition.txt (and that's already implemented and has
>> been in use for many years in JGit's DfsRepository).  Does that mean
>> you aren't aware of it?
>
> Implementing the UNREACHABLE_GARBAGE concept from
> hash-function-transition.txt in cruft pack-terms would be equivalent to
> not writing the mtimes file at all. This follows from the fact that a
> pre-cruft packs implementation of Git considers a packed object's mtime
> to be the same as the pack it's contained in. (I'm deliberately
> avoiding any details from the h-f-t document regarding re-writing
> objects contained in a garbage pack here, since this is separate from
> the pack structure itself (and could easily be implemented on top of
> cruft packs)).
>
> So I'm not sure what the alternative we'd list would be, since it
> removes the key feature of the design of cruft packs.

Sorry, I don't understand this answer either.  Do you mean to say that
JGit's DfsRepository does not in fact have a cruft packs like feature
that is live in the wild?  Or that that feature is equivalent to not
having such a feature?  Or something else?

To be clear, I'm not trying to say that that's superior to what you've
proposed here --- only that documenting the comparison would be
useful.

Puzzled,
Jonathan


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 21:45         ` Jonathan Nieder
@ 2022-03-22 22:02           ` Taylor Blau
  2022-03-22 23:04             ` Jonathan Nieder
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-22 22:02 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
>
> >> Sorry for the very slow review!  I've mentioned a few times that this
> >> overlaps in interesting ways with the gc mechanism described in
> >> hash-function-transition.txt, so I'd like to compare and see how they
> >> interact.
> >
> > Sorry for my equally-slow reply ;). I was on vacation last week and
> > wasn't following the list closely.
>
> No problem --- thanks for getting back to me.
>
> [...]
> > (After re-reading what you wrote and my response, I think we are saying
> > the exact same thing, but it doesn't hurt to think aloud).
>
> Great.  Can the doc cover this?  I think it would be helpful to make
> that easy to find for others with similar questions.

I believe the doc covers this already, see the paragraph beginning with
"Unreachable objects aren't removed immediately...".

> >> Can this doc say a little about how "git prune" handles these files?
> >> In particular, does a non cruft pack aware copy of Git (or JGit,
> >> libgit2, etc) do the right thing or does it fight with this mechanism?
> >> If the latter, do we have a repository extension (extensions.*) to
> >> prevent that?
> >
> > I mentioned this in much more detail in [1], but the answer is that the
> > cruft pack looks like any other pack, it just happens to have another
> > metadata file (the .mtimes one) attached to it. So other implementations
> > of Git should treat it as they would any other pack. Like I mentioned in
> > [1], cruft packs were designed with the explicit goal of not requiring a
> > repository extension.
>
> Sorry, the above seems like it's answering a different question than I
> asked.  The doc in Documentation/technical/ seems like a natural place
> to describe what semantics the new .mtimes file has, and I didn't find
> that there.  Is there a different piece of documentation I should have
> been looking at?

Are you looking for a technical description of the mtimes file? If so,
there is a section in Documentation/technical/pack-format.txt (added in
"pack-mtimes: support reading .mtimes files") that explains this.

> Can you tell me a little more about why we would want _not_ to have a
> repository format extension?  To me, it seems like a fairly simple
> addition that would drastically reduce the cognitive overload for
> people considering making use of this feature.

There is no reason to prevent a pre-cruft packs version of Git from
reading/writing a repository that uses cruft packs, since the two
versions will still function as normal. Since there's no need to prevent
the old version from interacting with a repository that has cruft packs,
we wouldn't want to enforce an unnecessary boundary with an extension.

> [...]
> > The key advantage of cruft packs is that you can expire unreachable
> > objects in piecemeal while still retaining the benefit of being able to
> > de-duplicate cruft objects and store them packed against each other.
>
> Can you say a little more about this?  My experience with the similar
> feature in JGit is that it has been helpful to be able to expire a
> cruft pack altogether; since objects that became reachable around the
> same time get packed at the same time, it's not obvious to me what
> benefit this extra piecemeal capability brings.
>
> That doesn't mean the benefit doesn't exist, just that it seems like
> there's a piece of context I'm still missing.

Expiring objects in piecemeal is somewhat interesting, but I think I was
reaching a little too far when I said it was the "key benefit". It does
have some nice properties, like being able to store cruft objects as
deltas against other cruft objects which might get pruned at a different
time (though, of course, you'll need to re-delta them in the case you do
prune an object which is the base of another cruft object).

But the issue with having multiple cruft packs is that the semantics get
significantly more complicated. E.g., if you have an object represented
in multiple cruft packs, which mtime do you use? If you want to prune
it, you suddenly may have many packs you need to update and keep track
of.

> >>> +Notable alternatives to this design include:
> >>
> >> This doesn't mention the approach described in
> >> hash-function-transition.txt (and that's already implemented and has
> >> been in use for many years in JGit's DfsRepository).  Does that mean
> >> you aren't aware of it?
> >
> > Implementing the UNREACHABLE_GARBAGE concept from
> > hash-function-transition.txt in cruft pack-terms would be equivalent to
> > not writing the mtimes file at all. This follows from the fact that a
> > pre-cruft packs implementation of Git considers a packed object's mtime
> > to be the same as the pack it's contained in. (I'm deliberately
> > avoiding any details from the h-f-t document regarding re-writing
> > objects contained in a garbage pack here, since this is separate from
> > the pack structure itself (and could easily be implemented on top of
> > cruft packs)).
> >
> > So I'm not sure what the alternative we'd list would be, since it
> > removes the key feature of the design of cruft packs.
>
> Sorry, I don't understand this answer either.  Do you mean to say that
> JGit's DfsRepository does not in fact have a cruft packs like feature
> that is live in the wild?  Or that that feature is equivalent to not
> having such a feature?  Or something else?
>
> To be clear, I'm not trying to say that that's superior to what you've
> proposed here --- only that documenting the comparison would be
> useful.

I'm not familiar enough with JGit (or its DfsRepository class) to know
how to answer this. I was comparing cruft packs to the
UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
doc, and noting the differences there.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 22:02           ` Taylor Blau
@ 2022-03-22 23:04             ` Jonathan Nieder
  2022-03-23  1:01               ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Jonathan Nieder @ 2022-03-22 23:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:
> On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:

>> Great.  Can the doc cover this?  I think it would be helpful to make
>> that easy to find for others with similar questions.
>
> I believe the doc covers this already, see the paragraph beginning with
> "Unreachable objects aren't removed immediately...".

Thanks.  I just reread that section and it didn't say anything obvious
about the race that continues to exist and whether cruft packs address
it.

[...]
>> Sorry, the above seems like it's answering a different question than I
>> asked.  The doc in Documentation/technical/ seems like a natural place
>> to describe what semantics the new .mtimes file has, and I didn't find
>> that there.  Is there a different piece of documentation I should have
>> been looking at?
>
> Are you looking for a technical description of the mtimes file? If so,
> there is a section in Documentation/technical/pack-format.txt (added in
> "pack-mtimes: support reading .mtimes files") that explains this.

I see --- is the idea that cruft-packs.txt means to refer to
pack-format.txt for the details, and cruft-packs is an overview of some
other, non-detail aspects?

I just checked pack-format.txt and it didn't describe the semantics
(what a Git implementation is expected to do when it sees an mtimes
file).  For example, in Documentation/technical/cruft-packs.txt, the
kind of thing I'd expect to see is

- what does an mtime value in the mtimes file represent?  When is it
  meant to be updated?
- what guarantees are present about when an object is safe to be
  pruned?

[...]
>> Can you tell me a little more about why we would want _not_ to have a
>> repository format extension?  To me, it seems like a fairly simple
>> addition that would drastically reduce the cognitive overload for
>> people considering making use of this feature.
>
> There is no reason to prevent a pre-cruft packs version of Git from
> reading/writing a repository that uses cruft packs, since the two
> versions will still function as normal. Since there's no need to prevent
> the old version from interacting with a repository that has cruft packs,
> we wouldn't want to enforce an unnecessary boundary with an extension.

Does "function as normal" include in repository maintenance operations
like "git maintenance", "git gc", and "git prune"?  If so, this seems
like something very useful to describe in the cruft-packs.txt
document, since what happens when we bounce back and forth between old
and new versions of Git operating on the same NFS mounted repository
would not be obvious without such a discussion.

I'm still interested in the _downsides_ of using a repository format
extension.  "There is no reason" is not a downside, unless you mean
that it requires adding a line of code. :)  The main downside I can
imagine is that it prevents accessing the repository _that has enabled
this feature_ with an older version of Git, but I (perhaps due to a
failure of imagination) haven't put two and two together yet about
when I would want to do so.

[...]
> Expiring objects in piecemeal is somewhat interesting, but I think I was
> reaching a little too far when I said it was the "key benefit". It does
> have some nice properties, like being able to store cruft objects as
> deltas against other cruft objects which might get pruned at a different
> time (though, of course, you'll need to re-delta them in the case you do
> prune an object which is the base of another cruft object).
>
> But the issue with having multiple cruft packs is that the semantics get
> significantly more complicated. E.g., if you have an object represented
> in multiple cruft packs, which mtime do you use? If you want to prune
> it, you suddenly may have many packs you need to update and keep track
> of.

Thanks for this explanation.  In hash-function-transition.txt, I see

	"git gc" currently expels any unreachable objects it encounters in
	pack files to loose objects in an attempt to prevent a race when
	pruning them (in case another process is simultaneously writing a new
	object that refers to the about-to-be-deleted object). This leads to
	an explosion in the number of loose objects present and disk space
	usage due to the objects in delta form being replaced with independent
	loose objects.  Worse, the race is still present for loose objects.

	Instead, "git gc" will need to move unreachable objects to a new
	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
	below). To avoid the race when writing new objects referring to an
	about-to-be-deleted object, code paths that write new objects will
	need to copy any objects from UNREACHABLE_GARBAGE packs that they
	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
	indicated by the file's mtime) is long enough ago.

	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
	combined under certain circumstances. [etc]

So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
pack refers to when that pack was written and governs when that pack
can be deleted.  If an object is present in multiple packs, then newer
packs with the object have a newer mtime and thus cause the object to
be kept around for longer.
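
To make the multiple-pack case concrete, here is a minimal C sketch of that
rule, assuming a hypothetical helper that is handed the file mtimes of every
UNREACHABLE_GARBAGE pack containing a given object. Neither the function
name nor its signature comes from Git or JGit; it only illustrates that the
newest containing pack decides how long the object survives.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not a Git or JGit API): given the file mtimes of
 * all garbage packs that contain one object, return the mtime that
 * effectively governs the object's lifetime. Because a pack is deleted
 * as a whole, the object survives until the newest pack holding it has
 * aged out. Returns 0 when the object is in no garbage pack.
 */
static time_t effective_object_mtime(const time_t *pack_mtimes, size_t nr)
{
	time_t newest = 0;

	for (size_t i = 0; i < nr; i++)
		if (pack_mtimes[i] > newest)
			newest = pack_mtimes[i];
	return newest;
}
```

With per-object mtimes in a single cruft pack there is only one value to
consult, which is the simplification discussed above.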

[...]
>> Sorry, I don't understand this answer either.  Do you mean to say that
>> JGit's DfsRepository does not in fact have a cruft packs like feature
>> that is live in the wild?  Or that that feature is equivalent to not
>> having such a feature?  Or something else?
>>
>> To be clear, I'm not trying to say that that's superior to what you've
>> proposed here --- only that documenting the comparison would be
>> useful.
>
> I'm not familiar enough with JGit (or its DfsRepository class) to know
> how to answer this. I was comparing cruft packs to the
> UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
> doc, and noting the differences there.

Thanks.  I think there's some implied feedback about the documentation
of UNREACHABLE_GARBAGE there, because if I understand correctly, you're
saying that it does not describe maintaining cruft packs.  Perhaps a
pointer to the particular sentence that led you to that conclusion
would help.

Sincerely,
Jonathan


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 23:04             ` Jonathan Nieder
@ 2022-03-23  1:01               ` Taylor Blau
  2022-03-28 18:46                 ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-23  1:01 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 04:04:53PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
>
> >> Great.  Can the doc cover this?  I think it would be helpful to make
> >> that easy to find for others with similar questions.
> >
> > I believe the doc covers this already, see the paragraph beginning with
> > "Unreachable objects aren't removed immediately...".
>
> Thanks.  I just reread that section and it didn't say anything obvious
> about the race that continues to exist and whether cruft packs address
> it.

Yeah, there isn't an explicit "and cruft packs address the loose
object explosion but do not address the race" sentence. I'm not
opposed to adding something like that to clarify (though TBH, I would
rather do it as a clean-up on top rather than send out a bazillion
mostly-unchanged patches).

> [...]
> >> Sorry, the above seems like it's answering a different question than I
> >> asked.  The doc in Documentation/technical/ seems like a natural place
> >> to describe what semantics the new .mtimes file has, and I didn't find
> >> that there.  Is there a different piece of documentation I should have
> >> been looking at?
> >
> > Are you looking for a technical description of the mtimes file? If so,
> > there is a section in Documentation/technical/pack-format.txt (added in
> > "pack-mtimes: support reading .mtimes files") that explains this.
>
> I see --- is the idea that cruft-packs.txt means to refer to
> pack-format.txt for the details, and cruft-packs is an overview of some
> other, non-detail aspects?
>
> I just checked pack-format.txt and it didn't describe the semantics
> (what a Git implementation is expected to do when it sees an mtimes
> file).  For example, in Documentation/technical/cruft-packs.txt, the
> kind of thing I'd expect to see is

Right; I have always considered the files in Documentation/technical to
primarily be about the file format itself.

> - what does an mtime value in the mtimes file represent?  When is it
>   meant to be updated?
> - what guarantees are present about when an object is safe to be
>   pruned?

The cruft-packs.txt document covers these, though I think somewhat
implicitly. Again, I'm not opposed to more clarification, but I again
would like to do so on top.

I think many of these are discussed within the threads above, but to
answer your questions in order:

  - The mtime of an object in a cruft pack represents the last time that
    object was known to be reachable, and it's updated when generating a
    cruft pack or pruning.

  - The same guarantees are made in the cruft pack case as in the
    non-cruft case (i.e., "none", and so a grace period is recommended).
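
As a rough illustration of those two answers, the pruning decision can be
sketched in C as below. This is a sketch under stated assumptions, not code
from the series: `struct cruft_entry` and `cruft_entry_is_expired()` are
hypothetical names, and the 32-bit width of the timestamp is assumed here
for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record: one object's timestamp as read from a .mtimes file. */
struct cruft_entry {
	uint32_t mtime; /* seconds since the epoch (width is an assumption) */
};

/*
 * An object becomes a pruning candidate only once its recorded mtime is
 * older than the grace period. The grace period narrows the race with
 * concurrent writers but does not close it, which is why "none" is the
 * honest answer about guarantees.
 */
static bool cruft_entry_is_expired(const struct cruft_entry *e,
				   uint32_t now, uint32_t grace_period)
{
	if (now < grace_period)
		return false; /* cutoff would underflow; nothing is old enough */
	return e->mtime < now - grace_period;
}
```

For example, with `now = 10000` and a grace period of 1000 seconds, an
entry with mtime 1000 is a candidate while one with mtime 9500 is kept.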

> [...]
> >> Can you tell me a little more about why we would want _not_ to have a
> >> repository format extension?  To me, it seems like a fairly simple
> >> addition that would drastically reduce the cognitive overload for
> >> people considering making use of this feature.
> >
> > There is no reason to prevent a pre-cruft packs version of Git from
> > reading/writing a repository that uses cruft packs, since the two
> > versions will still function as normal. Since there's no need to prevent
> > the old version from interacting with a repository that has cruft packs,
> > we wouldn't want to enforce an unnecessary boundary with an extension.
>
> Does "function as normal" include in repository maintenance operations
> like "git maintenance", "git gc", and "git prune"?  If so, this seems
> like something very useful to describe in the cruft-packs.txt
> document, since what happens when we bounce back and forth between old
> and new versions of Git operating on the same NFS mounted repository
> would not be obvious without such a discussion.

Yes, all of those commands will simply ignore the .mtimes file and treat
the unreachable objects as normal (where "normal" means in the exact
same way as they currently do without cruft packs). I think adding a
section that summarizes our discussion would be useful.

> I'm still interested in the _downsides_ of using a repository format
> extension.  "There is no reason" is not a downside, unless you mean
> that it requires adding a line of code. :)  The main downside I can
> imagine is that it prevents accessing the repository _that has enabled
> this feature_ with an older version of Git, but I (perhaps due to a
> failure of imagination) haven't put two and two together yet about
> when I would want to do so.

Sorry for not being clear; I meant: "There is no reason [to prohibit
two versions of Git from interacting with each other when they are
compatible to do so]".

> [...]
> > Expiring objects in piecemeal is somewhat interesting, but I think I was
> > reaching a little too far when I said it was the "key benefit". It does
> > have some nice properties, like being able to store cruft objects as
> > deltas against other cruft objects which might get pruned at a different
> > time (though, of course, you'll need to re-delta them in the case you do
> > prune an object which is the base of another cruft object).
> >
> > But the issue with having multiple cruft packs is that the semantics get
> > significantly more complicated. E.g., if you have an object represented
> > in multiple cruft packs, which mtime do you use? If you want to prune
> > it, you suddenly may have many packs you need to update and keep track
> > of.
>
> Thanks for this explanation.  In hash-function-transition.txt, I see
>
> 	"git gc" currently expels any unreachable objects it encounters in
> 	pack files to loose objects in an attempt to prevent a race when
> 	pruning them (in case another process is simultaneously writing a new
> 	object that refers to the about-to-be-deleted object). This leads to
> 	an explosion in the number of loose objects present and disk space
> 	usage due to the objects in delta form being replaced with independent
> 	loose objects.  Worse, the race is still present for loose objects.
>
> 	Instead, "git gc" will need to move unreachable objects to a new
> 	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
> 	below). To avoid the race when writing new objects referring to an
> 	about-to-be-deleted object, code paths that write new objects will
> 	need to copy any objects from UNREACHABLE_GARBAGE packs that they
> 	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
> 	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
> 	indicated by the file's mtime) is long enough ago.
>
> 	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
> 	combined under certain circumstances. [etc]
>
> So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
> pack refers to when that pack was written and governs when that pack
> can be deleted.  If an object is present in multiple packs, then newer
> packs with the object have a newer mtime and thus cause the object to
> be kept around for longer.

That matches my understanding.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-23  1:01               ` Taylor Blau
@ 2022-03-28 18:46                 ` Taylor Blau
  2022-03-28 20:55                   ` Junio C Hamano
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-28 18:46 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 09:01:43PM -0400, Taylor Blau wrote:
> > >> Can you tell me a little more about why we would want _not_ to have a
> > >> repository format extension?  To me, it seems like a fairly simple
> > >> addition that would drastically reduce the cognitive overload for
> > >> people considering making use of this feature.
> > >
> > > There is no reason to prevent a pre-cruft packs version of Git from
> > > reading/writing a repository that uses cruft packs, since the two
> > > versions will still function as normal. Since there's no need to prevent
> > > the old version from interacting with a repository that has cruft packs,
> > > we wouldn't want to enforce an unnecessary boundary with an extension.
> >
> > Does "function as normal" include in repository maintenance operations
> > like "git maintenance", "git gc", and "git prune"?  If so, this seems
> > like something very useful to describe in the cruft-packs.txt
> > document, since what happens when we bounce back and forth between old
> > and new versions of Git operating on the same NFS mounted repository
> > would not be obvious without such a discussion.
>
> Yes, all of those commands will simply ignore the .mtimes file and treat
> the unreachable objects as normal (where "normal" means in the exact
> same way as they currently do without cruft packs). I think adding a
> section that summarizes our discussion would be useful.
>
> > I'm still interested in the _downsides_ of using a repository format
> > extension.  "There is no reason" is not a downside, unless you mean
> > that it requires adding a line of code. :)  The main downside I can
> > imagine is that it prevents accessing the repository _that has enabled
> > this feature_ with an older version of Git, but I (perhaps due to a
> > failure of imagination) haven't put two and two together yet about
> > when I would want to do so.
>
> Sorry for not being clear; I meant: "There is no reason [to prohibit
> two versions of Git from interacting with each other when they are
> compatible to do so]".

Jonathan, myself, and others discussed this extensively in today's
standup.

To summarize Jonathan's point (as I think I severely misunderstood it
before): if two writers are repacking a repository with unreachable
objects, the following can happen:

  - $NEWGIT packs the repository and writes a cruft pack and .mtimes
    file.

  - $OLDGIT packs the repository, exploding unreachable objects from the
    cruft pack as loose, setting their mtimes to "now".

This causes the repository to lose information about the unreachable
objects' mtimes, which would cause the repository to never prune objects
(except for when `--unpack-unreachable=now` is passed).
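
The effect can be seen in a toy model. The C sketch below is purely
illustrative and takes the scenario exactly as described above: one
unreachable object, repacked at a fixed interval, with every second repack
done by a writer that resets the object's mtime to "now". The function name
and parameters are hypothetical, and nothing here reflects how either
version of Git is actually implemented.

```c
#include <stdbool.h>

/*
 * Toy model of the scenario above (hypothetical, not Git code).
 * A single unreachable object starts with mtime 0 and is repacked once
 * every `interval` seconds. A cruft-aware writer preserves the mtime; when
 * `mixed_versions` is set, every second repack is done by an older writer
 * that explodes the object as loose and resets its mtime to "now".
 * Returns true if the object is ever older than `grace_period` at the
 * time of a repack, i.e. if it ever becomes prunable.
 */
static bool object_ever_expires(long interval, long grace_period,
				int rounds, bool mixed_versions)
{
	long now = 0, mtime = 0;

	for (int i = 0; i < rounds; i++) {
		now += interval;
		if (now - mtime > grace_period)
			return true; /* old enough to prune at this repack */
		if (mixed_versions && (i % 2) == 1)
			mtime = now; /* older writer rewrote it as loose */
	}
	return false;
}
```

In this model the object never expires whenever the older writer repacks
at least once per grace period; if it repacks less often than that, the
object can still age out between resets.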

One approach (that Jonathan suggested) is to prevent the above situation
by introducing a format extension, so $OLDGIT could not touch the
repository. But this comes at a (in my view, significant) cost which is
that $OLDGIT can't touch the repository _at all_. An extension would be
desirable if cross-version interaction resulted in repository
corruption, but this scenario does not lead to corruption at all.

Another approach (courtesy Stolee, in an off-list discussion) is that we
could introduce an optional extension available as an opt-in to prevent
older versions of Git from interacting in a repository that contains
cruft packs, but is not required to write them.

A third approach (and probably my preferred direction) is to indicate
clearly via a combination of updates to
Documentation/technical/cruft-packs.txt and the release notes that say
something along the lines of:

    If you are repacking a repository using both a pre- and
    post-cruft packs version of Git, please be aware that you will lose
    information about the mtimes of unreachable objects.

I imagine that would probably be sufficient, but we could also introduce
the opt-in extension as an easy alternative to avoid forcing an upgrade
of Git.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 18:46                 ` Taylor Blau
@ 2022-03-28 20:55                   ` Junio C Hamano
  2022-03-28 21:21                     ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2022-03-28 20:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> To summarize Jonathan's point (as I think I severely misunderstood it
> before): if two writers are repacking a repository with unreachable
> objects, the following can happen:
>
>   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
>     file.
>
>   - $OLDGIT packs the repository, exploding unreachable objects from the
>     cruft pack as loose, setting their mtimes to "now".

And if these repeat, alternating new and old versions of Git, we
will keep refreshing the unreachable objects' mtimes forever.

But once you stop using old versions of Git, perhaps in 3 release
cycles or so, we'll eventually be able to purge them, right?

> One approach (that Jonathan suggested) is to prevent the above situation
> by introducing a format extension, so $OLDGIT could not touch the
> repository. But this comes at a (in my view, significant) cost which is
> that $OLDGIT can't touch the repository _at all_. An extension would be
> desirable if cross-version interaction resulted in repository
> corruption, but this scenario does not lead to corruption at all.

A repository may not be in a healthy state, when tons of unreachable
objects stay around forever, but it probably is a bit too harsh to
call it "corrupt".

> Another approach (courtesy Stolee, in an off-list discussion) is that we
> could introduce an optional extension available as an opt-in to prevent
> older versions of Git from interacting in a repository that contains
> cruft packs, but is not required to write them.

That smells too magic; let's not go there.

> A third approach (and probably my preferred direction) is to indicate
> clearly via a combination of updates to
> Documentation/technical/cruft-packs.txt and the release notes that say
> something along the lines of:
>
>     If you are repacking a repository using both a pre- and
>     post-cruft packs version of Git, please be aware that you will lose
>     information about the mtimes of unreachable objects.

I do not quite see how it helps.  After hearing "... will lose
information about the mtimes ...", what concrete action can a user
take?  Or a sys-admin?

It's not like use of cruft-pack is mandatory when you upgrade the
new version of Git, right?  Perhaps use of cruft-pack should be
guarded behind a configuration variable so that users who might want
to use mixed versions of Git will be protected against accidental
use of new version of Git that introduces the forever-renewing
unreachable objects problem?

Perhaps a configuration variable, repack.cruftPackEnabled, that is
by default disabled, can be used to protect people who do not want
to get into the "keep refreshing mtime" loop from using the cruft
packs by mistake?  repack.cruftPackEnabled can probably be part of
the "experimental" feature set, if we think it is the direction in
the future.






* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 20:55                   ` Junio C Hamano
@ 2022-03-28 21:21                     ` Taylor Blau
  2022-03-29 15:59                       ` Junio C Hamano
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-28 21:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

On Mon, Mar 28, 2022 at 01:55:43PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > To summarize Jonathan's point (as I think I severely misunderstood it
> > before): if two writers are repacking a repository with unreachable
> > objects, the following can happen:
> >
> >   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
> >     file.
> >
> >   - $OLDGIT packs the repository, exploding unreachable objects from the
> >     cruft pack as loose, setting their mtimes to "now".
>
> And if these repeat, alternating new and old versions of Git, we
> will keep refreshing the unreachable objects' mtimes forever.
>
> But once you stop using old versions of Git, perhaps in 3 release
> cycles or so, we'll eventually be able to purge them, right?

As soon as all of the repackers understand cruft packs, yes.

> > One approach (that Jonathan suggested) is to prevent the above situation
> > by introducing a format extension, so $OLDGIT could not touch the
> > repository. But this comes at a (in my view, significant) cost which is
> > that $OLDGIT can't touch the repository _at all_. An extension would be
> > desirable if cross-version interaction resulted in repository
> > corruption, but this scenario does not lead to corruption at all.
>
> A repository may not be in a healthy state, when tons of unreachable
> objects stay around forever, but it probably is a bit too harsh to
> call it "corrupt".

I agree, though I would note that this is no worse than the situation
today, where unreachable-but-recent objects that are exploded as loose
can already cause the kinds of issues that this series is designed to
prevent.

> > Another approach (courtesy Stolee, in an off-list discussion) is that we
> > could introduce an optional extension available as an opt-in to prevent
> > older versions of Git from interacting in a repository that contains
> > cruft packs, but is not required to write them.
>
> That smells too magic; let's not go there.

I'm not sure... if we did:

--- 8< ---

diff --git a/setup.c b/setup.c
index 04ce33cdcd..fa54c9baa4 100644
--- a/setup.c
+++ b/setup.c
@@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "cruftpacks")) {
+		return EXTENSION_OK;
 	}

--- >8 ---

but nothing more, then a hypothetical `extensions.cruftPacks` could be
used to prevent older writers in a mixed version environment. But if you
don't have or care about older versions of Git, you can avoid setting it
altogether.

The key bit is that we don't have a check along the lines of "only allow
writing a cruft pack when extensions.cruftPacks is set", so it's opt-in
as far as the new code is concerned.
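
As a rough illustration of why setting an extension would lock older
versions out, here is a minimal C sketch. It assumes a simplified model
in which a Git binary refuses any repository that lists an extension it
does not recognize; `can_use_repository()` and `knows_extension()` are
hypothetical names, not functions from setup.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified model (not Git's setup.c): a Git binary
 * knows a fixed set of extension names and refuses to operate in a
 * repository that lists one it does not recognize.
 */
static int knows_extension(const char *const *known, size_t nr_known,
			   const char *ext)
{
	size_t i;

	for (i = 0; i < nr_known; i++)
		if (!strcmp(known[i], ext))
			return 1;
	return 0;
}

/* Returns 1 if every extension set in the repository is understood. */
int can_use_repository(const char *const *known, size_t nr_known,
		       const char *const *repo_exts, size_t nr_repo)
{
	size_t i;

	for (i = 0; i < nr_repo; i++)
		if (!knows_extension(known, nr_known, repo_exts[i]))
			return 0;
	return 1;
}
```

In this model, a binary whose known list lacks "cruftpacks" gets 0 for a
repository that sets it, while a repository that never sets the
extension stays usable by every version.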

> > A third approach (and probably my preferred direction) is to indicate
> > clearly via a combination of updates to Documentation/cruft-packs.txt
> > and the release notes that say something along the lines of:
> >
> >     If you use are repacking a repository using both a pre- and
> >     post-cruft packs version of Git, please be aware that you will lose
> >     information about the mtimes of unreachable objects.
>
> I do not quite see how it helps.  After hearing "... will lose
> information about the mtimes ...", what concrete action can a user
> take?  Or a sys-admin?
>
> It's not like use of cruft-pack is mandatory when you upgrade the
> new version of Git, right?  Perhaps use of cruft-pack should be
> guarded behind a configuration variable so that users who might want
> to use mixed versions of Git will be protected against accidental
> use of new version of Git that introduces the forever-renewing
> untracked objects problem?

I don't think we would have much to offer a user in that case; if the
mtimes are gone, then I couldn't think of anything to bring them back
outside of setting them manually.

But cruft packs are already guarded in two places:

  - `git repack` won't write a cruft pack unless given the `--cruft`
    flag (i.e., `git repack -A` doesn't suddenly start generating cruft
    packs upon upgrade).

  - `git gc` won't write cruft packs unless the `gc.cruftPacks`
    configuration is set, or `--cruft` is given as a flag.

I'd be curious what Jonathan and others think of that approach (which,
to be clear, is what this series already implements). We could make it
clear to say:

    If you have mixed versions of Git which both repack a repository
    (either manually or by auto-GC / background maintenance), consider
    leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
    command-line argument to `git repack` and `git gc`, since doing so
    can lead to [...]

> Perhaps a configuration variable, repack.cruftPackEnabled, that is
> by default disabled, can be used to protect people who do not want
> to get into the "keep refreshing mtime" loop from using the cruft
> packs by mistake?  repack.cruftPackEnabled can probably be part of
> the "experimental" feature set, if we think it is the direction in
> the future.

I'd probably want to leave `-A` separate from `--cruft`, since something
about setting `repack.cruftPackEnabled` having the effect of causing
`-A` to produce a cruft pack feels strange to me.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 21:21                     ` Taylor Blau
@ 2022-03-29 15:59                       ` Junio C Hamano
  2022-03-30  2:23                         ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2022-03-29 15:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> I'm not sure... if we did:
>
> --- 8< ---
>
> diff --git a/setup.c b/setup.c
> index 04ce33cdcd..fa54c9baa4 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
>  		return EXTENSION_OK;
> +	} else if (!strcmp(ext, "cruftpacks")) {
> +		return EXTENSION_OK;
>  	}
>
> --- >8 ---
>
> but nothing more, then a hypothetical `extensions.cruftPacks` could be
> used to prevent older writers in a mixed version environment. But if you
> don't have or care about older versions of Git, you can avoid setting it
> altogether.

Smells like "unsafe by default, but you can opt into safety", which
is backwards, isn't it?

>> I do not quite see how it helps.  After hearing "... will lose
>> information about the mtimes ...", what concrete action can a user
>> take?  Or a sys-admin?
>>
>> It's not like use of cruft-pack is mandatory when you upgrade the
>> new version of Git, right?  Perhaps use of cruft-pack should be
>> guarded behind a configuration variable so that users who might want
>> to use mixed versions of Git will be protected against accidental
>> use of new version of Git that introduces the forever-renewing
>> untracked objects problem?
>
> I don't think we would have much to offer a user in that case; if the
> mtimes are gone, then I couldn't think of anything to bring them back
> outside of setting them manually.

Yes, so rambling about losing mtimes in documentation or release
notes would not help users all that much.  Let's not do that.

> But cruft packs are already guarded in two places:
>
>   - `git repack` won't write a cruft pack unless given the `--cruft`
>     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
>     packs upon upgrade).
>
>   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
>     configuration is set, or `--cruft` is given as a flag.

Hmph, OK.  So individuals can sort-of protect from hurting
themselves by refraining from running these with --cruft or writing
--cruft in their maintenance scripts.  An organization that wants to
let the more adventurous types opt in early can prepare two
versions of the maintenance scripts they distribute to their users,
one with and the other without --cruft, and use the mechanism they
use for gradual rollouts to control the population.  Perhaps that
would be sufficient protection?  I dunno.

Jonathan, what do you think?

> I'd be curious what Jonathan and others think of that approach (which,
> to be clear, is what this series already implements). We could make it
> clear to say:
>
>     If you have mixed versions of Git which both repack a repository
>     (either manually or by auto-GC / background maintenance), consider
>     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
>     command-line argument to `git repack` and `git gc`, since doing so
>     can lead to [...]

That message is (depending on what comes in [...]) much more helpful
than just throwing a word "mtime" out and letting the reader figure
out the rest ;-)


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-29 15:59                       ` Junio C Hamano
@ 2022-03-30  2:23                         ` Taylor Blau
  2022-03-30 13:37                           ` Junio C Hamano
  0 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-03-30  2:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

On Tue, Mar 29, 2022 at 08:59:24AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I'm not sure... if we did:
> >
> > --- 8< ---
> >
> > diff --git a/setup.c b/setup.c
> > index 04ce33cdcd..fa54c9baa4 100644
> > --- a/setup.c
> > +++ b/setup.c
> > @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
> >  		return EXTENSION_OK;
> > +	} else if (!strcmp(ext, "cruftpacks")) {
> > +		return EXTENSION_OK;
> >  	}
> >
> > --- >8 ---
> >
> > but nothing more, then a hypothetical `extensions.cruftPacks` could be
> > used to prevent older writers in a mixed version environment. But if you
> > don't have or care about older versions of Git, you can avoid setting it
> > altogether.
>
> Smells like "unsafe by default, but you can opt into safety", which
> is backwards, isn't it?

I see it a little differently. The default (not writing cruft packs at
all) is safe, even in a mixed-version environment. If a user (a) wants
to use cruft packs, and (b) has older versions of Git also gc'ing the
repository, and (c) can't get rid of them, _then_ an opt-in extension
would make it impossible for those older versions to interact with the
repository.

I still can't shake the feeling that this is a pretty fringe and
timing-dependent scenario, which at worst keeps too many unreachable
objects around.

But I think this in conjunction with the already opt-in nature of cruft
packs would be a nice way to create safeguards for the situation
Jonathan described. There may be a simpler way, but I'm not sure I see
it (i.e., if you control whether or not `--cruft` is passed when
doing maintenance with newer versions of Git, but not whether older
versions are running around doing their own maintenance, then an
extension would be necessary to lock the old versions out).

> > But cruft packs are already guarded in two places:
> >
> >   - `git repack` won't write a cruft pack unless given the `--cruft`
> >     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
> >     packs upon upgrade).
> >
> >   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
> >     configuration is set, or `--cruft` is given as a flag.
>
> Hmph, OK.  So individuals can sort-of protect from hurting
> themselves by refraining from running these with --cruft or writing
> --cruft in their maintenance scripts.  An organization that wants to
> let the more adventurous types to early opt-in can prepare two
> versions of the maintenance scripts they distribute to their users,
> one with and the other without --cruft, and use the mechanism they
> use for gradual rollouts to control the population.  Perhaps that
> would make sufficient protection?  I dunno.
>
> Jonathan, what do you think?

I'm confused: if newer versions of Git are writing cruft packs, then
having the older versions gc'ing in the same repository runs into the
same scenario Jonathan originally describes.

The thing I think Jonathan seeks to prevent is older versions of Git
gc'ing a repo that has cruft packs. I think I may need you to clarify a
little, sorry :-(.

> > I'd be curious what Jonathan and others think of that approach (which,
> > to be clear, is what this series already implements). We could make it
> > clear to say:
> >
> >     If you have mixed versions of Git which both repack a repository
> >     (either manually or by auto-GC / background maintenance), consider
> >     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
> >     command-line argument to `git repack` and `git gc`, since doing so
> >     can lead to [...]
>
> That message is (depending on what comes in [...]) much more helpful
> than just throwing a word "mtime" out and letting the reader figure
> out the rest ;-)

Yes, totally agreed.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-30  2:23                         ` Taylor Blau
@ 2022-03-30 13:37                           ` Junio C Hamano
  2022-03-30 17:30                             ` Taylor Blau
  0 siblings, 1 reply; 200+ messages in thread
From: Junio C Hamano @ 2022-03-30 13:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> The thing I think Jonathan seeks to prevent is older versions of Git
> gc'ing a repo that has cruft packs. I think I may need you to clarify a
> little, sorry :-(.

By controlling the rollout of the "--cruft" option (the
assumption here is that, in a large organization, people do not
manually say "gc --cruft", and the organization can ship maintenance
scripts, run via cron or whatever, with and without
"--cruft"), you can control the number of repositories with cruft
packs that can potentially see older versions of Git running gc on
them.  Those users for whom it is not yet their turn to start using
the "--cruft" enabled version of the script will not have cruft packs,
so it does not matter if they keep an older version of Git somewhere
hidden in a hermetic build of an IDE that bundles Git and gc kicks
in for them.



* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-30 13:37                           ` Junio C Hamano
@ 2022-03-30 17:30                             ` Taylor Blau
  0 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-03-30 17:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

On Wed, Mar 30, 2022 at 06:37:54AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > The thing I think Jonathan seeks to prevent is older versions of Git
> > gc'ing a repo that has cruft packs. I think I may need you to clarify a
> > little, sorry :-(.
>
> By making controlled rollout of the use of "--cruft" option (and the
> assumption here is that a large organization setting people do not
> manually say "gc --cruft", and they can ship their maintenance
> scripts that may be run via cron or whatever with and without
> "--cruft"), you can control the number of repositories that can
> potentially see older versions of Git running gc on with cruft
> packs.  Those users, for whom it is not their turn to start using
> "--cruft" enabled version of the script, will not have cruft packs,
> so it does not matter if they keep an older version of Git somewhere
> hidden in a hermetic build of an IDE that bundles Git and gc kicks
> in for them.

Ahh, OK. Thanks for explaining: this is what I was pretty sure you
meant, but I wanted to make sure before agreeing to it.

Yes, this solution amounts to: "if you have mixed versions of Git
mutually gc'ing a repository, then use the same rollout method used for
controlling Git itself to guard when to start creating cruft packs".

I would be very eager to hear if this works for Jonathan's case. It
should do the trick, I'd think.

Thanks,
Taylor


* [PATCH v4 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (19 preceding siblings ...)
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
@ 2022-05-18 23:10 ` Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (19 more replies)
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  21 siblings, 20 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Here is another reroll of my series to implement "cruft packs", which is based
on the v2.36 tree, and incorporates feedback from the discussion we had about
mixed-version GCs with cruft packs in [1].

The changes here are limited to:

  - a cautionary note in Documentation/technical/cruft-packs.txt
    describing the potential interaction between pruning GCs across pre-
    and post-cruft pack versions of Git, as discussed towards the bottom
    of [2]

  - updating the `finalize_hashfile()` calls for writing `.mtimes` files
    to indicate that they are `FSYNC_COMPONENT_PACK_METADATA`, since the
    original version of this series predates the fine-grained fsync
    configuration in 2.36.
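
The interaction described in the first item can be sketched with a toy C
model. This is only an illustration under simplifying assumptions (a
single unreachable object, GCs at a fixed interval, the newer and older
versions strictly alternating); the types and function names are
hypothetical and do not come from Git's source.

```c
#include <stdint.h>

/*
 * Toy model of one unreachable object; every name here is
 * hypothetical and none of this is Git's actual code.
 */
enum storage { LOOSE, IN_CRUFT_PACK, PRUNED };

struct toy_object {
	enum storage where;
	int64_t mtime;      /* loose file mtime, or its .mtimes entry */
	int64_t pack_mtime; /* mtime of the containing cruft pack */
};

/* Newer Git: the age check uses the per-object mtime. */
void gc_with_cruft_packs(struct toy_object *o, int64_t now, int64_t grace)
{
	if (o->where == PRUNED)
		return;
	if (now - o->mtime > grace) {
		o->where = PRUNED;
		return;
	}
	/* (Re)write the cruft pack: .mtimes keeps o->mtime, the pack is new. */
	o->where = IN_CRUFT_PACK;
	o->pack_mtime = now;
}

/* Older Git: cannot read .mtimes, so it falls back to the pack's mtime. */
void gc_without_cruft_packs(struct toy_object *o, int64_t now, int64_t grace)
{
	if (o->where == PRUNED)
		return;
	if (o->where == IN_CRUFT_PACK) {
		/* Exploding the pack stamps the loose copy with the pack's mtime. */
		o->mtime = o->pack_mtime;
		o->where = LOOSE;
	}
	if (now - o->mtime > grace)
		o->where = PRUNED;
}

/*
 * Run `rounds` GCs, `interval` time units apart. The newer version
 * runs first; when `mixed` is non-zero, every second GC is run by the
 * older version instead. Returns 1 if the object ended up pruned.
 */
int simulate(int rounds, int64_t interval, int64_t grace, int mixed)
{
	struct toy_object o = { LOOSE, 0, 0 };
	int i;

	for (i = 1; i <= rounds; i++) {
		int64_t now = i * interval;

		if (mixed && i % 2 == 0)
			gc_without_cruft_packs(&o, now, grace);
		else
			gc_with_cruft_packs(&o, now, grace);
	}
	return o.where == PRUNED;
}
```

With a grace period of 14 and GCs every 5 time units,
`simulate(10, 5, 14, 0)` reports the object as pruned, while
`simulate(10, 5, 14, 1)` does not, because each explode-and-repack cycle
resets the object's apparent age.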

As always, a range-diff is below. Thanks in advance for taking another
look!

[1]: https://lore.kernel.org/git/YiZI99yeijQe5Jaq@google.com/
[2]: https://lore.kernel.org/git/YkIm7lnQsUT0JnvS@nand.local/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt | 123 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 181 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1831 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v3:
 1:  784ee7e0ee !  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
    @@ Documentation/technical/cruft-packs.txt (new)
     +It is linkgit:git-gc[1] that is typically responsible for removing expired
     +unreachable objects.
     +
    ++== Caution for mixed-version environments
    ++
    ++Repositories that have cruft packs in them will continue to work with any older
    ++version of Git. Note, however, that previous versions of Git which do not
    ++understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
    ++all of the objects in it. In other words, do not expect older (pre-cruft pack)
    ++versions of Git to interpret or even read the contents of the `.mtimes` file.
    ++
    ++Note that having mixed versions of Git GC-ing the same repository can lead to
    ++unreachable objects never being completely pruned. This can happen under the
    ++following circumstances:
    ++
    ++  - An older version of Git running GC explodes the contents of an existing
    ++    cruft pack loose, using the cruft pack's mtime.
    ++  - A newer version running GC collects those loose objects into a cruft pack,
    ++    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
    ++    cruft pack mtime is "now".
    ++
    ++Repeating this process will lead to unreachable objects not getting pruned as a
    ++result of repeatedly resetting the objects' mtimes to the present time.
    ++
    ++If you are GC-ing repositories in a mixed version environment, consider omitting
    ++the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
    ++leaving the `gc.cruftPacks` configuration unset until all writers understand
    ++cruft packs.
    ++
     +== Alternatives
     +
     +Notable alternatives to this design include:
 2:  1ec754ad1b =  2:  8f9fd21be9 pack-mtimes: support reading .mtimes files
 3:  0f5d6d6492 =  3:  cdb21236e1 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 4:  135a07276b =  4:  1d775f9850 chunk-format.h: extract oid_version()
 5:  0600503856 !  5:  6172861bd9 pack-mtimes: support writing pack .mtimes files
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	if (adjust_shared_perm(mtimes_name) < 0)
     +		die(_("failed to make %s readable"), mtimes_name);
     +
    -+	finalize_hashfile(f, NULL,
    ++	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
     +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
     +
     +	return mtimes_name;
 6:  4780c8437b =  6:  5f9a9a5b7b t/helper: add 'pack-mtimes' test-tool
 7:  33862a07c9 =  7:  b8a38fe2e4 builtin/pack-objects.c: return from create_object_entry()
 8:  22705e4887 !  8:  94fe03cc65 builtin/pack-objects.c: --cruft without expiration
    @@ builtin/pack-objects.c: static int option_parse_unpack_unreachable(const struct
     +	return 0;
     +}
     +
    - int cmd_pack_objects(int argc, const char **argv, const char *prefix)
    - {
    - 	int use_internal_rev_list = 0;
    + struct po_filter_data {
    + 	unsigned have_revs:1;
    + 	struct rev_info revs;
     @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const char *prefix)
      		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
      		  N_("unpack unreachable objects newer than <time>"),
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
     +		read_cruft_objects();
      	} else if (!use_internal_rev_list) {
      		read_object_list_from_stdin();
    - 	} else {
    + 	} else if (pfd.have_revs) {
     
      ## object-file.c ##
     @@ object-file.c: int has_loose_object_nonlocal(const struct object_id *oid)
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
     +int has_loose_object(const struct object_id *);
     +
    - void assert_oid_type(const struct object_id *oid, enum object_type expect);
    - 
    - /*
    + /**
    +  * format_object_header() is a thin wrapper around s xsnprintf() that
    +  * writes the initial "<type> <obj-len>" part of the loose object
     
      ## t/t5329-pack-objects-cruft.sh (new) ##
     @@
 9:  cebb30b667 !  9:  da7273f41f reachable: add options to add_unseen_recent_objects_to_traversal
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## builtin/pack-objects.c ##
    -@@ builtin/pack-objects.c: static void get_object_list(int ac, const char **av)
    +@@ builtin/pack-objects.c: static void get_object_list(struct rev_info *revs, int ac, const char **av)
      	if (unpack_unreachable_expiration) {
    - 		revs.ignore_missing_links = 1;
    - 		if (add_unseen_recent_objects_to_traversal(&revs,
    + 		revs->ignore_missing_links = 1;
    + 		if (add_unseen_recent_objects_to_traversal(revs,
     -				unpack_unreachable_expiration))
     +				unpack_unreachable_expiration, NULL, 0))
      			die(_("unable to add recent objects"));
    - 		if (prepare_revision_walk(&revs))
    + 		if (prepare_revision_walk(revs))
      			die(_("revision walk setup failed"));
     
      ## reachable.c ##
10:  fa4de8859d = 10:  58fecd1747 reachable: report precise timestamps from objects in cruft packs
11:  92318f8700 = 11:  1740b8ef01 builtin/pack-objects.c: --cruft with expiration
12:  1e94b33cb4 ! 12:  5992a72cbf builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c
      static int pack_kept_objects = -1;
      static int write_bitmaps = -1;
      static int use_delta_islands;
    + static int run_update_server_info = 1;
      static char *packdir, *packtmp_name, *packtmp;
     +static char *cruft_expiration;
      
      static const char *const git_repack_usage[] = {
      	N_("git repack [<options>]"),
    -@@ builtin/repack.c: static int repack_config(const char *var, const char *value, void *cb)
    - 		use_delta_islands = git_config_bool(var, value);
    - 		return 0;
    - 	}
    -+
    - 	return git_default_config(var, value, cb);
    - }
    - 
     @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_args *args,
      		die(_("could not finish pack-objects to repack promisor objects"));
      }
13:  9cfcd123bd ! 13:  1b241f8f91 builtin/repack.c: allow configuring cruft pack generation
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## Documentation/config/repack.txt ##
    -@@ Documentation/config/repack.txt: repack.writeBitmaps::
    - 	space and extra time spent on the initial repack.  This has
    - 	no effect if multiple packfiles are created.
    - 	Defaults to true on bare repos, false otherwise.
    +@@ Documentation/config/repack.txt: repack.updateServerInfo::
    + 	If set to false, linkgit:git-repack[1] will not run
    + 	linkgit:git-update-server-info[1]. Defaults to true. Can be overridden
    + 	when true by the `-n` option of linkgit:git-repack[1].
     +
     +repack.cruftWindow::
     +repack.cruftWindowMemory::
    @@ builtin/repack.c: static const char incremental_bitmap_conflict_error[] = N_(
      		delta_base_offset = git_config_bool(var, value);
      		return 0;
     @@ builtin/repack.c: static int repack_config(const char *var, const char *value, void *cb)
    + 		run_update_server_info = git_config_bool(var, value);
      		return 0;
      	}
    - 
     +	if (!strcmp(var, "repack.cruftwindow"))
     +		return git_config_string(&cruft_po_args->window, var, value);
     +	if (!strcmp(var, "repack.cruftwindowmemory"))
    @@ builtin/repack.c: static int repack_config(const char *var, const char *value, v
     +		return git_config_string(&cruft_po_args->depth, var, value);
     +	if (!strcmp(var, "repack.cruftthreads"))
     +		return git_config_string(&cruft_po_args->threads, var, value);
    -+
      	return git_default_config(var, value, cb);
      }
      
    @@ builtin/repack.c: static void remove_redundant_pack(const char *dir_name, const
      				 const struct pack_objects_args *args)
      {
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    + 	int keep_unreachable = 0;
      	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
    - 	int no_update_server_info = 0;
      	struct pack_objects_args po_args = {NULL};
     +	struct pack_objects_args cruft_po_args = {NULL};
      	int geometric_factor = 0;
14:  1a58807df0 = 14:  ffae78852c builtin/repack.c: use named flags for existing_packs
15:  ed05cf536b = 15:  0743e373ba builtin/repack.c: add cruft packs to MIDX during geometric repack
16:  1d5f334138 = 16:  9f7e0acac6 builtin/gc.c: conditionally avoid pruning objects via loose
17:  f74b425872 = 17:  07fa9d4b47 sha1-file.c: don't freshen cruft packs
-- 
2.36.1.94.gb0d54bedca


* [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-19 14:04     ` Junio C Hamano
  2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (18 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |   1 +
 Documentation/technical/cruft-packs.txt | 123 ++++++++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index adb2f1b50a..2faffb52ab 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -94,6 +94,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..c0f583cd48
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,123 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, using
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which is (a) not contained in a
+     kept-pack, and (b) whose mtime is within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Caution for mixed-version environments
+
+Repositories that have cruft packs in them will continue to work with any older
+version of Git. Note, however, that previous versions of Git which do not
+understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
+all of the objects in it. In other words, do not expect older (pre-cruft pack)
+versions of Git to interpret or even read the contents of the `.mtimes` file.
+
+Note that having mixed versions of Git GC-ing the same repository can lead to
+unreachable objects never being completely pruned. This can happen under the
+following circumstances:
+
+  - An older version of Git running GC explodes the contents of an existing
+    cruft pack loose, using the cruft pack's mtime.
+  - A newer version running GC collects those loose objects into a cruft pack,
+    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
+    cruft pack mtime is "now".
+
+Repeating this process will lead to unreachable objects not getting pruned as a
+result of repeatedly resetting the objects' mtimes to the present time.
+
+If you are GC-ing repositories in a mixed version environment, consider omitting
+the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
+leaving the `gc.cruftPacks` configuration unset until all writers understand
+cruft packs.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-19 10:40     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:10   ` [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (17 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
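
A note for readers of the format description, not part of the patch
itself: the layout (12-byte header, one 4-byte big-endian mtime per
object in index order, two trailing checksums) can be sketched on an
in-memory buffer as below. This is only an illustration under stated
assumptions. `read_be32()` and `mtimes_lookup()` are hypothetical names
that do not exist in Git, the trailer checksums are not verified, and
the size arithmetic does not guard against overflow the way st_mult()
does.

```c
#include <stddef.h>
#include <stdint.h>

#define MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define MTIMES_HEADER_WORDS 3        /* signature, version, hash id */

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: store the mtime of the object at index
 * position "pos" in *out and return 0, or return -1 if "buf" does not
 * look like a version 1 .mtimes file covering "num_objects" objects.
 * "hash_len" is the raw hash size (20 for SHA-1, 32 for SHA-256).
 */
static int mtimes_lookup(const unsigned char *buf, size_t len,
			 uint32_t num_objects, size_t hash_len,
			 uint32_t pos, uint32_t *out)
{
	size_t header = MTIMES_HEADER_WORDS * 4;
	size_t expect = header + (size_t)num_objects * 4 + 2 * hash_len;
	uint32_t hash_id;

	/* The size must match exactly: header + table + two checksums. */
	if (len != expect || pos >= num_objects)
		return -1;
	if (read_be32(buf) != MTIMES_SIGNATURE)
		return -1;
	if (read_be32(buf + 4) != 1)
		return -1;
	hash_id = read_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;

	/* Entry "pos" sits right after the three header words. */
	*out = read_be32(buf + header + (size_t)pos * 4);
	return 0;
}
```

The "+ 3" offset in nth_packed_mtime() corresponds to the "header"
term here, counted in 4-byte words rather than bytes.
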
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 61aadf3ce8..a299580b7c 100644
--- a/Makefile
+++ b/Makefile
@@ -993,6 +993,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index d1a563d5b6..e7a3920c6d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -217,6 +217,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 53996018c1..2c4671ed7a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..46ad584af1
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,126 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (16 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 014dcd4bc9..6ac927047c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1262,7 +1262,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c..e988a388b6 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index 51812cb129..a2adc565f4 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -484,6 +484,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 04/17] chunk-format.h: extract oid_version()
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-05-18 23:10   ` [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 11:44     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:11   ` [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (15 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
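
A note on the widening mentioned above, not part of the patch itself:
the standalone sketch below shows a one-byte version value being passed
to a writer that takes a 4-byte value, as the .rev caller does. All
names are stand-ins invented for this illustration (`enum hash_kind`,
`oid_version_sketch()`, `put_be32_sketch()`). Unlike the real
oid_version(), the sketch returns 0 for an unknown algorithm instead of
calling die().

```c
#include <stdint.h>

/* Stand-ins for Git's hash algorithm identifiers (hypothetical). */
enum hash_kind {
	HASH_KIND_UNKNOWN = 0,
	HASH_KIND_SHA1,
	HASH_KIND_SHA256
};

/* Map a hash algorithm to its on-disk version byte (0 if unknown). */
static uint8_t oid_version_sketch(enum hash_kind kind)
{
	switch (kind) {
	case HASH_KIND_SHA1:
		return 1;
	case HASH_KIND_SHA256:
		return 2;
	default:
		return 0;
	}
}

/* Store "v" as four bytes in network (big-endian) order. */
static void put_be32_sketch(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}
```

Calling put_be32_sketch(out, oid_version_sketch(HASH_KIND_SHA256))
converts the uint8_t result to uint32_t implicitly, so the one-byte
and four-byte callers can share a single definition.
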
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 06107beedc..066d82ed6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1924,7 +1912,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 3db0e47735..c617c51cd0 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index a2adc565f4..27b171e440 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (14 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing an oidmap that contains the
timestamps corresponding to each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
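
A note on ordering, not part of the patch itself: mtimes are kept in
pack order while packing but written out in index order. The sketch
below serializes the header and the mtime table into a caller-supplied
buffer to show that remapping. It is a simplified illustration: the
names `store_be32()` and `mtimes_serialize()` and the
`pack_pos_by_index` array are assumptions made for this example, and
the trailer (pack checksum plus the file's own checksum) is left out
because it needs a hash implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Store "v" as four bytes in network (big-endian) order. */
static void store_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Hypothetical helper: write the .mtimes header and table to "out",
 * which must have room for 12 + 4 * nr bytes. Returns the number of
 * bytes written.
 *
 * "mtime_by_pack_pos" holds one mtime per object in pack order, and
 * "pack_pos_by_index[i]" is the pack position of the i-th object in
 * index (lexicographic) order.
 */
static size_t mtimes_serialize(unsigned char *out, uint32_t hash_id,
			       const uint32_t *mtime_by_pack_pos,
			       const uint32_t *pack_pos_by_index,
			       uint32_t nr)
{
	size_t off = 0;
	uint32_t i;

	store_be32(out + off, 0x4d544d45u); /* "MTME" */
	off += 4;
	store_be32(out + off, 1); /* format version */
	off += 4;
	store_be32(out + off, hash_id); /* 1 = SHA-1, 2 = SHA-256 */
	off += 4;

	/* Emit the table in index order, looking up by pack position. */
	for (i = 0; i < nr; i++) {
		store_be32(out + off, mtime_by_pack_pos[pack_pos_by_index[i]]);
		off += 4;
	}
	return off;
}
```

In the patch, the same lookup is done by oe_cruft_mtime(), which
indexes the cruft_mtime array by each entry's offset within
pack->objects.
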
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index 27b171e440..23c0342018 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -277,6 +281,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -479,6 +547,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -491,9 +560,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply	[flat|nested] 200+ messages in thread

* [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (13 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index a299580b7c..0b6eab0453 100644
--- a/Makefile
+++ b/Makefile
@@ -738,6 +738,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 0424f7adf5..d2eacd302d 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -48,6 +48,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c876e8246f..960cc27ef7 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -38,6 +38,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (12 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Rather than
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().
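
As a rough illustration of the pattern only (this is not the
pack-objects code), a constructor that hands back the entry it just
created lets the caller adjust it without a second lookup. The
`struct entry`, `struct table`, `table_add()` and
`table_add_with_mtime()` names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for 'struct object_entry' and the packing list. */
struct entry {
	uint32_t hash;
	uint32_t mtime;
};

struct table {
	struct entry entries[16];
	size_t nr;
};

/*
 * Append a new entry and return a pointer to it, or NULL if the
 * (fixed-size, for brevity) table is full.
 */
static struct entry *table_add(struct table *t, uint32_t hash)
{
	struct entry *e;

	if (t->nr >= sizeof(t->entries) / sizeof(t->entries[0]))
		return NULL;

	e = &t->entries[t->nr++];
	e->hash = hash;
	e->mtime = 0;
	return e;
}

/* The caller can modify the new entry right away, with no re-lookup. */
static struct entry *table_add_with_mtime(struct table *t, uint32_t hash,
					  uint32_t mtime)
{
	struct entry *e = table_add(t, hash);

	if (e)
		e->mtime = mtime;
	return e;
}
```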

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ac927047c..c6d16872ee 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1516,13 +1516,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1539,6 +1539,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 10:04     ` Junio C Hamano
  2022-05-18 23:11   ` [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (11 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
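
The "noting and updating their mtimes" step boils down to keeping the
newest mtime seen for each object, which is what the comparison at the
end of add_cruft_object_entry() in the diff below does. A minimal
sketch, using a hypothetical `struct cruft_entry` and
`note_cruft_mtime()` rather than the real packing-list accessors:

```c
#include <stdint.h>

/* Hypothetical per-object record; only the mtime matters here. */
struct cruft_entry {
	uint32_t mtime;
};

/* Record 'mtime' for the object only if it is newer than what we have. */
static void note_cruft_mtime(struct cruft_entry *e, uint32_t mtime)
{
	if (mtime > e->mtime)
		e->mtime = mtime;
}
```

So when the same object shows up in several places (loose, and in one
or more packs), the copy with the most recent mtime wins.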

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain in the repository are marked
    as kept in-core. Packs which are going to be removed (we'll call
    these the redundant ones) are not.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
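
The selection can be sketched roughly as follows. This follows the
behavior of read_cruft_objects() in the diff below, where packs listed
without a `-` prefix (and packs not listed at all) end up kept
in-core, and packs listed with a `-` prefix do not. The `struct pack`
type and the helper names are hypothetical simplifications, not the
real data structures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for 'struct packed_git'. */
struct pack {
	const char *name;
	int keep_in_core;
};

/*
 * Does this input line name pack 'p'? A leading '-' marks the pack
 * as about to be removed; that is reported through 'discard'.
 */
static int names_pack(const char *line, const struct pack *p, int *discard)
{
	*discard = (*line == '-');
	return !strcmp(line + *discard, p->name);
}

/*
 * Mark every known pack as kept in-core unless the caller listed it
 * as pending deletion. Packs the caller never mentioned stay kept.
 */
static void classify_packs(struct pack *packs, size_t nr_packs,
			   const char **lines, size_t nr_lines)
{
	size_t i, j;

	for (i = 0; i < nr_packs; i++) {
		packs[i].keep_in_core = 1; /* surviving or unknown */
		for (j = 0; j < nr_lines; j++) {
			int discard;

			if (!names_pack(lines[j], &packs[i], &discard))
				continue;
			packs[i].keep_in_core = !discard;
			break;
		}
	}
}

/*
 * An object goes into the cruft pack only if none of the packs
 * holding it is kept in-core. A loose-only object has no holders.
 */
static int want_in_cruft_pack(const struct pack **holders, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (holders[i]->keep_in_core)
			return 0;
	return 1;
}
```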

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5329-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5329-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c6d16872ee..9cf89be673 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1260,6 +1263,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3397,6 +3403,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3529,7 +3664,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3553,7 +3705,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -3959,6 +4137,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4085,7 +4267,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4112,6 +4294,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4168,7 +4359,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4176,6 +4367,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else if (pfd.have_revs) {
diff --git a/object-file.c b/object-file.c
index 5ffbf3d4fd..ff0cffe68e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -997,7 +997,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 2c4671ed7a..c41609e8db 100644
--- a/object-store.h
+++ b/object-store.h
@@ -330,6 +330,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 /**
  * format_object_header() is a thin wrapper around s xsnprintf() that
  * writes the initial "<type> <obj-len>" part of the loose object
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5329-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (10 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

add_unseen_recent_objects_to_traversal() behaves very similarly to what
we will need in pack-objects in order to implement cruft packs with
expiration, but it is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
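
The shape of such an interface can be sketched as below. This is a
simplified model, not the reachable.c code: `struct recent_object`,
`report_fn`, `for_each_recent_object()` and `remember_newest()` are
hypothetical, the `payload` argument is an addition for the sake of
the example (the callback in this patch takes no such argument), and
treating objects at or before the cutoff as not recent is an
assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for 'struct packed_git'. */
struct pack {
	int keep_in_core;
};

struct recent_object {
	struct pack *pack;	/* NULL for a loose object */
	uint64_t offset;	/* offset within 'pack', 0 if loose */
	uint32_t mtime;
};

/* Callback invoked once per reported object. */
typedef void report_fn(const struct recent_object *obj, void *payload);

/*
 * Report every object newer than 'cutoff', optionally skipping
 * objects stored in packs that are kept in-core. Returns the number
 * of objects that passed both checks.
 */
static size_t for_each_recent_object(const struct recent_object *objs,
				     size_t nr, uint32_t cutoff,
				     report_fn *cb, void *payload,
				     int skip_kept)
{
	size_t i, reported = 0;

	for (i = 0; i < nr; i++) {
		if (skip_kept && objs[i].pack && objs[i].pack->keep_in_core)
			continue;
		if (objs[i].mtime <= cutoff)
			continue;
		if (cb)
			cb(&objs[i], payload);
		reported++;
	}
	return reported;
}

/* Example callback: remember the newest mtime that was reported. */
static void remember_newest(const struct recent_object *obj, void *payload)
{
	uint32_t *newest = payload;

	if (obj->mtime > *newest)
		*newest = obj->mtime;
}
```

Passing a NULL callback keeps the old behavior for callers which only
care about the traversal, which mirrors how the existing callers in
this patch pass `NULL, 0`.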

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9cf89be673..3b8bf6a3dd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3957,7 +3957,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index b9f4ad886e..d4507c4270 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (9 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here.
    load_pack_mtimes() exits early if the .mtimes file has already been
    opened and parsed, so only the first call is slow.

  - Checking the is_cruft bit requires no extra work on the caller's
    part, since it is set up for us automatically as a side-effect of
    calling add_packed_git() (just like the 'pack_keep' and
    'pack_promisor' bits).
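
The lookup described above can be sketched in isolation as follows. This
is only a toy model, not the code in reachable.c: 'struct toy_pack',
toy_load_mtimes() and toy_object_mtime() are hypothetical names, and the
"file" is just an in-memory array. It shows the two properties called out
above: the table is loaded on first use only, and non-cruft packs fall
back to the mtime of the pack itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct packed_git'; only what the sketch needs. */
struct toy_pack {
	uint32_t pack_mtime;             /* mtime of the pack file itself */
	int is_cruft;                    /* set when the pack has a .mtimes file */
	const uint32_t *mtimes_on_disk;  /* pretend contents of the .mtimes file */
	const uint32_t *mtimes;          /* NULL until the table has been loaded */
	size_t num_objects;
	int load_count;                  /* how often the slow path actually ran */
};

/* Load the table once; later calls return early. */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (!p->is_cruft)
		return -1;
	if (p->mtimes)
		return 0; /* already opened and parsed */
	if (!p->mtimes_on_disk)
		return -1;
	p->mtimes = p->mtimes_on_disk;
	p->load_count++;
	return 0;
}

/* Report the per-object mtime for cruft packs, the pack mtime otherwise. */
static int toy_object_mtime(struct toy_pack *p, size_t pos, uint32_t *out)
{
	assert(pos < p->num_objects);

	*out = p->pack_mtime;
	if (p->is_cruft) {
		if (toy_load_mtimes(p) < 0)
			return -1;
		*out = p->mtimes[pos];
	}
	return 0;
}
```

Calling toy_object_mtime() repeatedly on the same cruft pack only pays
for the load once, which is the reason the real call can afford to be
lazy.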

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index d4507c4270..aba63ebeb3 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (8 subsequent siblings)
  19 siblings, 0 replies; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever
another recent (i.e., non-prunable) unreachable object reaches them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In most cases, any object we visit in this traversal will already be in
our packing list. But we will sometimes encounter reachable-from-recent
cruft objects, which we want to retain even if they have aged out of the
grace period.
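
As a rough illustration of the three steps above, here is a toy model of
the traversal over a small in-memory object graph. Everything in it
('struct toy_obj', toy_add(), toy_walk(), toy_collect_cruft()) is a
hypothetical stand-in rather than the code in this patch. It also
assumes that "recent" means an mtime strictly newer than the expiration,
and that rescued objects are recorded with the expiration value itself,
as show_cruft_object() does.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical toy object; not git's 'struct object'. */
struct toy_obj {
	long mtime;                      /* last known modification time */
	int in_kept_pack;                /* lives in a pack that survives the repack */
	int nr_children;
	int children[TOY_MAX_CHILDREN];  /* indices of the objects this one reaches */
	int packed;                      /* output: is it in the packing list? */
	long packed_mtime;               /* output: mtime recorded for it */
	int visited;
};

/* Adding an object which is already in the packing list is a noop. */
static void toy_add(struct toy_obj *o, long mtime)
{
	if (o->packed)
		return;
	o->packed = 1;
	o->packed_mtime = mtime;
}

/* Walk from a tip, stopping at kept objects and adding everything else. */
static void toy_walk(struct toy_obj *objs, int idx, long expiration)
{
	struct toy_obj *o = &objs[idx];
	int i;

	if (o->in_kept_pack || o->visited)
		return;
	o->visited = 1;
	toy_add(o, expiration); /* rescued objects get the expiration value */
	for (i = 0; i < o->nr_children; i++)
		toy_walk(objs, o->children[i], expiration);
}

static void toy_collect_cruft(struct toy_obj *objs, int nr, long expiration)
{
	int i;

	assert(objs != NULL && nr >= 0);

	/* Step 1: only objects within the grace period enter the list. */
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > expiration)
			toy_add(&objs[i], objs[i].mtime);

	/* Steps 2 and 3: those same objects are the tips of the traversal. */
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > expiration)
			toy_walk(objs, i, expiration);
}
```

With a recent tag pointing at an old commit, the old commit and its tree
and blobs end up in the packing list even though none of them would pass
step 1 on their own. An old object that nothing recent reaches is left
out.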

The most subtle point of this process is that we don't actually need to
update the rescued object's mtime. Even though we will write an .mtimes
file with a value for that object which is older than the expiration
window, the object will continue to survive cruft repacks so long as any
object which reaches it hasn't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.
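
To make that argument concrete, consider a simplified model in which the
cruft objects form a single chain and object i reaches every object
after it. The function below is a hypothetical sketch, not anything in
this series; it decides which objects one repack would retain without
ever modifying the recorded mtimes, and again assumes "recent" means
strictly newer than the expiration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: mtime[i] is the recorded mtime of object i, and object i
 * reaches objects i+1 .. nr-1. Sets retained[i] to 1 for every object
 * one cruft repack would keep and returns how many were kept. The
 * mtimes are read-only, so a rescued object keeps its stale value.
 */
static int toy_repack_chain(const long *mtime, int nr, long expiration,
			    int *retained)
{
	int i, reachable_from_recent = 0, count = 0;

	assert(mtime != NULL && retained != NULL && nr >= 0);

	for (i = 0; i < nr; i++) {
		if (mtime[i] > expiration)
			reachable_from_recent = 1; /* a recent tip starts here */
		retained[i] = reachable_from_recent;
		count += reachable_from_recent;
	}
	return count;
}
```

Running this twice with the same expiration gives the same answer, even
though the rescued objects still carry their old mtimes. Once the tip
itself falls outside the grace period, the whole chain is dropped
together.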

Finally, stopping early once an object is found in a kept pack is safe
because the kept packs ordinarily represent the packs which will survive
the repack. If it were _not_ safe to halt the traversal early, some
ancestor object would have to be missing from those packs, which implies
repository corruption (i.e., the complete set of reachable objects isn't
present).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 reachable.h                   |   4 +-
 t/t5329-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3b8bf6a3dd..8decc9dc0c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3447,6 +3447,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3472,6 +3510,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3523,7 +3605,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/reachable.h b/reachable.h
index b776761baa..020a887b99 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,10 +1,10 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-#include "object.h"
-
 struct progress;
 struct rev_info;
+struct object;
+struct packed_git;
 
 typedef void report_recent_object_fn(const struct object *, struct packed_git *,
 				     off_t, time_t);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 11:29     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:11   ` [PATCH v4 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (7 subsequent siblings)
  19 siblings, 1 reply; 200+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 105 +++++++++++-
 t/t5329-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 319 insertions(+), 6 deletions(-)