git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/17] cruft packs
@ 2021-11-29 22:25 Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                   ` (21 more replies)
  0 siblings, 22 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This series implements "cruft packs", a pack which stores accumulated
unreachable objects, along with a new ".mtimes" file which tracks each
object's last known modification time.

This idea was discussed recently-ish in [1], but the most thorough
discussion I could find is in [2]. The approach settled on in this
series is laid out in detail by the first patch.

For the uninitiated, cruft packs enable repositories to safely run
`git repack -Ad` by storing unreachable objects which have not yet
"aged out" in a separate pack. This prevents repositories from storing
a potentially large number of such objects as loose.

This series is structured as follows:

  - The first patch describes the technical details of cruft packs.
  - The next five patches implement reading and writing the new
    `.mtimes` format.
  - The next five patches implement `git pack-objects --cruft`. The
    first four implement this mode when no grace period is specified,
    and the fifth patch adds support for the grace period.
  - The next five patches integrate cruft packs with `git repack`,
    including the new-ish `--geometric` mode.
  - The final patch handles object freshening for objects stored in a
    cruft pack.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
[2]: https://lore.kernel.org/git/E1SdhJ9-0006B1-6p@tytso-glaptop.cam.corp.google.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  23 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  95 ++++
 Documentation/technical/pack-format.txt |  22 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 306 ++++++++++-
 builtin/repack.c                        | 189 ++++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 139 +++++
 pack-mtimes.h                           |  16 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  20 +
 pack-write.c                            |  90 +++-
 pack.h                                  |   4 +
 packfile.c                              |  18 +-
 packfile.h                              |   1 +
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  53 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5327-pack-objects-cruft.sh           | 685 ++++++++++++++++++++++++
 33 files changed, 1757 insertions(+), 102 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5327-pack-objects-cruft.sh

-- 
2.34.1.25.gb3157a20e6


* [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 14:33   ` Derrick Stolee
  2021-12-04 22:20   ` Elijah Newren
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 95 +++++++++++++++++++++++++
 2 files changed, 96 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..bb54cce1b1
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,95 @@
+= Cruft packs
+
+Cruft packs offer an alternative to Git's traditional mechanism of removing
+unreachable objects. This document provides an overview of Git's pruning
+mechanism, and how cruft packs can be used instead to accomplish the same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. Worse, given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Because we
+can never pack those objects, these repositories often take up a large amount of
+disk space: the objects can only be zlib-compressed, not stored in delta
+chains.
+
+== Cruft packs
+
+Cruft packs are designed to eliminate the need for storing unreachable objects
+in a loose state by including the per-object mtimes in a separate file alongside
+a single pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, using
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking as a traversal tip any object which (a) is
+     not contained in a kept-pack, and (b) has an mtime within the grace
+     period.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Whether cruft packs should be incremental or not.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Incremental cruft packs (i.e., where each time a repository is repacked a new
+cruft pack is generated containing only the unreachable objects introduced since
+the last time a cruft pack was written) are significantly more complicated to
+construct, and so aren't pursued here. The obvious drawback to the current
+implementation is that the entire cruft pack must be re-written from scratch.
-- 
2.34.1.25.gb3157a20e6



* [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:06   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.
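
To make the layout concrete, here is a minimal sketch of how a reader
might validate an in-memory `.mtimes` buffer and look up one entry. It
is not the code from this patch: every `sketch_`-prefixed name is
hypothetical, the buffer is assumed to be fully loaded already, and the
two trailer checksums are only accounted for in the size check, not
verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_MTIMES_HEADER_WORDS 3        /* signature, version, hash id */

/* Decode a 4-byte big-endian (network order) integer. */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Expected file size: the header, one 4-byte mtime per object, and a
 * trailer holding two checksums of hash_len bytes each.
 */
static size_t sketch_mtimes_size(uint32_t num_objects, size_t hash_len)
{
	return 4 * SKETCH_MTIMES_HEADER_WORDS +
	       4 * (size_t)num_objects + 2 * hash_len;
}

/*
 * Store the mtime of the object at index position `pos` in *mtime and
 * return 0, or return -1 if the buffer is malformed or `pos` is out of
 * range. The hash id word and the trailer contents are not checked.
 */
static int sketch_read_mtime(const unsigned char *buf, size_t len,
			     uint32_t num_objects, size_t hash_len,
			     uint32_t pos, uint32_t *mtime)
{
	assert(buf && mtime);

	if (len != sketch_mtimes_size(num_objects, hash_len))
		return -1;
	if (sketch_get_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (sketch_get_be32(buf + 4) != 1) /* only version 1 is defined */
		return -1;
	if (pos >= num_objects)
		return -1;

	*mtime = sketch_get_be32(buf + 4 * (SKETCH_MTIMES_HEADER_WORDS + (size_t)pos));
	return 0;
}
```

For a SHA-1 pack with two objects, the expected size is 12 + 8 + 40 = 60
bytes, and position 0 refers to the object with the lowest oid.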

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  22 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 139 ++++++++++++++++++++++++
 pack-mtimes.h                           |  16 +++
 packfile.c                              |  18 ++-
 packfile.h                              |   1 +
 8 files changed, 200 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8d2f42f29e..61d8d960e7 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,28 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of mtimes (one per packed object, num_objects in total, each
+    a 4-byte unsigned integer in network order), in the same order as
+    objects appear in the index file (e.g., the first entry in the mtime
+    table corresponds to the object with the lowest lexically-sorted
+    oid). The mtimes count standard epoch seconds.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 12be39ac49..efd5e00717 100644
--- a/Makefile
+++ b/Makefile
@@ -949,6 +949,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index 0b2d1e5d82..acbb7b8c3b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 952efb6a4b..d87481f101 100644
--- a/object-store.h
+++ b/object-store.h
@@ -89,12 +89,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..4c7c00fa67
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,139 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+int pack_has_mtimes(struct packed_git *p)
+{
+	struct stat st;
+	char *fname = pack_mtimes_filename(p);
+	int ret = 1;
+
+	if (stat(fname, &st) < 0) {
+		if (errno != ENOENT)
+			die_errno(_("could not stat %s"), fname);
+		ret = 0; /* no .mtimes file; fall through so fname is freed */
+	}
+	free(fname);
+	return ret;
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (ntohl(*++hdr) != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+	hdr++;
+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, ntohl(*hdr));
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+	if (ret)
+		goto cleanup;
+
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..ac4247bb5e
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,16 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int pack_has_mtimes(struct packed_git *p);
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 89402cfc69..ae79ac644e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -333,12 +333,21 @@ void close_pack_revindex(struct packed_git *p) {
 	p->revindex_data = NULL;
 }
 
+void close_pack_mtimes(struct packed_git *p) {
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -362,7 +371,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -717,6 +726,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -868,7 +881,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
diff --git a/packfile.h b/packfile.h
index 186146779d..32201d8af7 100644
--- a/packfile.h
+++ b/packfile.h
@@ -91,6 +91,7 @@ uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
 void close_pack_revindex(struct packed_git *);
+void close_pack_mtimes(struct packed_git *p);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
-- 
2.34.1.25.gb3157a20e6



* [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).
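
As an illustration of that trade-off, the following is a rough sketch
of a side array that lives next to the object list and is indexed by
each entry's position, so the entry struct itself does not grow. The
`sketch_` types and functions are hypothetical stand-ins, and allocating
the array lazily on first write is an assumption made for the example,
not necessarily what this series does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_entry {
	unsigned char oid[20]; /* stand-in for the real per-object fields */
};

struct sketch_packing_data {
	struct sketch_entry *objects;
	uint32_t nr_objects;
	uint32_t *cruft_mtime; /* NULL unless a cruft pack is being written */
};

/* Look up an entry's mtime; packs that never set one report 0. */
static uint32_t sketch_cruft_mtime(const struct sketch_packing_data *pack,
				   const struct sketch_entry *e)
{
	assert(e >= pack->objects && e < pack->objects + pack->nr_objects);

	if (!pack->cruft_mtime)
		return 0;
	return pack->cruft_mtime[e - pack->objects];
}

/* Record an entry's mtime, allocating the side array on first use. */
static int sketch_set_cruft_mtime(struct sketch_packing_data *pack,
				  const struct sketch_entry *e, uint32_t mtime)
{
	assert(e >= pack->objects && e < pack->objects + pack->nr_objects);

	if (!pack->cruft_mtime) {
		pack->cruft_mtime = calloc(pack->nr_objects, sizeof(uint32_t));
		if (!pack->cruft_mtime)
			return -1;
	}
	pack->cruft_mtime[e - pack->objects] = mtime;
	return 0;
}
```

Operations that never write a cruft pack leave `cruft_mtime` as NULL and
pay only for one pointer in the containing structure.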

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1a3dd445f8..bf45ffbc57 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.34.1.25.gb3157a20e6



* [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (2 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:22   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.
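
The widening mentioned above can be sketched in isolation. In this
hypothetical example (the enum and the `sketch_` helpers are invented
for illustration, and unknown algorithms return 0 where the real
function calls die()), the same one-byte identifier is written once as a
single byte and once as a 4-byte big-endian word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_hash_algo {
	SKETCH_HASH_UNKNOWN,
	SKETCH_HASH_SHA1,
	SKETCH_HASH_SHA256
};

/* Map a hash algorithm to its on-disk identifier (0 if unknown). */
static uint8_t sketch_oid_version(enum sketch_hash_algo algo)
{
	switch (algo) {
	case SKETCH_HASH_SHA1:
		return 1;
	case SKETCH_HASH_SHA256:
		return 2;
	default:
		return 0;
	}
}

/* Single-byte field, as in the commit-graph and MIDX headers. */
static size_t sketch_put_u8(unsigned char *out, uint8_t v)
{
	assert(out);
	out[0] = v;
	return 1;
}

/* 4-byte network-order field, as in the .rev header. */
static size_t sketch_put_be32(unsigned char *out, uint32_t v)
{
	assert(out);
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return 4;
}
```

Passing the `uint8_t` result to `sketch_put_be32()` promotes it to
`uint32_t` implicitly, so the 4-byte caller needs no cast.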

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 2706683acf..1f08152a35 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1908,7 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 8433086ac1..756ae6a206 100644
--- a/midx.c
+++ b/midx.c
@@ -40,18 +40,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -131,9 +119,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -413,7 +401,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.34.1.25.gb3157a20e6



* [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (3 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-02 15:36   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Now that the `.mtimes` format is defined, teach the pack-write API to
conditionally write a `.mtimes` file along with a pack. Callers opt in
by setting the new `WRITE_MTIMES` flag and passing a `struct
packing_data` whose `cruft_mtime` array holds the timestamp of each
object in the pack.
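
For orientation, the layout written by the new helpers in this patch can
be sketched as a standalone serializer: three big-endian 32-bit header
words, one big-endian 32-bit mtime per object (in the order of the
`objects` array), then the pack checksum as a trailer. This is only an
illustration under assumptions: `sketch_put_be32()`,
`sketch_write_mtimes()`, and the `SKETCH_*` constants are hypothetical
names, the signature value is a placeholder rather than the real
`MTIMES_SIGNATURE`, and anything `finalize_hashfile()` appends afterwards
is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values; the real ones live in pack-mtimes.h. */
#define SKETCH_SIGNATURE 0x11223344u
#define SKETCH_VERSION 1u

/* Store a 32-bit value in network (big-endian) byte order. */
static size_t sketch_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return 4;
}

/*
 * Serialize header, per-object mtimes, and the pack-checksum trailer
 * into `out`, which must hold at least 12 + 4 * nr + hash_len bytes.
 * `oid_version` is 1 for SHA-1 and 2 for SHA-256.
 * Returns the number of bytes written.
 */
static size_t sketch_write_mtimes(unsigned char *out, uint32_t oid_version,
				  const uint32_t *mtimes, uint32_t nr,
				  const unsigned char *pack_hash,
				  size_t hash_len)
{
	size_t pos = 0;
	uint32_t i;

	assert(out && pack_hash && (mtimes || !nr));

	pos += sketch_put_be32(out + pos, SKETCH_SIGNATURE);
	pos += sketch_put_be32(out + pos, SKETCH_VERSION);
	pos += sketch_put_be32(out + pos, oid_version);
	for (i = 0; i < nr; i++)
		pos += sketch_put_be32(out + pos, mtimes[i]);
	memcpy(out + pos, pack_hash, hash_len);
	return pos + hash_len;
}
```

With two objects and a 20-byte hash, the sketch produces 12 + 8 + 20 = 40
bytes, which is a quick way to sanity-check the size of a real `.mtimes`
file before its final checksum.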

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-objects.c |  6 ++++
 pack-objects.h | 20 ++++++++++++++
 pack-write.c   | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 101 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..f17119de26 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,9 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/* cruft packs */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +292,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..8c3efda2c3 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,65 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +541,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +554,19 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+		if (adjust_shared_perm(mtimes_tmp_name))
+			die_errno("unable to make temporary mtimes file readable");
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.34.1.25.gb3157a20e6



* [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (4 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-06 21:16   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.
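
The tool locates its pack by turning each pack's basename into the
matching `.mtimes` name and comparing that against its argument. A
minimal sketch of that name mapping follows, using only the C standard
library instead of strbuf; `sketch_mtimes_name()` is a hypothetical
helper, not part of this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive "<base>.mtimes" from a pack path such as
 * ".git/objects/pack/pack-1234.pack". Returns 0 on success, or -1 if the
 * name does not end in ".pack" or the result does not fit in `out`.
 */
static int sketch_mtimes_name(const char *pack_path, char *out, size_t outlen)
{
	const char *base;
	size_t len;
	int n;

	assert(pack_path && out);

	/* Keep only the final path component. */
	base = strrchr(pack_path, '/');
	base = base ? base + 1 : pack_path;

	len = strlen(base);
	if (len < 5 || strcmp(base + len - 5, ".pack"))
		return -1;

	/* Copy everything before ".pack", then append ".mtimes". */
	n = snprintf(out, outlen, "%.*s.mtimes", (int)(len - 5), base);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```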

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 53 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 56 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index efd5e00717..a7382cbfc1 100644
--- a/Makefile
+++ b/Makefile
@@ -721,6 +721,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..b143f62520
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,53 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static int dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+
+	return 0;
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	return p ? dump_mtimes(p) : 1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 3ce5585e53..1bb1c4b562 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -46,6 +46,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 9f0f522850..07a2d3f94e 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -35,6 +35,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.34.1.25.gb3157a20e6



* [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry()
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (5 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index bf45ffbc57..3fb10529ba 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.34.1.25.gb3157a20e6



* [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (6 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-06 21:44   ` Derrick Stolee
  2021-12-07 15:17   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
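
As a rough illustration of "noting and updating their mtimes": when the
same object is seen more than once (loose, in a pack, or in an older
cruft pack), only the most recent timestamp is retained. The sketch
below uses a hypothetical `struct sketch_entry` in place of the real
object entry and its side array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an object entry's recorded cruft mtime. */
struct sketch_entry {
	uint32_t mtime;
};

/* Remember `mtime` only if it is newer than what we already have. */
static void sketch_note_mtime(struct sketch_entry *e, uint32_t mtime)
{
	assert(e);
	if (mtime > e->mtime)
		e->mtime = mtime;
}
```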

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain (we'll call these the
    surviving ones) are marked as kept in-core, as are any packs that
    `pack-objects` found but the caller did not specify.

    Packs the caller did not specify are presumed to have entered the
    repository between the caller collecting packs and invoking
    `pack-objects`. Since we do not want to include objects in these
    packs (because we don't know which of their objects are or aren't
    reachable), they are marked as kept in-core, too.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
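
The pack classification performed by read_cruft_objects() in this patch
can be sketched roughly as below. This is a simplification under
assumptions: it scans an in-memory array of input lines rather than
sorted string lists, takes the first matching line, and the `sketch_*`
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_pack_state {
	SKETCH_UNKNOWN, /* not mentioned by the caller */
	SKETCH_FRESH,   /* listed plainly: will remain after the repack */
	SKETCH_DISCARD  /* listed with a '-' prefix: will be removed */
};

/* Classify `pack_name` against the lines the caller wrote to stdin. */
static enum sketch_pack_state sketch_classify(const char *pack_name,
					      const char *const *lines,
					      size_t nr)
{
	size_t i;

	assert(pack_name);

	for (i = 0; i < nr; i++) {
		const char *line = lines[i];

		if (!*line)
			continue; /* blank lines are skipped */
		if (*line == '-') {
			if (!strcmp(line + 1, pack_name))
				return SKETCH_DISCARD;
		} else if (!strcmp(line, pack_name)) {
			return SKETCH_FRESH;
		}
	}
	return SKETCH_UNKNOWN;
}

/*
 * Objects are left out of the cruft pack when their pack is treated as
 * kept in-core; only packs slated for removal contribute objects.
 */
static int sketch_kept_in_core(enum sketch_pack_state state)
{
	return state != SKETCH_DISCARD;
}
```

Treating unknown packs the same as surviving ones is the conservative
choice: their objects are simply left where they are.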

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  23 +++
 builtin/pack-objects.c             | 203 ++++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5327-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 442 insertions(+), 6 deletions(-)
 create mode 100755 t/t5327-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index dbfd1f9017..573c18afcd 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | base-name]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < object-list
 
@@ -95,6 +96,28 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Pack names provided over
+	stdin indicate which packs will remain after a `git repack`.
+	Pack names prefixed with a `-` indicate those which will be
+	removed. The contents of the cruft pack are all objects not
+	contained in the surviving packs which have not exceeded the
+	grace period (see `--cruft-expiration` below), or which have
+	exceeded the grace period, but are reachable from another
+	object which hasn't.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3fb10529ba..b12e79e4b1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				  struct packed_git *pack, off_t offset,
+				  const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = name && no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return 0;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return 0;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return 1;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4060,7 +4242,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4087,6 +4269,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4143,7 +4334,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4151,7 +4342,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
-	} else if (!use_internal_rev_list)
+	} else if (cruft)
+		read_cruft_objects();
+	else if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
 		get_object_list(rp.nr, rp.v);
diff --git a/object-file.c b/object-file.c
index c3d866a287..7ddb38b64a 100644
--- a/object-file.c
+++ b/object-file.c
@@ -956,7 +956,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index d87481f101..a79c1c91ab 100644
--- a/object-store.h
+++ b/object-store.h
@@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
new file mode 100755
index 0000000000..543a80e9bf
--- /dev/null
+++ b/t/t5327-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			rm -fr .git/logs &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			rm -fr .git/logs &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (7 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a couple of things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
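
To make the shape of that interface concrete, here is a minimal
standalone sketch. It is not the code in this patch: `toy_pack`,
`toy_object`, `toy_report_fn`, `toy_count_packed()` and
`toy_for_each_recent()` are hypothetical stand-ins, the extra `cb_data`
pointer is an assumption of the sketch (the callback added here takes no
such argument), and the kept check only looks at the object's own pack
rather than asking whether any kept pack contains the object.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for git's types; not the real definitions. */
struct toy_pack {
	const char *name;
	int kept_in_core;
};

struct toy_object {
	const char *id;
	struct toy_pack *pack;	/* NULL for a loose object */
	long offset;		/* only meaningful when pack != NULL */
	time_t mtime;
};

/* Reports one recent object; pack/offset describe where it was found. */
typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack, long offset,
			   time_t mtime, void *cb_data);

/* Example callback: count how many reported objects came from a pack. */
static void toy_count_packed(const struct toy_object *obj,
			     struct toy_pack *pack, long offset,
			     time_t mtime, void *cb_data)
{
	(void)obj;
	(void)offset;
	(void)mtime;
	if (pack)
		(*(size_t *)cb_data)++;
}

/*
 * Report every object newer than `cutoff`. When `skip_kept` is set,
 * objects living in a pack marked as kept in-core are skipped entirely.
 * Returns the number of objects reported.
 */
static size_t toy_for_each_recent(const struct toy_object *objs, size_t nr,
				  time_t cutoff, int skip_kept,
				  toy_report_fn *cb, void *cb_data)
{
	size_t i, reported = 0;

	assert(objs || !nr);
	for (i = 0; i < nr; i++) {
		const struct toy_object *obj = &objs[i];

		if (skip_kept && obj->pack && obj->pack->kept_in_core)
			continue;
		if (obj->mtime <= cutoff)
			continue;
		if (cb)
			cb(obj, obj->pack, obj->offset, obj->mtime, cb_data);
		reported++;
	}
	return reported;
}
```

A caller that does not care about timestamps can pass a NULL callback
and 0 for `skip_kept`, which mirrors how the existing callers are
updated in this patch.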

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b12e79e4b1..2c592d369a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (8 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
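
The lazy-loading behaviour described above can be sketched in isolation
roughly as follows. This is an illustration only: `toy_pack`,
`toy_load_mtimes()` and `toy_object_mtime()` are hypothetical names, and
the "file" is just an in-memory array (`mtimes_src`) standing in for the
on-disk .mtimes table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pack; not git's struct packed_git. */
struct toy_pack {
	uint32_t pack_mtime;		/* mtime of the pack file itself */
	int is_cruft;			/* set when a .mtimes file exists */
	const uint32_t *mtimes_src;	/* pretend on-disk .mtimes table */
	const uint32_t *mtimes;		/* NULL until loaded */
	uint32_t nr;			/* number of objects in the pack */
	int loads;			/* how often we really "parsed" */
};

/* Load the table on first use; later calls return immediately. */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (p->mtimes)
		return 0;	/* already loaded, nothing to do */
	if (!p->mtimes_src)
		return -1;	/* table is missing */
	p->mtimes = p->mtimes_src;
	p->loads++;
	return 0;
}

/*
 * Store the mtime to report for the object at `pos` in `*out`: the
 * per-object value for cruft packs, the pack's own mtime otherwise.
 * Returns 0 on success, -1 if the table could not be loaded.
 */
static int toy_object_mtime(struct toy_pack *p, uint32_t pos, uint32_t *out)
{
	assert(pos < p->nr);
	if (!p->is_cruft) {
		*out = p->pack_mtime;
		return 0;
	}
	if (toy_load_mtimes(p) < 0)
		return -1;
	*out = p->mtimes[pos];
	return 0;
}
```

Because the load is guarded by the "already loaded" check, calling
`toy_object_mtime()` once per object costs one real load per pack, which
is the property the first bullet point relies on.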

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (9 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-07 15:30   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever some
recent (i.e., non-prunable) unreachable object can still reach them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
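
As a rough illustration of the selection rule (and not of the code in
this patch), the following standalone sketch works on a toy object graph.
All names (`toy_object`, `toy_walk()`, `toy_select_cruft()`,
`TOY_MAX_LINKS`, `TOY_NONE`) are hypothetical, each object has at most
two outgoing links, and only membership in the cruft pack is tracked,
not the mtime that would be written for each object.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 2
#define TOY_NONE ((size_t)-1)

/* Hypothetical object: an mtime, a kept bit, and outgoing references. */
struct toy_object {
	long mtime;
	int in_kept_pack;
	size_t links[TOY_MAX_LINKS];	/* indices into the array, or TOY_NONE */
};

/* Walk from `idx`, selecting everything reached outside of kept packs. */
static void toy_walk(const struct toy_object *objs, size_t nr,
		     size_t idx, int *selected)
{
	size_t i;

	if (idx == TOY_NONE)
		return;
	assert(idx < nr);
	/* Stop at kept objects, and do not revisit selected ones. */
	if (objs[idx].in_kept_pack || selected[idx])
		return;
	selected[idx] = 1;
	for (i = 0; i < TOY_MAX_LINKS; i++)
		toy_walk(objs, nr, objs[idx].links[i], selected);
}

/*
 * Mark in `selected` every object that belongs in the cruft pack: the
 * non-kept objects newer than `cutoff`, plus anything they can reach
 * without passing through a kept object. Returns how many were marked.
 */
static size_t toy_select_cruft(const struct toy_object *objs, size_t nr,
			       long cutoff, int *selected)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		selected[i] = 0;
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > cutoff)
			toy_walk(objs, nr, i, selected);
	for (i = 0; i < nr; i++)
		count += selected[i] ? 1 : 0;
	return count;
}
```

Note that the sketch never consults the mtime of an object reached during
the walk. Running it again later with the same cutoff selects the same
old objects for as long as the recent tip that reaches them is still
recent, which is the reason the rescued object's mtime does not need to
be updated.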

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 t/t5327-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2c592d369a..a38fa34479 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static int add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return 1;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 543a80e9bf..31d4a561fe 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.34.1.25.gb3157a20e6


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (10 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-05 20:46   ` Junio C Hamano
  2021-12-07 15:38   ` Derrick Stolee
  2021-11-29 22:25 ` [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                   ` (9 subsequent siblings)
  21 siblings, 2 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`.
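
The new write_cruft_pack() helper drives `git pack-objects --cruft` by
writing one pack name per line to its standard input, prefixing the
packs that are about to be replaced with a `-`. A minimal sketch of that
line format, under the assumption that it is built into a caller-supplied
buffer (`toy_cruft_input()` and its parameters are hypothetical; the
patch itself writes straight to the child's stdin with fprintf()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the pack list into `buf`: one "<name>.pack" line per pack to
 * keep, then one "-<name>.pack" line per pack to discard. Returns 0 on
 * success, or -1 if `buf` is too small to hold the whole list.
 */
static int toy_cruft_input(char *buf, size_t size,
			   const char *const *keep, size_t nr_keep,
			   const char *const *discard, size_t nr_discard)
{
	size_t used = 0, i;
	int n;

	assert(buf && size > 0);
	buf[0] = '\0';

	for (i = 0; i < nr_keep; i++) {
		n = snprintf(buf + used, size - used, "%s.pack\n", keep[i]);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	for (i = 0; i < nr_discard; i++) {
		n = snprintf(buf + used, size - used, "-%s.pack\n",
			     discard[i]);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

With one freshly written pack and two existing packs, the buffer would
hold the new pack's name on the first line followed by two lines starting
with `-`, which is the same layout the tests in t5327 feed to
`git pack-objects --cruft` by hand.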

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 112 ++++++++++++++++-
 t/t5327-pack-objects-cruft.sh           | 153 ++++++++++++++++++++++++
 4 files changed, 272 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 7183fb498f..4f8f4b5a1f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index bb54cce1b1..b7daad2e3e 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -16,7 +16,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index acbb7b8c3b..68b4bdf06f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -298,9 +305,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -337,6 +341,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -598,6 +604,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -614,7 +681,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -630,6 +696,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT | ALL_INTO_ONE),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -681,6 +752,14 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (keep_unreachable &&
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("--keep-unreachable and -A are incompatible"));
+	if (pack_everything & PACK_CRUFT && delete_redundant) {
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("--cruft and -A are incompatible"));
+		if (keep_unreachable)
+			die(_("--cruft and -k are incompatible"));
+		if (!(pack_everything & ALL_INTO_ONE))
+			die(_("--cruft must be combined with all-into-one"));
+	}
 
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
@@ -763,7 +842,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -798,6 +878,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		return ret;
 
 	if (geometry) {
+		struct packed_git *p;
 		FILE *in = xfdopen(cmd.in, "w");
 		/*
 		 * The resulting pack should contain all objects in packs that
@@ -808,6 +889,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
 		for (i = geometry->split; i < geometry->pack_nr; i++)
 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+
+		for (p = get_all_packs(the_repository); p; p = p->next) {
+			if (!p->is_cruft)
+				continue;
+			fprintf(in, "^%s\n", pack_basename(p));
+		}
 		fclose(in);
 	}
 
@@ -825,6 +912,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 31d4a561fe..ed1a113ab6 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		rm -fr .git/logs &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		rm -fr .git/logs &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6



* [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (11 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
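
As a rough illustration of the fallback this patch wires up (a minimal
sketch, not the code in builtin/repack.c): a cruft-specific value wins
when it is set, and anything left unset is inherited from the arguments
used for the non-cruft pack. The struct and function names below are
hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for 'struct pack_objects_args'. */
struct po_args_sketch {
	const char *window;
	const char *depth;
};

/* Prefer the cruft-specific value; fall back to the general one. */
const char *pick_arg(const char *cruft_value, const char *general_value)
{
	return cruft_value ? cruft_value : general_value;
}

/* Fill in any cruft parameter that was not configured explicitly. */
void inherit_unset(struct po_args_sketch *cruft,
		   const struct po_args_sketch *general)
{
	assert(cruft && general);
	cruft->window = pick_arg(cruft->window, general->window);
	cruft->depth = pick_arg(cruft->depth, general->depth);
}
```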

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5327-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index 68b4bdf06f..cefa906344 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -687,6 +696,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -741,7 +751,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -920,7 +930,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index ed1a113ab6..750e9d6d6f 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -511,4 +511,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6



* [PATCH 14/17] builtin/repack.c: use named flags for existing_packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (12 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
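
The idea of keeping several independent flag bits in a `void *` slot can
be sketched as below. This is an illustration under assumptions: the
struct is a hypothetical stand-in for a string list item, and the helper
names do not exist in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1

/* Hypothetical stand-in for an item carrying a 'util' pointer. */
struct item_sketch {
	const char *string;
	void *util;
};

/* Set a flag bit while preserving any bits already stored in util. */
void mark(struct item_sketch *item, uintptr_t flag)
{
	assert(item);
	item->util = (void *)((uintptr_t)item->util | flag);
}

/* Test one flag bit without disturbing the others. */
int is_marked(const struct item_sketch *item, uintptr_t flag)
{
	assert(item);
	return ((uintptr_t)item->util & flag) != 0;
}
```

Because each check masks a single bit, a second flag can later share
the same field without changing what the existing checks mean.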

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index cefa906344..cd4d789d27 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -559,7 +561,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1002,7 +1004,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1026,7 +1029,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.34.1.25.gb3157a20e6



* [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (13 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when
doing a geometric repack. Since the "make that object reachable" step
doesn't necessarily mean that we'll create a new copy of that object in
one of the packs that will get rolled up as part of a geometric repack,
it's possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
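
The resulting selection rule can be sketched as follows. This models
only the pass over existing non-kept packs; the packs retained by the
geometric split itself are added separately and are not shown. The
function name is hypothetical.

```c
#define DELETE_PACK 1
#define CRUFT_PACK 2

/*
 * Decide whether an existing non-kept pack with the given flags is
 * listed in the MIDX.
 *
 * - geometric repack: only cruft packs are picked up by this pass
 * - otherwise: every pack that is not about to be deleted
 */
int midx_includes_existing(unsigned flags, int geometric)
{
	if (geometric)
		return (flags & CRUFT_PACK) != 0;
	return !(flags & DELETE_PACK);
}
```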

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5327-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index cd4d789d27..5a201063e7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -559,6 +563,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 750e9d6d6f..857f9e8855 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -594,4 +594,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6



* [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (14 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-11-29 22:25 ` [PATCH 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
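
How `git gc` picks between the repack modes can be summarized with the
sketch below. It mirrors the branch order in this patch, but the enum
and function names are hypothetical and the real code pushes
command-line arguments rather than returning a value.

```c
#include <stddef.h>
#include <string.h>

enum repack_mode_sketch {
	REPACK_DROP_UNREACHABLE,   /* -a */
	REPACK_CRUFT,              /* --cruft [--cruft-expiration=<date>] */
	REPACK_LOOSEN_UNREACHABLE  /* -A [--unpack-unreachable=<date>] */
};

/* prune_expire may be NULL when no expiry is configured. */
enum repack_mode_sketch choose_repack_mode(const char *prune_expire,
					   int cruft_packs)
{
	/* Expiring "now" leaves nothing to keep around as cruft. */
	if (prune_expire && !strcmp(prune_expire, "now"))
		return REPACK_DROP_UNREACHABLE;
	if (cruft_packs)
		return REPACK_CRUFT;
	return REPACK_LOOSEN_UNREACHABLE;
}
```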

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5327-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them as individual loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index bcef6a4c8d..c16cef0285 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -42,6 +42,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -152,6 +153,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -331,7 +333,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -550,6 +556,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -668,6 +675,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 857f9e8855..4cd0f0cf57 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		rm -fr .git/logs &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.34.1.25.gb3157a20e6



* [PATCH 17/17] sha1-file.c: don't freshen cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (15 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2021-11-29 22:25 ` Taylor Blau
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-11-29 22:25 UTC (permalink / raw)
  To: git; +Cc: gitster, larsxschneider, peff, tytso

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file,
which may incur some wasted effort, especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, force the freshening code to avoid an optimizing write by
writing out the object loose and letting it pick up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (regardless of whether the packed
copy comes from a cruft pack).
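
The decision described above can be sketched as below. This is a
simplified model, not the function in object-file.c: the struct is a
hypothetical stand-in, and the step that touches the pack file on disk
is assumed to always succeed.

```c
#include <stddef.h>

/* Hypothetical stand-in for the pack that holds the object. */
struct pack_sketch {
	int is_cruft;
	int freshened;
};

/*
 * Returns 1 if the packed copy was freshened, so no loose write is
 * needed, or 0 if the caller should write the object out loose.
 * Pass NULL when the object is not in any pack.
 */
int freshen_packed_sketch(struct pack_sketch *p)
{
	if (!p)
		return 0;
	if (p->is_cruft)
		return 0; /* force a loose copy with a current mtime */
	if (p->freshened)
		return 1; /* this pack was already touched */
	p->freshened = 1; /* assumption: updating the pack mtime worked */
	return 1;
}
```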

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5327-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index 7ddb38b64a..dddc1bdd2c 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1946,6 +1946,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 4cd0f0cf57..ff87701bbf 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -657,4 +657,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.34.1.25.gb3157a20e6


* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2021-12-02 14:33   ` Derrick Stolee
  2021-12-03 21:53     ` Taylor Blau
  2021-12-04 22:20   ` Elijah Newren
  1 sibling, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-02 14:33 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Whether cruft packs should be incremental or not.

It was not obvious from this sentence that "incremental" meant that
we could store a number of cruft packs and use the mtime of each pack
as the time for all contained objects.

> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Incremental cruft packs (i.e., where each time a repository is repacked a new
> +cruft pack is generated containing only the unreachable objects introduced since
> +the last time a cruft pack was written) are significantly more complicated to
> +construct, and so aren't pursued here. The obvious drawback to the current
> +implementation is that the entire cruft pack must be re-written from scratch.

But you seem to be pointing in that direction here. The difference is
that you don't discuss how a list of cruft packs could avoid the .mtimes
file.

I think what is hidden underneath "significantly more complicated to
construct" are situations such as "this object was in an old cruft
pack, but then became reachable, but now is unreachable again". I'll
try to remember to come back to this after seeing the situations you
cover in your tests.

Thanks,
-Stolee


* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2021-12-02 15:06   ` Derrick Stolee
  2021-12-02 22:32     ` brian m. carlson
  2021-12-03 22:24     ` Taylor Blau
  0 siblings, 2 replies; 201+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:06 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso, brian m. carlson

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +== pack-*.mtimes files have the format:
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

I vaguely remember complaints about using a 1-byte identifier in
the commit-graph and multi-pack-index formats because the "standard"
way to refer to these hash functions was a magic number that had a
meaning in ASCII that helped human readers a bit. I cannot find an
example of such 4-byte identifiers, but perhaps brian (CC'd) could
remind us.

You are using a 4-byte identifier, but using the same values as
those 1-byte identifiers.

> +  - A table of mtimes (one per packed object, num_objects in total, each
> +    a 4-byte unsigned integer in network order), in the same order as
> +    objects appear in the index file (e.g., the first entry in the mtime
> +    table corresponds to the object with the lowest lexically-sorted
> +    oid). The mtimes count standard epoch seconds.

This paragraph seemed awkward. Here is a rephrasing that might be
less awkward:

 - A table of 4-byte unsigned integers in network order. The ith value
   is the modified time (mtime) of the ith object of the corresponding
   pack in lexicographic order. The mtime represents standard epoch
   seconds.

Storing these mtimes in 32 bits means we will hit the 2038 problem.
The commit-graph stores commit times with an extra two bits to extend
the lifetime by another hundred years or so.

Could we extend the lifetime of cruft packs by decreasing the granularity
here? Should 'mtime' store a number of _minutes_ instead of seconds? That
should be enough granularity for these purposes.
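
For a sense of scale, here is a back-of-the-envelope sketch (an
illustration only; the helper name and the average-year constant are
assumptions, not part of the series). Note that the table as quoted
holds unsigned 4-byte values, which by this estimate reach roughly 2106
when counting seconds, while a signed 32-bit count is what runs out in
2038.

```c
#include <stdint.h>

/*
 * Approximate last calendar year representable when a counter with
 * maximum value 'units_max' counts units of 'seconds_per_unit' seconds
 * since 1970. Uses the average Gregorian year (365.2425 days).
 */
uint64_t approx_last_year(uint64_t units_max, uint64_t seconds_per_unit)
{
	const uint64_t seconds_per_year = 31556952;
	return 1970 + (units_max * seconds_per_unit) / seconds_per_year;
}
```

By the same estimate, counting minutes in an unsigned 32-bit field
would reach roughly the year 10136.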

> +  - A trailer, containing a:
> +
> +    checksum of the corresponding packfile, and
> +
> +    a checksum of all of the above.

Could you specify the checksum as having length according to the
specified hash function?
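
If the trailer checksums are sized by the hash function, the expected
file size follows directly from the quoted layout. The sketch below
assumes a 12-byte header, 20-byte SHA-1 and 32-byte SHA-256 checksums,
and uses hypothetical function names.

```c
#include <stddef.h>
#include <stdint.h>

/* Checksum length for a hash identifier; 0 if the id is unknown. */
size_t hash_size_for_id(uint32_t hash_id)
{
	switch (hash_id) {
	case 1:
		return 20; /* SHA-1 */
	case 2:
		return 32; /* SHA-256 */
	default:
		return 0;
	}
}

/*
 * Expected size of a .mtimes file: three 4-byte header fields, one
 * 4-byte mtime per object, and two trailing checksums. Returns 0 for
 * an unknown hash identifier.
 */
uint64_t expected_mtimes_size(uint32_t num_objects, uint32_t hash_id)
{
	size_t hashsz = hash_size_for_id(hash_id);

	if (!hashsz)
		return 0;
	return 12 + 4 * (uint64_t)num_objects + 2 * (uint64_t)hashsz;
}
```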

> +All 4-byte numbers are in network order.
> +

Maybe this could be at the start of the format, since the file
version and hash function are both 4-byte numbers here and we
could remove the mention of network order from the mtime values.

> +static char *pack_mtimes_filename(struct packed_git *p)
> +{
> +	size_t len;
> +	if (!strip_suffix(p->pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> +}

I see your NEEDSWORK here and you are probably referring to this:

static char *pack_revindex_filename(struct packed_git *p)
{
	size_t len;
	if (!strip_suffix(p->pack_name, ".pack", &len))
		BUG("pack_name does not end in .pack");
	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
}

and the implementation is identical except for the new suffix
(which exists in the exts[] array in builtin/repack.c, but could
also be pulled out into a header somewhere).

I'm happy to delay any cleanup of these code clones until later,
if at all, because doing it right might mean moving more code
than we like. Such refactorings aren't worth it most of the time.

> +static int load_pack_mtimes_file(char *mtimes_file,
> +				 uint32_t num_objects,
> +				 const uint32_t **data_p, size_t *len_p)
> +{

> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);

This message could be more informative: "mtimes file %s has the wrong size"?
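
As a sketch of what a size-specific message could report, assume the
layout from the format description: three 4-byte header fields, one
4-byte mtime per object, and two trailing checksums of the hash's raw
length. The expected size then follows directly. The names
expected_mtimes_size() and MTIMES_HEADER_BYTES are illustrative and
not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature, version and hash ID: three 4-byte fields (assumed layout). */
#define MTIMES_HEADER_BYTES 12

/*
 * Expected size in bytes of a .mtimes file for `num_objects` objects
 * when checksums are `hash_len` bytes long. This sketch does no
 * overflow checking; the real code uses st_mult() for that.
 */
static size_t expected_mtimes_size(uint32_t num_objects, size_t hash_len)
{
	assert(hash_len > 0);
	return MTIMES_HEADER_BYTES +
	       (size_t)num_objects * sizeof(uint32_t) +
	       2 * hash_len;
}
```

With both numbers in hand, the error could print the expected and
actual sizes rather than only saying that the file is corrupt.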

> +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> +		goto cleanup;
> +	}

Interesting that you defined 'struct mtimes_header' before this
method, but don't use it here (in favor of moving a uint32_t
pointer). Perhaps you are avoiding pointing the struct at the
memory map, but you could also do this:

	struct mtimes_header header;

	header.signature = ntohl(hdr[0]);
	header.version = ntohl(hdr[1]);
	header.hash_id = ntohl(hdr[2]);

And then operate on the struct for your validation.

At the very least, 'struct mtimes_header' is defined but not
used in this patch. If you decide to not use it this way, then
maybe delay its definition.

> +
> +	if (ntohl(*++hdr) != 1) {
> +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
> +			    mtimes_file, ntohl(*hdr));

Unlike the commit-graph, if we don't understand the version we
cannot simply ignore the data. error() is appropriate here.

> +int load_pack_mtimes(struct packed_git *p)
> +{
> +	char *mtimes_name = NULL;
> +	int ret = 0;
> +
> +	if (!p->is_cruft)
> +		return ret; /* not a cruft pack */

Interesting that this indicator is essentially "we have an mtimes
file for this pack", but it makes sense to include that check next
to the .keep and .promisor checks.

> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->mtimes_map)
> +		BUG("pack .mtimes file not loaded for %s", p->pack_name);
> +	if (p->num_objects <= pos)
> +		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
> +		    pos, p->num_objects);
> +
> +	return get_be32(p->mtimes_map + pos + 3);
> +}

A nice safe access method. Good.

> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};

(Speaking of that refactoring earlier, here is a second definition of
exts[] that would be valuable to unify.)

The hunks I did not comment on look good. Nice standard file format
stuff.

Thanks,
-Stolee

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2021-12-02 15:22   ` Derrick Stolee
  2021-12-03 22:40     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:22 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> There are three definitions of an identical function which converts
> `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
> copy of this function for writing both the commit-graph and
> multi-pack-index file, and another inline definition used to write the
> .rev header.
> 
> Consolidate these into a single definition in chunk-format.h. It's not
> clear that this is the best header to define this function in, but it
> should do for now.

Thanks for consolidating these!
 
> (Worth noting, the .rev caller expects a 4-byte unsigned, but the other
> two callers work with a single unsigned byte. The consolidated version
> uses the latter type, and lets the compiler widen it when required).
> 
> Another caller will be added in a subsequent patch.

>  chunk-format.c | 12 ++++++++++++
>  chunk-format.h |  3 +++
>  commit-graph.c | 18 +++---------------
>  midx.c         | 18 +++---------------
>  pack-write.c   | 15 ++-------------

I notice that you don't use this in load_pack_mtimes_file(),
in pack-mtimes.c but you could at this point.

The code you do touch looks good.

Thanks,
-Stolee

* Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2021-12-02 15:36   ` Derrick Stolee
  2021-12-03 23:04     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-02 15:36 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> @@ -168,6 +168,9 @@ struct packing_data {
>  	/* delta islands */
>  	unsigned int *tree_depth;
>  	unsigned char *layer;
> +
> +	/* cruft packs */
> +	uint32_t *cruft_mtime;

This comment is a bit terse. Perhaps...

	/* Used when writing cruft packs. */

> +static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
> +				      struct object_entry *e)
> +{
> +	if (!pack->cruft_mtime)
> +		return 0;
> +	return pack->cruft_mtime[e - pack->objects];
> +}

When writing a pack, it appears that the cruft_mtime array
maps to objects in pack-order, not idx-order, correct? That
might be worth mentioning in the struct definition because
it differs from the .mtimes file.

> +static void write_mtimes_objects(struct hashfile *f,
> +				 struct packing_data *to_pack,
> +				 struct pack_idx_entry **objects,
> +				 uint32_t nr_objects)
> +{
> +	uint32_t i;
> +	for (i = 0; i < nr_objects; i++) {
> +		struct object_entry *e = (struct object_entry*)objects[i];
> +		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
> +	}

The name "objects" here confused me at first, thinking it
corresponded to the objects member of 'struct packing_data', but
that is being handled by the fact that 'objects' is actually a
lex-sorted list of pack_idx_entry pointers (and they happen to
also point to 'struct object_entry' values because the 'struct
pack_idx_entry' is the first member).

So this is (very densely) handling the translation from pack-order
to lex-order through the double pointer 'objects'. I'm not sure if
there is a way to make it more clear or if every reader will need
to do the same mental gymnastics I had to do.
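
To spell the translation out, here is a self-contained sketch using
stand-in types. struct idx_entry and struct obj_entry are simplified,
hypothetical versions of 'struct pack_idx_entry' and 'struct
object_entry', and mtimes_in_lex_order() is not a function from the
patch. The entries live in an array in pack order, a second array of
pointers is sorted by oid, and pointer subtraction recovers each
entry's pack-order index:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct idx_entry { unsigned char oid[4]; };	/* stand-in, short oid */
struct obj_entry {
	struct idx_entry idx;	/* must be the first member */
	int other;
};

/* Compare two 'struct idx_entry *' elements by oid. */
static int cmp_oid(const void *a, const void *b)
{
	const struct idx_entry *x = *(const struct idx_entry *const *)a;
	const struct idx_entry *y = *(const struct idx_entry *const *)b;
	return memcmp(x->oid, y->oid, sizeof(x->oid));
}

/*
 * `objects` and `mtimes` are parallel arrays in pack order. Fill `out`
 * with the same mtimes in lexicographic (index) order.
 * Returns 0 on success, -1 if allocation fails.
 */
static int mtimes_in_lex_order(struct obj_entry *objects,
			       const uint32_t *mtimes,
			       uint32_t nr, uint32_t *out)
{
	struct idx_entry **sorted;
	uint32_t i;

	if (!nr)
		return 0;
	sorted = malloc(nr * sizeof(*sorted));
	if (!sorted)
		return -1;
	for (i = 0; i < nr; i++)
		sorted[i] = &objects[i].idx;
	qsort(sorted, nr, sizeof(*sorted), cmp_oid);

	for (i = 0; i < nr; i++) {
		/* Valid because idx is the first member of obj_entry. */
		struct obj_entry *e = (struct obj_entry *)sorted[i];
		out[i] = mtimes[e - objects];	/* pack-order index */
	}
	free(sorted);
	return 0;
}
```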

> +}
> +
> +static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
> +{
> +	hashwrite(f, hash, the_hash_algo->rawsz);
> +}
> +
> +static const char *write_mtimes_file(const char *mtimes_name,
> +				     struct packing_data *to_pack,
> +				     struct pack_idx_entry **objects,
> +				     uint32_t nr_objects,
> +				     const unsigned char *hash)
> +{
> +	struct hashfile *f;
> +	int fd;
> +
> +	if (!to_pack)
> +		BUG("cannot call write_mtimes_file with NULL packing_data");
> +
> +	if (!mtimes_name) {
> +		struct strbuf tmp_file = STRBUF_INIT;
> +		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
> +		mtimes_name = strbuf_detach(&tmp_file, NULL);
> +	} else {
> +		unlink(mtimes_name);
> +		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +	}
> +	f = hashfd(fd, mtimes_name);
> +
> +	write_mtimes_header(f);
> +	write_mtimes_objects(f, to_pack, objects, nr_objects);
> +	write_mtimes_trailer(f, hash);
> +
> +	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
> +		die(_("failed to make %s readable"), mtimes_name);

What could cause 'mtimes_name' to be NULL here? It seems that it would
be initialized in the "if (!mtimes_name)" block above.

> +
> +	finalize_hashfile(f, NULL,
> +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
> +
> +	return mtimes_name;

Note that you return the name here...

> +	if (pack_idx_opts->flags & WRITE_MTIMES) {
> +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
> +						    nr_written,
> +						    hash);
> +		if (adjust_shared_perm(mtimes_tmp_name))
> +			die_errno("unable to make temporary mtimes file readable");

...and then adjust the perms again. I think that this adjustment is
redundant, because it already happened within the write_mtimes_file()
method.

> +	}
> +
>  	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
>  	if (rev_tmp_name)
>  		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
> +	if (mtimes_tmp_name)
> +		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");

And then it is finally renamed here, if it had a temporary name to
start.

Thanks,
-Stolee

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-02 15:06   ` Derrick Stolee
@ 2021-12-02 22:32     ` brian m. carlson
  2021-12-03 22:24     ` Taylor Blau
  1 sibling, 0 replies; 201+ messages in thread
From: brian m. carlson @ 2021-12-02 22:32 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, gitster, larsxschneider, peff, tytso

On 2021-12-02 at 15:06:07, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> 
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> 
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
> 
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

The preferred value is the_hash_algo->format_id.  For SHA-1, that's
"sha1", big-endian (0x73686131) and for SHA-256 it's "s256", big-endian
(0x73323536).

There's also hash_algo_by_id to turn the format ID into an index into
the hash_algos array, but you need to check for GIT_HASH_UNKNOWN (0)
first.

These will be used in index v3, which I haven't sent out patches for
yet.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


* Re: [PATCH 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (16 preceding siblings ...)
  2021-11-29 22:25 ` [PATCH 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2021-12-03 19:51 ` Junio C Hamano
  2021-12-03 20:08   ` Taylor Blau
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2021-12-03 19:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, larsxschneider, peff, tytso

Taylor Blau <me@ttaylorr.com> writes:

> This series implements "cruft packs", a pack which stores accumulated
> unreachable objects, along with a new ".mtimes" file which tracks each
> object's last known modification time.

Let me rephrase the above to test my understanding, since I need to
write a summary for the "What's cooking" report.

 Instead of leaving unreachable objects in loose form when packing,
 or ejecting them into loose form when repacking, gather them in a
 packfile with an auxiliary file that records the last-use time of
 these objects.

That way, we do not have to waste so many inodes for loose objects
that are not likely to be used, which feels like a win.

>   - The final patch handles object freshening for objects stored in a
>     cruft pack.

I am not going to read it today, but I think this is the most
interesting part of the series.  Instead of using mtime of an
individual loose object file, we'd need to record the time of
last use for each object in a pack.

Stepping back a bit, I do not see how we can get away without doing
the same .mtimes file for non-cruft packs.  An object that is in a
non-cruft pack may be referenced immediately after the repack that
created the pack, but the ref that was referencing the object may
have gone away and now the pack is a month old.  If we were to
repack the object, we do not know when was the last time the object
was reachable from any of the refs and index entries (collectively
known as anchor points).  Of course, recording all mtimes for all
packed objects all the time would involve quite a lot of overhead.
I am guessing (I will not spend time today to figure it out myself)
that .mtimes update at runtime will happen in-place (i.e. via
seek(2)+write(2), or pwrite()), and I wonder what the safety concern
would be (which is the primary reason why we tend not to do in-place
updates but recreate-and-rename updates).

Thanks for working on such an interesting topic.

* Re: [PATCH 00/17] cruft packs
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
@ 2021-12-03 20:08   ` Taylor Blau
  2021-12-03 20:47     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 20:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This series implements "cruft packs", a pack which stores accumulated
> > unreachable objects, along with a new ".mtimes" file which tracks each
> > object's last known modification time.
>
> Let me rephrase the above to test my understanding, since I need to
> write a summary for the "What's cooking" report.
>
>  Instead of leaving unreachable objects in loose form when packing,
>  or ejecting them into loose form when repacking, gather them in a
>  packfile with an auxiliary file that records the last-use time of
>  these objects.

Exactly. Thanks for such a concise and accurate description of the
topic.

> That way, we do not have to waste so many inodes for loose objects
> that are not likely to be used, which feels like a win.

Yes. This had historically been a problem for GitHub. We don't
automatically prune unreachable objects during repacking, but sometimes
customers will ask us to do it on their behalf (if, for example, they
accidentally pushed sensitive information to us, and then force-pushed
over it).

But occasionally we'd get bitten by exploding many years of loose
objects (because we used to freshen packfiles too aggressively when
moving them around).

We've been running this series in production for the past few months,
and it's been a huge relief on the folks who typically run these pruning
GCs.

> >   - The final patch handles object freshening for objects stored in a
> >     cruft pack.
>
> I am not going to read it today, but I think this is the most
> interesting part of the series.  Instead of using mtime of an
> individual loose object file, we'd need to record the time of
> last use for each object in a pack.
>
> Stepping back a bit, I do not see how we can get away without doing
> the same .mtimes file for non-cruft packs.  An object that is in a
> non-cruft pack may be referenced immediately after the repack that
> created the pack, but the ref that was referencing the object may
> have gone away and now the pack is a month old.  If we were to
> repack the object, we do not know when was the last time the object
> was reachable from any of the refs and index entries (collectively
> known as anchor points).

In that situation, we would use the mtime of the pack which contains
that object itself as a proxy (or the mtime of a loose copy of the
object, if it is more recent).

That isn't perfect, as you note, since if the pack isn't otherwise
freshened, we'd consider that object to be a month old, even if the
reference pointing at it was deleted a mere second ago.

I can't recall if Peff and I talked about this off-list, but I have a
vague sense we probably did (and I forgot the details).

> Of course, recording all mtimes for all
> packed objects all the time would involve quite a lot of overhead.
> I am guessing (I will not spend time today to figure it out myself)
> that .mtimes update at runtime will happen in-place (i.e. via
> seek(2)+write(2), or pwrite()), and I wonder what the safety concern
> would be (which is the primary reason why we tend not to do in-place
> updates but recreate-and-rename updates).

Yeah, this series avoids doing an in-place update, and similarly avoids
recreating the entire .mtimes file before moving into place. Instead,
freshening an object stored in a cruft pack takes place by rewriting a
copy of the object loose, since we consider an object's mtime to be the
most recent of (a) what's in the .mtimes file, (b) the mtime of the
containing pack, and (c) the mtime of a loose copy (if one exists).
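
In code, that rule amounts to taking a maximum. This is a minimal
sketch: effective_mtime() and its parameters are hypothetical names,
not functions from this series:

```c
#include <stdint.h>
#include <time.h>

/*
 * An object's effective mtime is the most recent of (a) the value
 * recorded in the .mtimes file, (b) the mtime of the containing pack,
 * and (c) the mtime of a loose copy, if one exists.
 */
static time_t effective_mtime(uint32_t recorded_mtime, time_t pack_mtime,
			      int has_loose_copy, time_t loose_mtime)
{
	time_t newest = (time_t)recorded_mtime;

	if (pack_mtime > newest)
		newest = pack_mtime;
	if (has_loose_copy && loose_mtime > newest)
		newest = loose_mtime;
	return newest;
}
```

Freshening by writing a loose copy therefore raises (c) without
touching the cruft pack or its .mtimes file.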

It can be wasteful, but in practice "resurrecting" an object in a cruft
pack is pretty rare, so on balance it ends up costing less work to do.

> Thanks for working on such an interesting topic.

I'm glad to have piqued your interest.

Taylor

* Re: [PATCH 00/17] cruft packs
  2021-12-03 20:08   ` Taylor Blau
@ 2021-12-03 20:47     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 20:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Fri, Dec 03, 2021 at 03:08:00PM -0500, Taylor Blau wrote:
> On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> > Stepping back a bit, I do not see how we can get away without doing
> > the same .mtimes file for non-cruft packs.  An object that is in a
> > non-cruft pack may be referenced immediately after the repack that
> > created the pack, but the ref that was referencing the object may
> > have gone away and now the pack is a month old.  If we were to
> > repack the object, we do not know when was the last time the object
> > was reachable from any of the refs and index entries (collectively
> > known as anchor points).
>
> In that situation, we would use the mtime of the pack which contains
> that object itself as a proxy (or the mtime of a loose copy of the
> object, if it is more recent).
>
> That isn't perfect, as you note, since if the pack isn't otherwise
> freshened, we'd consider that object to be a month old, even if the
> reference pointing at it was deleted a mere second ago.
>
> I can't recall if Peff and I talked about this off-list, but I have a
> vague sense we probably did (and I forgot the details).

Maybe I can rephrase the problem as being orthogonal to what we're
addressing here. Modification time can be a useful-ish proxy for "last
referenced time", but they are ultimately different.

Forgetting cruft packs for a moment, our behavior today in that
situation would be to prune the object if our grace period did not cover
the time in which the pack was last modified. So if the pack was a month
old, the grace period was two weeks, but the reference pointing at some
object in that pack was deleted only a second before starting a pruning
GC, we'd prune that object before this series (just as we would do the
same thing with this series).
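
A sketch of that decision, with the example's numbers plugged in
below. is_prunable() is a hypothetical name, and the strict comparison
at the boundary is an assumption rather than a statement about what
Git does:

```c
#include <time.h>

/*
 * An object is eligible for pruning when its (proxy) mtime is older
 * than the start of the grace period, regardless of when the last
 * reference to it went away.
 */
static int is_prunable(time_t mtime, time_t now, time_t grace_period)
{
	return mtime < now - grace_period;
}
```

With a pack mtime 30 days in the past and a 14-day grace period the
object is prunable, even if the ref pointing at it was deleted one
second ago, because the ref deletion does not update any mtime.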

Aside from pruning, what happens to the value recorded in the .mtimes
file is more interesting. For the case you're talking about, we'll err
on the side of newer mtimes (either the original timestamp is recorded,
or some future time when the containing pack was rewritten). But the
more interesting case is when an object becomes re-referenced. Since the
ref-update doesn't cause the object to be rewritten, we wouldn't change
the timestamp.

Anyway, both of these are still independent from cruft packs, so we're
not changing the status quo there, I don't think.

Thanks,
Taylor

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-12-02 14:33   ` Derrick Stolee
@ 2021-12-03 21:53     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 21:53 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 09:33:51AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > +Notable alternatives to this design include:
> > +
> > +  - The location of the per-object mtime data, and
> > +  - Whether cruft packs should be incremental or not.
>
> It was not obvious from this sentence that "incremental" meant that
> we could store a number of cruft packs and use the mtime of each pack
> as the time for all contained objects.

Yes, I think I meant "incremental" in the sense of "incremental commit-
graphs". But it's clearer to say "storing unreachable objects in
multiple cruft packs" (and then giving an example later on). Thanks!

> I think what is hidden underneath "significantly more complicated to
> construct" are situations such as "this object was in an old cruft
> pack, but then became reachable, but now is unreachable again". I'll
> try to remember to come back to this after seeing the situations you
> cover in your tests.

Yeah, I'm being deliberately vague here, since the aim of this paragraph
is to illustrate "this is much more complicated than what we implement
here, and the trade-offs are..."

Thanks,
Taylor

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-02 15:06   ` Derrick Stolee
  2021-12-02 22:32     ` brian m. carlson
@ 2021-12-03 22:24     ` Taylor Blau
  2022-01-07 19:41       ` Taylor Blau
  1 sibling, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 22:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, gitster, larsxschneider, peff, tytso, brian m. carlson

On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
>
> > +== pack-*.mtimes files have the format:
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>
> I vaguely remember complaints about using a 1-byte identifier in
> the commit-graph and multi-pack-index formats because the "standard"
> way to refer to these hash functions was a magic number that had a
> meaning in ASCII that helped human readers a bit. I cannot find an
> example of such 4-byte identifiers, but perhaps brian (CC'd) could
> remind us.
>
> You are using a 4-byte identifier, but using the same values as
> those 1-byte identifiers.

Yeah, I'm definitely borrowing from the commit-graph and multi-pack
index formats here. Though I believe we did the same thing for .rev
files, too (and checking with Documentation/technical/pack-format.txt
confirms as much).

I don't have a strong feeling about using the 4-byte identifier or not.
But making this field four bytes wide is very much intentional, since it
makes sure that all of our reads are aligned, which should yield much
better cache performance (assuming the page size is also a multiple of
four).

If others feel strongly, though, we could write the magic
identifiers brian points out downthread here instead. (It would be
mildly inconvenient for GitHub, which has many hundreds of thousands of
these files lying around everywhere with '1' as the identifier. But
since the magic identifiers don't collide with the values proposed here,
GitHub's fork could easily be taught to accept both on the reading side,
but only write out the special identifier.)
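
A reader that accepts both encodings could look roughly like this.
enum hash_kind and decode_hash_id() are hypothetical names for the
sketch; the constants are the ones mentioned in this thread:

```c
#include <stdint.h>

enum hash_kind { HASH_UNKNOWN = 0, HASH_SHA1, HASH_SHA256 };

/*
 * Map the 4-byte hash function identifier (already converted to host
 * byte order) to a hash kind, accepting both the small integers and
 * the ASCII format IDs. The two sets of values cannot collide.
 */
static enum hash_kind decode_hash_id(uint32_t id)
{
	switch (id) {
	case 1:			/* small-integer identifier */
	case 0x73686131:	/* "sha1" */
		return HASH_SHA1;
	case 2:			/* small-integer identifier */
	case 0x73323536:	/* "s256" */
		return HASH_SHA256;
	default:
		return HASH_UNKNOWN;
	}
}
```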

> > +  - A table of mtimes (one per packed object, num_objects in total, each
> > +    a 4-byte unsigned integer in network order), in the same order as
> > +    objects appear in the index file (e.g., the first entry in the mtime
> > +    table corresponds to the object with the lowest lexically-sorted
> > +    oid). The mtimes count standard epoch seconds.
>
> This paragraph seemed awkward. Here is a rephrasing that might be
> less awkward:
>
>  - A table of 4-byte unsigned integers in network order. The ith value
>    is the modified time (mtime) of the ith object of the corresponding
>    pack in lexicographic order. The mtime represents standard epoch
>    seconds.

Thanks, this is clearer. I went with a blend of the two:

    - A table of 4-byte unsigned integers in network order. The ith
      value is the modification time (mtime) of the ith object in the
      corresponding pack by lexicographic (index) order. The mtimes
      count standard epoch seconds.

> Storing these mtimes in 32-bits means we will hit the 2038 problem.
> The commit-graph stores commit times with an extra two bits to extend
> the lifetime by another hundred years or so.
>
> Could we extend the lifetime of cruft packs by decreasing the granularity
> here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> should be enough granularity for these purposes.

Perhaps, though it does add some complexity to the code that deals with
this format at the expense of some future-proofing. I'm open to it,
though.

>
> > +  - A trailer, containing a:
> > +
> > +    checksum of the corresponding packfile, and
> > +
> > +    a checksum of all of the above.
>
> Could you specify the checksum as having length according to the
> specified hash function?

Great suggestion, thanks.

> > +All 4-byte numbers are in network order.
> > +
>
> Maybe this could be at the start of the format, since the file
> version and hash function are both 4-byte numbers here and we
> could remove the mention of network order from the mtime values.

This is copy-and-pasted from the .rev section above, where I think I
added the "All 4-byte numbers are in network order" bit at the end in
response to a suggestion opposite yours ;).

Here I would probably rather stay consistent with the surrounding
sections.

> > +static char *pack_mtimes_filename(struct packed_git *p)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(p->pack_name, ".pack", &len))
> > +		BUG("pack_name does not end in .pack");
> > +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> > +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> > +}
>
> I see your NEEDSWORK here and you are probably referring to this:
>
> static char *pack_revindex_filename(struct packed_git *p)
> {
> 	size_t len;
> 	if (!strip_suffix(p->pack_name, ".pack", &len))
> 		BUG("pack_name does not end in .pack");
> 	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
> }
>
> and the implementation is identical except for the new suffix
> (which exists in the exts[] array in builtin/repack.c, but could
> also be pulled out into a header somewhere).
>
> I'm happy to delay any cleanup of these code clones until later,
> if at all, because doing it right might mean moving more code
> than we like. Such refactorings aren't worth it most of the time.

Yeah, I think your thoughts matched my own when writing this. Which is
to say, I felt it prudent to call out that there is an opportunity to
DRY these two up, but I'm not convinced that such a clean up would be
worthwhile.

> > +static int load_pack_mtimes_file(char *mtimes_file,
> > +				 uint32_t num_objects,
> > +				 const uint32_t **data_p, size_t *len_p)
> > +{
>
> > +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> > +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
>
> This message could be more informative: "mtimes file %s has the wrong size"?

Copy-and-pasting here again from the corresponding code for the .rev
file, which is why I didn't opt to change the message here. Probably
many of these checks could be extracted out and shared between the two
paths, but I don't think we should attempt it here.

> > +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> > +
> > +	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
> > +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> > +		goto cleanup;
> > +	}
>
> Interesting that you defined 'struct mtimes_header' before this
> method, but don't use it here (in favor of moving a uint32_t
> pointer). Perhaps you are avoiding pointing the struct at the
> memory map, but you could also do this:
>
> 	struct mtimes_header header;
>
> 	header.signature = ntohl(hdr[0]);
> 	header.version = ntohl(hdr[1]);
> 	header.hash_id = ntohl(hdr[2]);
>
> And then operate on the struct for your validation.
>
> At the very least, 'struct mtimes_header' is defined but not
> used in this patch. If you decide to not use it this way, then
> maybe delay its definition.

Yeah, not reading directly out of the struct is intentional, since the
compiler is free to insert padding between these members, which would
break any subsequent reads out of the struct.

But I like your idea to assign the fields manually, thanks!
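
Something along these lines, as a rough sketch rather than the final
patch. The field names follow your snippet; read_be32() and
parse_mtimes_header() are placeholder names, and the byte-wise reads
are there so that nothing depends on struct layout or alignment:

```c
#include <stddef.h>
#include <stdint.h>

#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */

struct mtimes_header {
	uint32_t signature;
	uint32_t version;
	uint32_t hash_id;
};

/* Read a 32-bit big-endian (network order) value from `p`. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Fill `out` field by field from the first 12 bytes of `buf`, then
 * validate it. Returns 0 on success, or -1 if the buffer is too short
 * or a field has an unexpected value.
 */
static int parse_mtimes_header(const unsigned char *buf, size_t len,
			       struct mtimes_header *out)
{
	if (len < 12)
		return -1;

	out->signature = read_be32(buf);
	out->version = read_be32(buf + 4);
	out->hash_id = read_be32(buf + 8);

	if (out->signature != MTIMES_SIGNATURE)
		return -1;
	if (out->version != 1)
		return -1;
	if (out->hash_id != 1 && out->hash_id != 2)
		return -1;
	return 0;
}
```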

> > +int load_pack_mtimes(struct packed_git *p)
> > +{
> > +	char *mtimes_name = NULL;
> > +	int ret = 0;
> > +
> > +	if (!p->is_cruft)
> > +		return ret; /* not a cruft pack */
>
> Interesting that this indicator is essentially "we have an mtimes
> file for this pack", but it makes sense to include that check next
> to the .keep and .promisor checks.

I think I had originally called it "mtimes" but changed it to "cruft",
since it makes sense as a prefix similar to the others (that is, "keep
pack", "promisor pack", and "cruft pack", not "mtimes pack").

> The hunks I did not comment on look good. Nice standard file format
> stuff.

Thanks for your review!

Thanks,
Taylor

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-12-02 15:22   ` Derrick Stolee
@ 2021-12-03 22:40     ` Taylor Blau
  2021-12-06 17:33       ` Derrick Stolee
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 22:40 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 10:22:05AM -0500, Derrick Stolee wrote:
> I notice that you don't use this in load_pack_mtimes_file(),
> in pack-mtimes.c but you could at this point.

Hmm, I'm confused. The extracted function converts a pointer to a struct
git_hash_algo into a uint32_t, but here we just care about reading the
four-byte value we wrote.

Thanks,
Taylor

* Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
  2021-12-02 15:36   ` Derrick Stolee
@ 2021-12-03 23:04     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-12-03 23:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Thu, Dec 02, 2021 at 10:36:16AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > @@ -168,6 +168,9 @@ struct packing_data {
> >  	/* delta islands */
> >  	unsigned int *tree_depth;
> >  	unsigned char *layer;
> > +
> > +	/* cruft packs */
> > +	uint32_t *cruft_mtime;
>
> This comment is a bit terse. Perhaps...
>
> 	/* Used when writing cruft packs. */

Sure; here I was imitating the terseness of the "delta islands" comment
a few lines above. But I don't mind changing it here.

> > +static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
> > +				      struct object_entry *e)
> > +{
> > +	if (!pack->cruft_mtime)
> > +		return 0;
> > +	return pack->cruft_mtime[e - pack->objects];
> > +}
>
> When writing a pack, it appears that the cruft_mtime array
> maps to objects in pack-order, not idx-order, correct? That
> might be worth mentioning in the struct definition because
> it differs from the .mtimes file.

Great observation and suggestion, thank you! The comment that I
ultimately settled on is:

  /*
   * Used when writing cruft packs.
   *
   * Object mtimes are stored in pack order when writing, but
   * written out in lexicographic (index) order.
   */
  uint32_t *cruft_mtime;

> > +static void write_mtimes_objects(struct hashfile *f,
> > +				 struct packing_data *to_pack,
> > +				 struct pack_idx_entry **objects,
> > +				 uint32_t nr_objects)
> > +{
> > +	uint32_t i;
> > +	for (i = 0; i < nr_objects; i++) {
> > +		struct object_entry *e = (struct object_entry*)objects[i];
> > +		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
> > +	}
>
> The name "objects" here confused me at first, thinking it
> corresponded to the objects member of 'struct packing_data', but
> that is being handled by the fact that 'objects' is actually a
> lex-sorted list of pack_idx_entry pointers (and they happen to
> also point to 'struct object_entry' values because the 'struct
> pack_idx_entry' is the first member).
>
> So this is (very densely) handling the translation from pack-order
> to lex-order through the double pointer 'objects'. I'm not sure if
> there is a way to make it more clear or if every reader will need
> to do the same mental gymnastics I had to do.

Exactly, and sorry that I didn't point this out more clearly. It's been
long enough since I wrote this code that I can sympathize with the
mental gymnastics required ;).
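
For readers following along, the translation discussed above can be
sketched in isolation. This is only an illustration under assumed names:
'struct idx_entry', 'struct obj_entry' and 'struct packing' are
simplified, hypothetical stand-ins for the real structures, and the
output goes to a plain array rather than a hashfile.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for 'struct pack_idx_entry'. */
struct idx_entry {
	unsigned char oid[4];	/* shortened object id, illustration only */
};

/* Hypothetical, simplified stand-in for 'struct object_entry'. */
struct obj_entry {
	struct idx_entry idx;	/* must stay the first member for the cast below */
	int other_state;
};

/* Hypothetical, simplified stand-in for 'struct packing_data'. */
struct packing {
	struct obj_entry *objects;	/* entries in pack order */
	uint32_t *cruft_mtime;		/* parallel array, also in pack order */
};

/* Look up an entry's mtime via its position in the pack-order array. */
static uint32_t entry_mtime(const struct packing *pack,
			    const struct obj_entry *e)
{
	if (!pack->cruft_mtime)
		return 0;
	return pack->cruft_mtime[e - pack->objects];
}

/*
 * Copy mtimes into 'out' following 'sorted', a list of pointers to the
 * idx member of each entry, ordered by object id (index order).
 */
static void mtimes_in_index_order(const struct packing *pack,
				  struct idx_entry **sorted,
				  uint32_t nr, uint32_t *out)
{
	uint32_t i;

	assert(pack && sorted && out);
	for (i = 0; i < nr; i++) {
		/* Valid only because 'idx' is the first member of obj_entry. */
		const struct obj_entry *e = (const struct obj_entry *)sorted[i];
		out[i] = entry_mtime(pack, e);
	}
}
```

The pointer subtraction recovers the pack-order position, while the
iteration order comes entirely from the sorted pointer list, which is
why no explicit permutation table is needed.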

> > +}
> > +
> > +static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
> > +{
> > +	hashwrite(f, hash, the_hash_algo->rawsz);
> > +}
> > +
> > +static const char *write_mtimes_file(const char *mtimes_name,
> > +				     struct packing_data *to_pack,
> > +				     struct pack_idx_entry **objects,
> > +				     uint32_t nr_objects,
> > +				     const unsigned char *hash)
> > +{
> > +	struct hashfile *f;
> > +	int fd;
> > +
> > +	if (!to_pack)
> > +		BUG("cannot call write_mtimes_file with NULL packing_data");
> > +
> > +	if (!mtimes_name) {
> > +		struct strbuf tmp_file = STRBUF_INIT;
> > +		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
> > +		mtimes_name = strbuf_detach(&tmp_file, NULL);
> > +	} else {
> > +		unlink(mtimes_name);
> > +		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> > +	}
> > +	f = hashfd(fd, mtimes_name);
> > +
> > +	write_mtimes_header(f);
> > +	write_mtimes_objects(f, to_pack, objects, nr_objects);
> > +	write_mtimes_trailer(f, hash);
> > +
> > +	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
> > +		die(_("failed to make %s readable"), mtimes_name);
>
> What could cause 'mtimes_name' to be NULL here? It seems that it would
> be initialized in the "if (!mtimes_name)" block above.

You're right, it's impossible for it to be NULL here. I'll remove the
redundant side of the &&-expression here.

> > +
> > +	finalize_hashfile(f, NULL,
> > +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
> > +
> > +	return mtimes_name;
>
> Note that you return the name here...
>
> > +	if (pack_idx_opts->flags & WRITE_MTIMES) {
> > +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
> > +						    nr_written,
> > +						    hash);
> > +		if (adjust_shared_perm(mtimes_tmp_name))
> > +			die_errno("unable to make temporary mtimes file readable");
>
> ...and then adjust the perms again. I think that this adjustment is
> redundant, because it already happened within the write_mtimes_file()
> method.

Yep, thanks. I'll clean it up here to just call adjust_shared_perm()
within write_mtimes_file().

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2021-12-02 14:33   ` Derrick Stolee
@ 2021-12-04 22:20   ` Elijah Newren
  2021-12-04 23:32     ` Taylor Blau
  1 sibling, 1 reply; 201+ messages in thread
From: Elijah Newren @ 2021-12-04 22:20 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Git Mailing List, Junio C Hamano, Lars Schneider, Jeff King,
	Theodore Tso

On Mon, Nov 29, 2021 at 7:29 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/Makefile                  |  1 +
>  Documentation/technical/cruft-packs.txt | 95 +++++++++++++++++++++++++
>  2 files changed, 96 insertions(+)
>  create mode 100644 Documentation/technical/cruft-packs.txt
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index ed656db2ae..0b01c9408e 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
>  TECH_DOCS += MyFirstObjectWalk
>  TECH_DOCS += SubmittingPatches
>  TECH_DOCS += technical/bundle-format
> +TECH_DOCS += technical/cruft-packs
>  TECH_DOCS += technical/hash-function-transition
>  TECH_DOCS += technical/http-protocol
>  TECH_DOCS += technical/index-format
> diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
> new file mode 100644
> index 0000000000..bb54cce1b1
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,95 @@
> += Cruft packs
> +
> +Cruft packs offer an alternative to Git's traditional mechanism of removing
> +unreachable objects. This document provides an overview of Git's pruning
> +mechanism, and how cruft packs can be used instead to accomplish the same.
> +
> +== Background
> +
> +To remove unreachable objects from your repository, Git offers `git repack -Ad`
> +(see linkgit:git-repack[1]). Quoting from the documentation:
> +
> +[quote]
> +[...] unreachable objects in a previous pack become loose, unpacked objects,
> +instead of being left in the old pack. [...] loose unreachable objects will be
> +pruned according to normal expiry rules with the next 'git gc' invocation.
> +
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose object and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes. If these unreachable objects were written into one big pack,
> +then either freshening that pack (because an object contained within it was
> +re-written) or creating a new pack of unreachable objects would cause the pack's
> +mtime to get updated, and the objects within it would never leave the expiration
> +window. Instead, objects are stored loose in order to keep track of the
> +individual object mtimes and avoid a situation where all cruft objects are
> +freshened at once.
> +
> +This can lead to undesirable situations when a repository contains many
> +unreachable objects which have not yet left the grace period. Having large
> +directories in the shards of `.git/objects` can lead to decreased performance in
> +the repository. But given enough unreachable objects, this can lead to inode
> +starvation and degrade the performance of the whole system. Since we
> +can never pack those objects, these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.
> +
> +== Cruft packs
> +
> +Cruft packs are designed to eliminate the need for storing unreachable objects
> +in a loose state by including the per-object mtimes in a separate file alongside
> +a single pack containing all loose objects.

I had the same question as Stolee here: why not use the cruft-pack's
mtime for all the objects in it?  Much later below, you make it clear
that a repository will generally only have one cruft pack which kind
of answers the question, but the repeated mention of "cruft packs"
throughout the document subtly made me make the opposite assumption.
It might be nice to address the almost-always-only-one-cruft-pack
earlier on, which may also help answer the question about why you need
to store individual mtimes in an additional file.

> +A cruft pack is written by `git repack --cruft` when generating a new pack.
> +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
> +is a classic all-into-one repack, meaning that everything in the resulting pack is
> +reachable, and everything else is unreachable. Once written, the `--cruft`
> +option instructs `git repack` to generate another pack containing only objects
> +not packed in the previous step (which equates to packing all unreachable
> +objects together). This progresses as follows:
> +
> +  1. Enumerate every object, marking any object which is (a) not contained in a
> +     kept-pack, and (b) whose mtime is within the grace period as a traversal
> +     tip.
> +
> +  2. Perform a reachability traversal based on the tips gathered in the previous
> +     step, adding every object along the way to the pack.
> +
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.
> +
> +This mode is invoked internally by linkgit:git-repack[1] when instructed to
> +write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
> +of packs which will not be deleted by the repack; in other words, they contain
> +all of the repository's reachable objects.
> +
> +When a repository already has a cruft pack, `git repack --cruft` typically only
> +adds objects to it. An exception to this is when `git repack` is given the
> +`--cruft-expiration` option, which allows the generated cruft pack to omit
> +expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
> +later on.
> +
> +It is linkgit:git-gc[1] that is typically responsible for removing expired
> +unreachable objects.
> +
> +== Alternatives
> +
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Whether cruft packs should be incremental or not.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Incremental cruft packs (i.e., where each time a repository is repacked a new
> +cruft pack is generated containing only the unreachable objects introduced since
> +the last time a cruft pack was written) are significantly more complicated to
> +construct, and so aren't pursued here. The obvious drawback to the current
> +implementation is that the entire cruft pack must be re-written from scratch.
> --
> 2.34.1.25.gb3157a20e6
>

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt
  2021-12-04 22:20   ` Elijah Newren
@ 2021-12-04 23:32     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2021-12-04 23:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Lars Schneider, Jeff King,
	Theodore Tso

On Sat, Dec 04, 2021 at 02:20:23PM -0800, Elijah Newren wrote:
> > +== Cruft packs
> > +
> > +Cruft packs are designed to eliminate the need for storing unreachable objects
> > +in a loose state by including the per-object mtimes in a separate file alongside
> > +a single pack containing all loose objects.
>
> I had the same question as Stolee here: why not use the cruft-pack's
> mtime for all the objects in it?  Much later below, you make it clear
> that a repository will generally only have one cruft pack which kind
> of answers the question, but the repeated mention of "cruft packs"
> throughout the document subtly made me make the opposite assumption.
> It might be nice to address the almost-always-only-one-cruft-pack
> earlier on, which may also help answer the question about why you need
> to store individual mtimes in an additional file.

Responding to your suggestions out of order ;-). Throughout the
document, I wrote "cruft packs" in the sense of "the feature this series
implements", not "multiple cruft packs".

But my wording is unintentionally vague, especially because this
document does talk about why this series stores unreachable objects in a
single cruft pack. I updated my copy to make clear the difference
between the two, which should hopefully avoid any confusion here in the
future.

As far as why not use the cruft pack's timestamp as the mtime for all of
the unreachable objects contained within it, there are a few reasons:

It makes freshening objects more complicated. Not because we couldn't
freshen individual objects (we would likely do so in the same way this
series does, by rewriting it loose and using the loose copy's mtime
instead), but because it makes it complicated to repack a repository
with many cruft packs. If I have a handful of cruft packs, and freshen a
handful of objects within them, I now need to update many cruft packs,
or pay the price of storing their objects twice (if I instead don't
rewrite them and keep the loose copies around).

It also makes it impossible to share deltas between cruft objects that
don't have the same timestamp, unless the cruft packs are stored thin
(in which case it becomes much more complicated to figure out which
cruft packs can be safely pruned without storing information about which
other packs a thin pack has deltas against).

I'm sure there were others, but these are the ones that I could recall
off the top of my head. This all felt like a little too much detail for
the "alternative designs" section, but if you think some or all of this
would be interesting to memorialize not just on the mailing list, let me
know.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2021-12-05 20:46   ` Junio C Hamano
  2022-03-01  2:00     ` Taylor Blau
  2021-12-07 15:38   ` Derrick Stolee
  1 sibling, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2021-12-05 20:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, larsxschneider, peff, tytso

Various thoughts on just this part, as the hunk got my attention
while merging with other topics in 'seen'.

> +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> +			die(_("--cruft and -A are incompatible"));
> +		if (keep_unreachable)
> +			die(_("--cruft and -k are incompatible"));
> +		if (!(pack_everything & ALL_INTO_ONE))
> +			die(_("--cruft must be combined with all-into-one"));
> +	}

The "reuse similar messages for i18n" topic will encourage us to
turn this part into:

	if (pack_everything & PACK_CRUFT && delete_redundant) {
		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-A");
		if (keep_unreachable)
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-k");
		if (!(pack_everything & ALL_INTO_ONE))
			die(_("--cruft must be combined with all-into-one"));
	}

The conditionals are a bit unpleasant to read and maintain, but I
guess we cannot help it?

Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
probably not know that it is the name the code gave to the bit that
is turned on when given an option externally known under a different
name (is that "-a"?).

If "--cruft" must be used with "all into one", I wonder if it makes
sense to make it imply that?  Not in the sense that OPT_BIT()
initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
after parse_options() returns, we check PACK_CRUFT and if it is on
turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
option, the user won't break us by giving "--no-all-into-one" after
they gave us "--cruft")?  I didn't think about this part thoroughly
enough, though.
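
A minimal sketch of that idea, assuming it is done as a post-parse
normalization step. The flag values and the helper name
normalize_pack_flags() are hypothetical and chosen for illustration;
this is not the code from the series.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define ALL_INTO_ONE       (1u << 0)
#define LOOSEN_UNREACHABLE (1u << 1)
#define PACK_CRUFT         (1u << 2)

/*
 * Run once after option parsing has finished: if a cruft pack was
 * requested, force all-into-one on, no matter in which order the
 * options (or their negations) appeared on the command line.
 */
static unsigned normalize_pack_flags(unsigned pack_everything)
{
	if (pack_everything & PACK_CRUFT)
		pack_everything |= ALL_INTO_ONE;

	/* Postcondition: a cruft request always implies all-into-one. */
	assert(!(pack_everything & PACK_CRUFT) ||
	       (pack_everything & ALL_INTO_ONE));
	return pack_everything;
}
```

Because the check runs after parsing rather than inside the option
callback, a later negated option cannot undo the implication.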

Thanks.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 04/17] chunk-format.h: extract oid_version()
  2021-12-03 22:40     ` Taylor Blau
@ 2021-12-06 17:33       ` Derrick Stolee
  0 siblings, 0 replies; 201+ messages in thread
From: Derrick Stolee @ 2021-12-06 17:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, larsxschneider, peff, tytso

On 12/3/21 5:40 PM, Taylor Blau wrote:
> On Thu, Dec 02, 2021 at 10:22:05AM -0500, Derrick Stolee wrote:
>> I notice that you don't use this in load_pack_mtimes_file(),
>> in pack-mtimes.c but you could at this point.
> 
> Hmm, I'm confused. The extracted function converts a pointer to a struct
> git_hash_algo into a uint32, but here we just care about reading the
> four byte value we wrote.

Ah. I got mixed up here. Sorry.

-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2021-12-06 21:16   ` Derrick Stolee
  2022-02-23 22:24     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-06 21:16 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> +static int dump_mtimes(struct packed_git *p)

nit: you return an int here so you can use it as an error code...

> +{
> +	uint32_t i;
> +	if (load_pack_mtimes(p) < 0)
> +		die("could not load pack .mtimes");
> +
> +	for (i = 0; i < p->num_objects; i++) {
> +		struct object_id oid;
> +		if (nth_packed_object_id(&oid, p, i) < 0)
> +			die("could not load object id at position %"PRIu32, i);
> +
> +		printf("%s %"PRIu32"\n",
> +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
> +	}
> +
> +	return 0;

But always return 0 unless you die().

> +	return p ? dump_mtimes(p) : 1;

It makes this line concise, I suppose.

Perhaps just use "return dump_mtimes(p)" and have dump_mtimes()
return 1 if the given pack is NULL?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2021-12-06 21:44   ` Derrick Stolee
  2022-03-01  2:48     ` Taylor Blau
  2021-12-07 15:17   ` Derrick Stolee
  1 sibling, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-06 21:44 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> Generating a non-expiring cruft packs works as follows:

I had trouble parsing the documentation changes below, so I came back
to this commit message to see if that helps.
 
>   - Callers provide a list of every pack they know about, and indicate
>     which packs are about to be removed.

This corresponds to the list over stdin.
 
>   - All packs which are going to be removed (we'll call these the
>     redundant ones) are marked as kept in-core, as well as any packs
>     that `pack-objects` found but the caller did not specify.

Ok, so as an implementation detail we mark these as keep packs.

>     These packs are presumed to have entered the repository between
>     the caller collecting packs and invoking `pack-objects`. Since we
>     do not want to include objects in these packs (because we don't know
>     which of their objects are or aren't reachable), these are also
>     marked as kept in-core.

Here, "are presumed" is doing a lot of work. Theoretically, there could
be three categories:

1. This pack was just repacked and will be removed because all of its
   objects were placed into new packs.

2. Either this pack was repacked and contains important reachable objects
   OR we did a repack of reachable objects and this pack contained some
   extra, unreachable objects.

3. This pack was added to the repository while creating those repacked
   packs from category 2, so we don't know if things are reachable or
   not.

So, the packs that we discover on-disk but are not specified over stdin
are in this third category, but these are grouped with category 1 as we
will treat them the same.

>   - Then, we enumerate all objects in the repository, and add them to
>     our packing list if they do not appear in an in-core kept pack.

Here, we are looking at all of the objects in category 2 as well as
loose objects.

> This results in a new cruft pack which contains all known objects that
> aren't included in the kept packs. When the kept pack is the result of
> `git repack -A`, the resulting pack contains all unreachable objects.

This now describes how 'git repack' will interface with this new change
to pack-objects. I'll keep an eye out for that.

> +--cruft::

Now getting to this description.

> +	Packs unreachable objects into a separate "cruft" pack, denoted
> +	by the existence of a `.mtimes` file. Pack names provided over
> +	stdin indicate which packs will remain after a `git repack`.
> +	Pack names prefixed with a `-` indicate those which will be
> +	removed. (...)

This description is too tied to 'git repack'. Can we describe the
input using terms independent of the 'git repack' operation? I need
to keep reading.
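
Independent of 'git repack', the stdin convention in the quoted text
(a plain pack name versus one prefixed with `-`) can be sketched as a
small classifier. The enum and function names are hypothetical and
only illustrate the line format, not how pack-objects parses it.

```c
#include <assert.h>

/* Hypothetical labels for the two kinds of input lines. */
enum pack_line_kind {
	PACK_LINE_REMAINS,	/* plain name: the pack will remain */
	PACK_LINE_REMOVED	/* leading '-': the pack will be removed */
};

/*
 * Classify one input line and point '*name' at the pack name itself,
 * with any leading '-' skipped.
 */
static enum pack_line_kind classify_pack_line(const char *line,
					      const char **name)
{
	assert(line && name);
	if (*line == '-') {
		*name = line + 1;
		return PACK_LINE_REMOVED;
	}
	*name = line;
	return PACK_LINE_REMAINS;
}
```

Packs that appear on disk but on neither kind of line fall outside
this sketch; they are the "presumed" case discussed above.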

> (...) The contents of the cruft pack are all objects not
> +	contained in the surviving packs specified by `--keep-pack`)

Now you use --keep-pack, which is a way of specifying a pack as
"in-core keep" which was not in your commit message. Here, we also
don't link the packs over stdin to the concept of keep packs.

> +	which have not exceeded the grace period (see
> +	`--cruft-expiration` below), or which have exceeded the grace
> +	period, but are reachable from an other object which hasn't.

And now we think about the grace period! There is so much going on
that I need to break it down to understand.

  An object is _excluded_ from the new cruft pack if

  1. It is reachable from at least one reference.
  2. It is in a pack from stdin prefixed with "-"
  3. It is in a pack specified by `--keep-pack`
  4. It is in an existing cruft pack and the .mtimes file states
     that its mtime is at least as recent as the time specified by
     the --cruft-expiration option.

Breaking it down into a list like this helps me, at least. I'm not
sure what the best way would look like.

(Needing to pause here and look at the implementation later.)

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
  2021-12-06 21:44   ` Derrick Stolee
@ 2021-12-07 15:17   ` Derrick Stolee
  2022-02-23 23:34     ` Taylor Blau
  1 sibling, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:17 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
> +				  struct packed_git *pack, off_t offset,
> +				  const char *name, uint32_t mtime)
> +{
> +	struct object_entry *entry;
> +
> +	display_progress(progress_state, ++nr_seen);

I don't love the global nr_seen here, but it is pervasive through the
file. OK.

> +	entry = packlist_find(&to_pack, oid);
> +	if (entry) {
> +		if (name) {
> +			entry->hash = pack_name_hash(name);
> +			entry->no_try_delta = name && no_try_delta(name);

This is already in an "if (name)" block, so "name &&" isn't needed.

> +		}
> +	} else {
> +		if (!want_object_in_pack(oid, 0, &pack, &offset))
> +			return 0;
> +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> +			/*
> +			 * If a traversed tree has a missing blob then we want
> +			 * to avoid adding that missing object to our pack.
> +			 *
> +			 * This only applies to missing blobs, not trees,
> +			 * because the traversal needs to parse sub-trees but
> +			 * not blobs.
> +			 *
> +			 * Note we only perform this check when we couldn't
> +			 * already find the object in a pack, so we're really
> +			 * limited to "ensure non-tip blobs which don't exist in
> +			 * packs do exist via loose objects". Confused?
> +			 */
> +			return 0;
> +		}
> +
> +		entry = create_object_entry(oid, type, pack_name_hash(name),
> +					    0, name && no_try_delta(name),
> +					    pack, offset);
> +	}
> +
> +	if (mtime > oe_cruft_mtime(&to_pack, entry))
> +		oe_set_cruft_mtime(&to_pack, entry, mtime);
> +	return 1;

I was confused at this "return 1" here, while other cases return 0.

It turns out that there are multiple methods in this file that have
different semantics: add_loose_object() and add_object_entry_from_pack()
are both called from iterators where "return 1" means "stop iterating"
so they return 0 always. add_object_entry_from_bitmap() is used to
iterate over a bitmap and "return 1" means "include this object".

However, the return code for add_cruft_object_entry() is never used,
so it should probably return void or swap the meanings to have nonzero
mean an error occurred.
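
To make the iterator convention concrete, here is a generic sketch of
the "nonzero means stop iterating" contract. for_each_value() and
count_until_negative() are made-up names for illustration and do not
correspond to any function in the file under review.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*each_value_fn)(int value, void *data);

/*
 * Call 'fn' on each value. A nonzero return from 'fn' stops the
 * iteration and is passed back to the caller.
 */
static int for_each_value(const int *values, size_t nr,
			  each_value_fn fn, void *data)
{
	size_t i;

	assert(fn);
	for (i = 0; i < nr; i++) {
		int ret = fn(values[i], data);
		if (ret)
			return ret;	/* callback asked us to stop */
	}
	return 0;
}

/* Example callback: count values until the first negative one. */
static int count_until_negative(int value, void *data)
{
	size_t *count = data;

	if (value < 0)
		return 1;	/* stop iterating */
	(*count)++;
	return 0;		/* keep going */
}
```

A callback used this way cannot also use its return value to mean
"object was added", which is where the mismatch noted above comes from.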

> +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> +{
> +	struct string_list_item *item = NULL;
> +	for_each_string_list_item(item, packs) {
> +		struct packed_git *p = item->util;
> +		if (!p)
> +			die(_("could not find pack '%s'"), item->string);

Interesting that this is a potential issue. We are expecting the pack
to be loaded before we get here. Is this more because some packs might
not actually load, but it's fine as long as we don't mark them as kept?

> +		p->pack_keep_in_core = keep;
> +	}
> +}
...
> +static void read_cruft_objects(void)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	struct string_list discard_packs = STRING_LIST_INIT_DUP;
> +	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
> +	struct packed_git *p;
> +
> +	ignore_packed_keep_in_core = 1;

Here is a global that we are suddenly changing. Should we not be
returning it to its initial state when this method is complete?
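
One way to do that is a plain save-and-restore around the work that
needs the flag. The sketch below uses hypothetical names throughout
(the flag, enumerate_objects() and the fixed count are placeholders)
and only shows the pattern, not the actual function.

```c
#include <assert.h>

/* Hypothetical stand-in for a file-scope flag such as the one above. */
static int ignore_kept_flag;

/* Hypothetical enumeration step that is only valid with the flag set. */
static int enumerate_objects(void)
{
	assert(ignore_kept_flag);
	return 42;	/* pretend 42 objects were seen */
}

/* Set the flag for the duration of the call, then put it back. */
static int read_objects_scoped(void)
{
	int saved = ignore_kept_flag;
	int nr;

	ignore_kept_flag = 1;
	nr = enumerate_objects();
	ignore_kept_flag = saved;	/* restore for any later caller */
	return nr;
}
```

Whether restoring matters in practice depends on whether anything reads
the flag after this point in the same process.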

> +static int option_parse_cruft_expiration(const struct option *opt,
> +					 const char *arg, int unset)
> +{
> +	if (unset) {
> +		cruft = 0;

This unassignment of 'cruft' when cruft-expiration is unset with
--no-cruft-expiration seems odd. I would expect

	git pack-objects --cruft --no-cruft-expiration

to still make a cruft pack, but not expire anything. It seems that
your code here makes --no-cruft-expiration disable the --cruft option.

> +		cruft_expiration = 0;
> +	} else {
> +		cruft = 1;
> +		if (arg)
> +			cruft_expiration = approxidate(arg);
> +	}
> +	return 0;
> +}
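
A sketch of the behavior asked for above, where the negated option
clears only the expiration and leaves the cruft setting alone. The
struct, the function signature, and the use of strtol() in place of
approxidate() are all assumptions made to keep the example
self-contained.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the two settings discussed above. */
struct cruft_opts {
	int cruft;		/* nonzero: write a cruft pack */
	long expiration;	/* 0: never expire anything */
};

/*
 * Callback-style parser. 'unset' is nonzero for the negated form of
 * the option; strtol() stands in for real date parsing.
 */
static int parse_cruft_expiration(struct cruft_opts *opts,
				  const char *arg, int unset)
{
	assert(opts);
	if (unset) {
		/* Drop the expiration, but do not touch opts->cruft. */
		opts->expiration = 0;
		return 0;
	}
	opts->cruft = 1;
	if (arg)
		opts->expiration = strtol(arg, NULL, 10);
	return 0;
}
```

With this shape, passing the cruft option followed by the negated
expiration option still yields a cruft pack that expires nothing.
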
...
> +		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
> +		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
> +		  N_("expire cruft objects older than <time>"),
> +		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),

> -static int has_loose_object(const struct object_id *oid)
> +int has_loose_object(const struct object_id *oid)
>  {
>  	return check_and_freshen(oid, 0);
>  }

I'm surprised this hasn't been modified to use a repository pointer.
Adding another caller here isn't too much debt, though.

> diff --git a/object-store.h b/object-store.h
> index d87481f101..a79c1c91ab 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r,
>   */
>  int has_loose_object_nonlocal(const struct object_id *);

Of course, here is another example that is already more widely used.

> +int has_loose_object(const struct object_id *);
> +
>  void assert_oid_type(const struct object_id *oid, enum object_type expect);

...

> +	test_expect_success "unreachable packed objects are packed (expire $expire)" '
> +		git init repo &&
> +		test_when_finished "rm -fr repo" &&
> +		(
> +			cd repo &&
> +
> +			test_commit packed &&
> +			git repack -Ad &&
> +			test_commit other &&
> +
> +			git rev-list --objects --no-object-names packed.. >objects &&
> +			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
> +			other="$(git pack-objects --delta-base-offset \
> +				$packdir/pack <objects)" &&
> +			git prune-packed &&
> +
> +			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&

I am missing how this test creates _unreachable_ objects. I would expect
removal of some refs or a 'git reset --hard' somewhere. What am I missing?

> +			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
> +			$keep
> +			-pack-$other.pack
> +			EOF
> +			)" &&
> +			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
> +
> +			cut -d" " -f2 <actual.raw | sort -u >actual &&
> +
> +			test_cmp expect actual
> +		)
> +	'
> +
> +	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '

I have the same question for all of the tests, really.

> +			# remove the unreachable tree, but leave the commit
> +			# which has it as its root tree in-tact

nit: "intact" is one word.

> +			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
> +
> +			git repack -Ad &&
> +			basename $(ls $packdir/pack-*.pack) >in &&
> +			git pack-objects --cruft --cruft-expiration="$expire" \
> +				$packdir/pack <in
> +		)
> +	'

...

> +basic_cruft_pack_tests never

I look forward to seeing how this changes with additional expiration values.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2021-12-07 15:30   ` Derrick Stolee
  2022-02-23 23:35     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:30 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
> +{
...
> +	/*
> +	 * Re-mark only the fresh packs as kept so that objects in
> +	 * unknown packs do not halt the reachability traversal early.
> +	 */
> +	for (p = get_all_packs(the_repository); p; p = p->next)
> +		p->pack_keep_in_core = 0;
> +	mark_pack_kept_in_core(fresh_packs, 1);

Are we ever going to recover this pack_keep_in_core state? Should we
be saving it somewhere so we can return without mutating this state
permanently?

> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +	if (progress)
> +		progress_state = start_progress(_("Traversing cruft objects"), 0);
> +	nr_seen = 0;
> +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
> +
> +	stop_progress(&progress_state);
> +}
> +
>  static void read_cruft_objects(void)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> @@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
>  	mark_pack_kept_in_core(&discard_packs, 0);
>  
>  	if (cruft_expiration)
> -		die("--cruft-expiration not yet implemented");
> +		enumerate_and_traverse_cruft_objects(&fresh_packs);
>  	else
>  		enumerate_cruft_objects();

>  basic_cruft_pack_tests never
> +basic_cruft_pack_tests 2.weeks.ago

I'm surprised these tests didn't require any changes to adapt to the
new expiration date. But I suppose none of the mtimes were older than
two weeks ago?

I continue to miss something in these tests, because I don't see how
things are becoming unreachable.

Thanks,
-Stolee

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
  2021-12-05 20:46   ` Junio C Hamano
@ 2021-12-07 15:38   ` Derrick Stolee
  2022-02-23 23:37     ` Taylor Blau
  1 sibling, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2021-12-07 15:38 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: gitster, larsxschneider, peff, tytso

On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +static int write_cruft_pack(const struct pack_objects_args *args,
> +			    const char *pack_prefix,
> +			    struct string_list *names,
> +			    struct string_list *existing_packs,
> +			    struct string_list *existing_kept_packs)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf line = STRBUF_INIT;
> +	struct string_list_item *item;
> +	FILE *in, *out;
> +	int ret;
> +
> +	prepare_pack_objects(&cmd, args);
> +
> +	strvec_push(&cmd.args, "--cruft");
> +	if (cruft_expiration)
> +		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
> +			     cruft_expiration);
> +
> +	strvec_push(&cmd.args, "--honor-pack-keep");
> +	strvec_push(&cmd.args, "--non-empty");
> +	strvec_push(&cmd.args, "--max-pack-size=0");

This --max-pack-size is meaningless, right? The config that would change
this is already ignored by 'git pack-objects'.

> +		OPT_BIT(0, "cruft", &pack_everything,
> +				N_("same as -a, pack unreachable cruft objects separately"),
> +				   PACK_CRUFT | ALL_INTO_ONE),

I can understand the use of OPT_BIT here. Keep in mind that --no-cruft would
remove the '-a' option, if it already existed. Perhaps we should just use
OPT_BOOL and update to add the ALL_INTO_ONE if PACK_CRUFT exists?

> +		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
> +				N_("with -C, expire objects older than this")),

Here, --no-cruft-expiration will set cruft_expiration to NULL and not overwrite
the --cruft option, as expected. Just pointing out that this is different than
the option in 'git pack-objects'.

> --- a/t/t5327-pack-objects-cruft.sh
> +++ b/t/t5327-pack-objects-cruft.sh
> @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
>  	)
>  '
>  
> +test_expect_success 'repack --cruft generates a cruft pack' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit reachable &&
> +		git branch -M main &&
> +		git checkout --orphan other &&

Here is a way to make objects unreachable!

> +		test_commit unreachable &&
> +
> +		git checkout main &&
> +		git branch -D other &&
> +		git tag -d unreachable &&

Thanks,
-Stolee

* Re: [PATCH 02/17] pack-mtimes: support reading .mtimes files
  2021-12-03 22:24     ` Taylor Blau
@ 2022-01-07 19:41       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-01-07 19:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, gitster, larsxschneider, peff, tytso, brian m. carlson

On Fri, Dec 03, 2021 at 05:24:03PM -0500, Taylor Blau wrote:
> On Thu, Dec 02, 2021 at 10:06:07AM -0500, Derrick Stolee wrote:
>     - A table of 4-byte unsigned integers in network order. The ith
>       value is the modification time (mtime) of the ith object in the
>       corresponding pack by lexicographic (index) order. The mtimes
>       count standard epoch seconds.
>
> > Storing these mtimes in 32-bits means we will hit the 2038 problem.
> > The commit-graph stores commit times with an extra two bits to extend
> > the lifetime by another hundred years or so.
> >
> > Could we extend the lifetime of cruft packs by decreasing the granularity
> > here? Should 'mtime' store a number of _minutes_ instead of seconds? That
> > should be enough granularity for these purposes.
>
> Perhaps, though it does add some complexity to the code that deals with
> this format at the expense of some future-proofing. I'm open to it,
> though.

I still have quite a bit of review from this topic sitting in my inbox.

But this had been lingering on my mind, and I realized I said something
incorrect. 32-bit mtimes won't cause us to run into the "2038" problem,
since these aren't signed values. So storing epoch seconds in a uint32_t
should get us into the year 2106.

If anybody is still using cruft packs by then, I'll call this project a
wild success ;-). So in the meantime, I don't think it makes sense to
reduce the granularity and/or use extra bits to store the timestamps.
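As a rough check of the arithmetic above, here is a minimal sketch (not
part of the series; `approx_last_year()` and `SECONDS_PER_YEAR` are
made-up names, and the year length is the Gregorian average rather than
an exact calendar computation) that estimates the calendar year in which
a counter of epoch seconds runs out:

```c
#include <stdint.h>

/* Average Gregorian year: 365.2425 days of 86400 seconds each. */
#define SECONDS_PER_YEAR 31556952ULL

/*
 * Hypothetical helper: approximate calendar year in which a counter of
 * seconds since 1970 reaches 'max_seconds'.
 */
static unsigned approx_last_year(uint64_t max_seconds)
{
	return 1970 + (unsigned)(max_seconds / SECONDS_PER_YEAR);
}
```

Under these assumptions, passing INT32_MAX (the signed limit) lands in
2038, and passing UINT32_MAX (the limit for the unsigned values stored
in a .mtimes file) lands in 2106.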

Thanks,
Taylor

* Re: [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool
  2021-12-06 21:16   ` Derrick Stolee
@ 2022-02-23 22:24     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-02-23 22:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, tytso

On Mon, Dec 06, 2021 at 04:16:04PM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > +static int dump_mtimes(struct packed_git *p)
>
> nit: you return an int here so you can use it as an error code...
>
> > +{
> > +	uint32_t i;
> > +	if (load_pack_mtimes(p) < 0)
> > +		die("could not load pack .mtimes");
> > +
> > +	for (i = 0; i < p->num_objects; i++) {
> > +		struct object_id oid;
> > +		if (nth_packed_object_id(&oid, p, i) < 0)
> > +			die("could not load object id at position %"PRIu32, i);
> > +
> > +		printf("%s %"PRIu32"\n",
> > +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
> > +	}
> > +
> > +	return 0;
>
> But always return 0 unless you die().
>
> > +	return p ? dump_mtimes(p) : 1;
>
> It makes this line concise, I suppose.
>
> Perhaps just use "return dump_mtimes(p)" and have dump_mtimes()
> return 1 if the given pack is NULL?

I think just dying in the case we have a NULL pack is fine, and it
should be OK to lump it in the same case as "could not load pack .mtimes".

But we may want to catch the case a little earlier while we still have
the pack name handy. Perhaps something like this on top:

--- 8< ---
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
index b143f62520..f7b79daf4c 100644
--- a/t/helper/test-pack-mtimes.c
+++ b/t/helper/test-pack-mtimes.c
@@ -5,7 +5,7 @@
 #include "packfile.h"
 #include "pack-mtimes.h"

-static int dump_mtimes(struct packed_git *p)
+static void dump_mtimes(struct packed_git *p)
 {
 	uint32_t i;
 	if (load_pack_mtimes(p) < 0)
@@ -19,8 +19,6 @@ static int dump_mtimes(struct packed_git *p)
 		printf("%s %"PRIu32"\n",
 		       oid_to_hex(&oid), nth_packed_mtime(p, i));
 	}
-
-	return 0;
 }

 static const char *pack_mtimes_usage = "\n"
@@ -49,5 +47,10 @@ int cmd__pack_mtimes(int argc, const char **argv)

 	strbuf_release(&buf);

-	return p ? dump_mtimes(p) : 1;
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
 }
--- >8 ---

Thanks,
Taylor

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-12-07 15:17   ` Derrick Stolee
@ 2022-02-23 23:34     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-02-23 23:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Tue, Dec 07, 2021 at 10:17:28AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
> > +				  struct packed_git *pack, off_t offset,
> > +				  const char *name, uint32_t mtime)
> > +{
> > +	struct object_entry *entry;
> > +
> > +	display_progress(progress_state, ++nr_seen);
>
> I don't love the global nr_seen here, but it is pervasive through the
> file. OK.

Yeah; this is how all of the existing progress code works in
pack-objects.

> > +	entry = packlist_find(&to_pack, oid);
> > +	if (entry) {
> > +		if (name) {
> > +			entry->hash = pack_name_hash(name);
> > +			entry->no_try_delta = name && no_try_delta(name);
>
> This is already in an "if (name)" block, so "name &&" isn't needed.

Thanks; this is a copy-and-paste from add_object_entry(), where we
aren't in a conditional on "name". We could also fold the conditional on
whether or not name is NULL into no_try_delta itself, since all existing
calls look like "name && no_try_delta(name)".

So adding something like:

    if (!name)
      return 0;

to the beginning of no_try_delta()'s implementation would allow us to
get rid of the handful of "name &&"s. But I'm trying to avoid touching
other parts of pack-objects as much as I can, so I'll hold off for now.
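For illustration only, the folded-in check could look roughly like the
sketch below. This is not the real function: `no_try_delta_null_safe()`
and `name_has_no_delta_attr()` are hypothetical names, and the ".bin"
suffix test is just a placeholder for whatever lookup the real
no_try_delta() performs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical placeholder for the real lookup: here, pretend that only
 * paths ending in ".bin" opt out of delta compression.
 */
static int name_has_no_delta_attr(const char *name)
{
	const char *dot = strrchr(name, '.');
	return dot && !strcmp(dot, ".bin");
}

/*
 * NULL-safe variant: callers no longer need to spell
 * "name && no_try_delta(name)" themselves.
 */
static int no_try_delta_null_safe(const char *name)
{
	if (!name)
		return 0;
	return name_has_no_delta_attr(name);
}
```

With the guard inside the function, a caller holding a NULL name gets 0
back, which matches what the "name &&" short-circuit produces today.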

> > +		}
> > +	} else {
> > +		if (!want_object_in_pack(oid, 0, &pack, &offset))
> > +			return 0;
> > +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> > +			/*
> > +			 * If a traversed tree has a missing blob then we want
> > +			 * to avoid adding that missing object to our pack.
> > +			 *
> > +			 * This only applies to missing blobs, not trees,
> > +			 * because the traversal needs to parse sub-trees but
> > +			 * not blobs.
> > +			 *
> > +			 * Note we only perform this check when we couldn't
> > +			 * already find the object in a pack, so we're really
> > +			 * limited to "ensure non-tip blobs which don't exist in
> > +			 * packs do exist via loose objects". Confused?
> > +			 */
> > +			return 0;
> > +		}
> > +
> > +		entry = create_object_entry(oid, type, pack_name_hash(name),
> > +					    0, name && no_try_delta(name),
> > +					    pack, offset);
> > +	}
> > +
> > +	if (mtime > oe_cruft_mtime(&to_pack, entry))
> > +		oe_set_cruft_mtime(&to_pack, entry, mtime);
> > +	return 1;
>
> I was confused at this "return 1" here, while other cases return 0.
>
> It turns out that there are multiple methods in this file that have
> different semantics: add_loose_object() and add_object_entry_from_pack()
> are both called from iterators where "return 1" means "stop iterating"
> so they return 0 always. add_object_entry_from_bitmap() is used to
> iterate over a bitmap and "return 1" means "include this object".
>
> However, the return code for add_cruft_object_entry() is never used,
> so it should probably return void or swap the meanings to have nonzero
> mean an error occurred.

Yes, exactly. And thanks for tracing out both of the different
meanings/interpretations of these add_xyz_entry() functions. As you can
imagine, this implementation is copy-and-pasted from add_object_entry(),
which was specialized for this use here. At the time, I gave some effort
towards trying to share more code with add_object_entry() for this
special case, but it ended up being pretty awkward, hence the separate
implementation.

Ironically, add_object_entry()'s return code is also unused, so we could
probably clean that up, too. But like the above, I'll avoid it for now
in an effort to touch as little of pack-objects in this patch as I can.

> > +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> > +{
> > +	struct string_list_item *item = NULL;
> > +	for_each_string_list_item(item, packs) {
> > +		struct packed_git *p = item->util;
> > +		if (!p)
> > +			die(_("could not find pack '%s'"), item->string);
>
> Interesting that this is a potential issue. We are expecting the pack
> to be loaded before we get here. Is this more because some packs might
> not actually load, but it's fine as long as we don't mark them as kept?

Not quite "loaded" (though any pack structures that we look at by this
point will be fully "loaded"). Instead, we're making sure that all of
the pack names we read from stdin could be matched to packs that we
found in the repository (i.e., that we produce an appropriate error
message if we found "pack-does-not-exist.pack" on stdin).

This is all because we process input from stdin in two phases:

  - First, read all of the input into two string_lists, one for the
    packs we're about to discard (anything that starts with '-'), and
    another for all of the "fresh" packs (i.e., anything that we're not
    going to discard).

  - Then, loop through all of the packed_git structs we have, querying
    both of the aforementioned string lists for input that matches each
    pack's `pack_name` field, and setting the `->util` pointer of the
    matching string_list_item appropriately.

Following those two steps, any list entries that have a NULL util
pointer correspond with bogus input, so we want to call die() there.
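The two phases and the final NULL check can be sketched in standalone C
as follows. This is a simplified illustration, not the code from the
patch: `struct pack` and `struct item` are stand-ins for
`struct packed_git` and `struct string_list_item`, and the three helper
names are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for packed_git and string_list_item. */
struct pack {
	const char *pack_name;
};
struct item {
	const char *string;
	struct pack *util;
};

/*
 * Phase 1: classify one line of input. Returns 1 for a pack that is
 * about to be discarded ('-' prefix) and 0 for a fresh pack; '*name'
 * is set to the pack name without the prefix.
 */
static int classify_line(const char *line, const char **name)
{
	assert(line && name);
	if (*line == '-') {
		*name = line + 1;
		return 1;
	}
	*name = line;
	return 0;
}

/*
 * Phase 2: for every known pack, set the util pointer of each list
 * item whose string matches the pack's name.
 */
static void match_packs(struct item *items, size_t nr_items,
			struct pack *packs, size_t nr_packs)
{
	size_t i, j;
	for (i = 0; i < nr_packs; i++)
		for (j = 0; j < nr_items; j++)
			if (!strcmp(items[j].string, packs[i].pack_name))
				items[j].util = &packs[i];
}

/* Returns the first item that matched no pack (bogus input), or NULL. */
static struct item *first_unmatched(struct item *items, size_t nr_items)
{
	size_t j;
	for (j = 0; j < nr_items; j++)
		if (!items[j].util)
			return &items[j];
	return NULL;
}
```

In this sketch, an entry such as "pack-does-not-exist.pack" keeps its
NULL util pointer after match_packs() and is what first_unmatched()
returns; that is the point at which the real code would die().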

> > +		p->pack_keep_in_core = keep;
> > +	}
> > +}
> ...
> > +static void read_cruft_objects(void)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	struct string_list discard_packs = STRING_LIST_INIT_DUP;
> > +	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
> > +	struct packed_git *p;
> > +
> > +	ignore_packed_keep_in_core = 1;
>
> Here is a global that we are suddenly changing. Should we not be
> returning it to its initial state when this method is complete?

We could, although it won't matter in practice, because we'll want to
keep that setting around for our traversal, after which point
pack-objects will exit.

> > +static int option_parse_cruft_expiration(const struct option *opt,
> > +					 const char *arg, int unset)
> > +{
> > +	if (unset) {
> > +		cruft = 0;
>
> This unassignment of 'cruft' when cruft-expiration is unset with
> --no-cruft-expiration seems odd. I would expect
>
> 	git pack-objects --cruft --no-cruft-expiration
>
> to still make a cruft pack, but not expire anything. It seems that
> your code here makes --no-cruft-expiration disable the --cruft option.

Hmm. I could see compelling reasoning that goes both ways. On the one
hand, `--no-cruft-expiration` (to me, at least) seems to imply "set
`--cruft-expiration` to 'never'". On the other hand, it also matches our
convention of `--no`-prefixed options to unset some value. This
implementation takes the latter approach, though we could easily change
it to set the cruft expiration to "never".

I don't have a strong opinion about which is better, so I'm happy to do
either if you have a better sense about which has more expected
behavior.

> > +		cruft_expiration = 0;
> > +	} else {
> > +		cruft = 1;
> > +		if (arg)
> > +			cruft_expiration = approxidate(arg);
> > +	}
> > +	return 0;
> > +}
> ..
> > +		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
> > +		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
> > +		  N_("expire cruft objects older than <time>"),
> > +		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
>
> > -static int has_loose_object(const struct object_id *oid)
> > +int has_loose_object(const struct object_id *oid)
> >  {
> >  	return check_and_freshen(oid, 0);
> >  }
>
> I'm surprised this hasn't been modified to use a repository pointer.
> Adding another caller here isn't too much debt, though.

Yeah, check_and_freshen() doesn't have a variant that takes a
repository pointer. Good #leftoverbits, I guess!

> > +int has_loose_object(const struct object_id *);
> > +
> >  void assert_oid_type(const struct object_id *oid, enum object_type expect);
>
> ...
>
> > +	test_expect_success "unreachable packed objects are packed (expire $expire)" '
> > +		git init repo &&
> > +		test_when_finished "rm -fr repo" &&
> > +		(
> > +			cd repo &&
> > +
> > +			test_commit packed &&
> > +			git repack -Ad &&
> > +			test_commit other &&
> > +
> > +			git rev-list --objects --no-object-names packed.. >objects &&
> > +			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
> > +			other="$(git pack-objects --delta-base-offset \
> > +				$packdir/pack <objects)" &&
> > +			git prune-packed &&
> > +
> > +			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
>
> I am missing how this test creates _unreachable_ objects. I would expect removal of
> some refs or a 'git reset --hard' somewhere. What am I missing?

For this and the other tests the so-called "unreachable" objects are
technically reachable, but we can treat them as unreachable by putting
them in the "discard" packs list (or by not mentioning them at all to
`git pack-objects --cruft`).

> > +			# remove the unreachable tree, but leave the commit
> > +			# which has it as its root tree in-tact
>
> nit: "intact" is one word.

Thanks; fixed here and in the other test which was added by this commit.

Thanks,
Taylor

* Re: [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration
  2021-12-07 15:30   ` Derrick Stolee
@ 2022-02-23 23:35     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-02-23 23:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Tue, Dec 07, 2021 at 10:30:52AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
>
> > +static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
> > +{
> ...
> > +	/*
> > +	 * Re-mark only the fresh packs as kept so that objects in
> > +	 * unknown packs do not halt the reachability traversal early.
> > +	 */
> > +	for (p = get_all_packs(the_repository); p; p = p->next)
> > +		p->pack_keep_in_core = 0;
> > +	mark_pack_kept_in_core(fresh_packs, 1);
>
> Are we ever going to recover this pack_keep_in_core state? Should we
> be saving it somewhere so we can return without mutating this state
> permanently?

In the same sense that we are free to modify the global
ignore_packed_keep_in_core variable (because we only stop caring about
the modified state right before the program is about to exit) we can
freely mutate these variables, too.

> > +	if (prepare_revision_walk(&revs))
> > +		die(_("revision walk setup failed"));
> > +	if (progress)
> > +		progress_state = start_progress(_("Traversing cruft objects"), 0);
> > +	nr_seen = 0;
> > +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
> > +
> > +	stop_progress(&progress_state);
> > +}
> > +
> >  static void read_cruft_objects(void)
> >  {
> >  	struct strbuf buf = STRBUF_INIT;
> > @@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
> >  	mark_pack_kept_in_core(&discard_packs, 0);
> >
> >  	if (cruft_expiration)
> > -		die("--cruft-expiration not yet implemented");
> > +		enumerate_and_traverse_cruft_objects(&fresh_packs);
> >  	else
> >  		enumerate_cruft_objects();
>
> >  basic_cruft_pack_tests never
> > +basic_cruft_pack_tests 2.weeks.ago
>
> I'm surprised these tests didn't require any changes to adapt to the
> new expiration date. But I suppose none of the mtimes were older than
> two weeks ago?

Exactly.

Thanks,
Taylor

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-12-07 15:38   ` Derrick Stolee
@ 2022-02-23 23:37     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-02-23 23:37 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

(Jumping forward a little bit while responding to your review to finish
my train of thought before I log off for today...)

On Tue, Dec 07, 2021 at 10:38:05AM -0500, Derrick Stolee wrote:
> > --- a/t/t5327-pack-objects-cruft.sh
> > +++ b/t/t5327-pack-objects-cruft.sh
> > @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
> >  	)
> >  '
> >
> > +test_expect_success 'repack --cruft generates a cruft pack' '
> > +	git init repo &&
> > +	test_when_finished "rm -fr repo" &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit reachable &&
> > +		git branch -M main &&
> > +		git checkout --orphan other &&
>
> Here is a way to make objects unreachable!

Yes, indeed. And this is the first spot where we *need* to care about
object reachability, because the set of packs that `git repack` passes
over stdin to `git pack-objects --cruft` depends on which objects are
and aren't reachable.

In the tests that exercise `pack-objects --cruft` directly, we can
pretend that certain packs contain only unreachable objects by marking
them as "discarded".

Thanks,
Taylor

* Re: [PATCH 12/17] builtin/repack.c: support generating a cruft pack
  2021-12-05 20:46   ` Junio C Hamano
@ 2022-03-01  2:00     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-01  2:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, larsxschneider, peff, tytso

On Sun, Dec 05, 2021 at 12:46:19PM -0800, Junio C Hamano wrote:
> Various thoughts on just this part, as the hunk got my attention
> while merging with other topics in 'seen'.
>
> > +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> > +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> > +			die(_("--cruft and -A are incompatible"));
> > +		if (keep_unreachable)
> > +			die(_("--cruft and -k are incompatible"));
> > +		if (!(pack_everything & ALL_INTO_ONE))
> > +			die(_("--cruft must be combined with all-into-one"));
> > +	}
>
> The "reuse similar messages for i18n" topic will encourage us to
> turn this part into:
>
> 	if (pack_everything & PACK_CRUFT && delete_redundant) {
> 		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-A");
> 		if (keep_unreachable)
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-k");
> 		if (!(pack_everything & ALL_INTO_ONE))
> 			die(_("--cruft must be combined with all-into-one"));
> 	}

Thanks, done.

> The conditionals are a bit unpleasant to read and maintain, but I
> guess we cannot help it?

I don't know that I find them unpleasant to read, but perhaps they are a
hassle to maintain (as we add new, mutually-exclusive options). But I
can't seem to think of a better alternative...

> Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
> probably not know that it is the name the code gave to the bit that
> is turned on when given an option externally known under a different
> name (is that "-a"?).
>
> If "--cruft" must be used with "all into one", I wonder if it makes
> sense to make it imply that?  Not in the sense that OPT_BIT()
> initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
> after parse_options() returns, we check PACK_CRUFT and if it is on
> turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
> option, the user won't break us by giving "--no-all-into-one" after
> they gave us "--cruft")?  I didn't think about this part thoroughly
> enough, though.

Yes, `--cruft` must be used with an option that sets ALL_INTO_ONE. Since
we don't have any automatic '--no-' versions of single character
options, I think that this conditional is currently redundant, but I
agree that this code would break if we (a) removed the conditional
you're talking about and (b) allowed passing something like
`--no-all-into-one` which unsets the ALL_INTO_ONE bit.

So setting ALL_INTO_ONE ourselves _after_ option parsing is done makes
sense to me, thanks.
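A minimal sketch of that post-parse step, assuming the flags are plain
bits in `pack_everything` (the bit values below are chosen for
illustration, and `imply_all_into_one()` is a made-up helper name, not
something from builtin/repack.c):

```c
#include <assert.h>

/* Bit values assumed for this sketch. */
#define ALL_INTO_ONE       (1u << 0)
#define LOOSEN_UNREACHABLE (1u << 1)
#define PACK_CRUFT         (1u << 2)

/*
 * Meant to run after option parsing has finished: if --cruft was
 * requested, force all-into-one on, regardless of the order in which
 * the options (or any future negated forms) appeared.
 */
static unsigned imply_all_into_one(unsigned pack_everything)
{
	if (pack_everything & PACK_CRUFT)
		pack_everything |= ALL_INTO_ONE;
	/* Invariant for later checks: cruft never appears without all-into-one. */
	assert(!(pack_everything & PACK_CRUFT) ||
	       (pack_everything & ALL_INTO_ONE));
	return pack_everything;
}
```

Because the bit is set after parsing rather than by the option itself,
a later option that cleared ALL_INTO_ONE could not leave PACK_CRUFT set
on its own.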

Thanks,
Taylor

* Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration
  2021-12-06 21:44   ` Derrick Stolee
@ 2022-03-01  2:48     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-01  2:48 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, larsxschneider, peff, tytso

On Mon, Dec 06, 2021 at 04:44:31PM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > Generating a non-expiring cruft packs works as follows:
>
> I had trouble parsing the documentation changes below, so I came back
> to this commit message to see if that helps.
>
> >   - Callers provide a list of every pack they know about, and indicate
> >     which packs are about to be removed.
>
> This corresponds to the list over stdin.
>
> >   - All packs which are going to be removed (we'll call these the
> >     redundant ones) are marked as kept in-core, as well as any packs
> >     that `pack-objects` found but the caller did not specify.
>
> Ok, so as an implementation detail we mark these as keep packs.


> >     These packs are presumed to have entered the repository between
> >     the caller collecting packs and invoking `pack-objects`. Since we
> >     do not want to include objects in these packs (because we don't know
> >     which of their objects are or aren't reachable), these are also
> >     marked as kept in-core.
>
> Here, "are presumed" is doing a lot of work. Theoretically, there could
> be three categories:
>
> 1. This pack was just repacked and will be removed because all of its
>    objects were placed into new objects.
>
> 2. Either this pack was repacked and contains important reachable objects
>    OR we did a repack of reachable objects and this pack contained some
>    extra, unreachable objects.
>
> 3. This pack was added to the repository while creating those repacked
>    packs from category 2, so we don't know if things are reachable or
>    not.
>
> So, the packs that we discover on-disk but are not specified over stdin
> are in this third category, but these are grouped with category 1 as we
> will treat them the same.

Ah, I think I caused some unintentional confusion by attaching "are
presumed" to "these packs", when it wasn't clear that "these packs"
meant "ones that aren't listed over stdin".

Since the caller is supposed to provide a complete picture of the
repository as they see it, any packs known to the pack-objects process
that aren't mentioned over stdin are assumed to have entered the
repository after the caller was spun up.

I'll clarify this section of the commit message, since I agree it is
unnecessarily confusing.

> >   - Then, we enumerate all objects in the repository, and add them to
> >     our packing list if they do not appear in an in-core kept pack.
>
> Here, we are looking at all of the objects in category 2 as well as
> loose objects.

We're enumerating any objects that aren't in packs which are marked as
kept in-core (along with loose objects which don't appear in packs that
are marked as kept in-core).

The in-core kept packs are ones that the caller (and I find it's helpful
to read "the caller" as "git repack") has marked as "will delete". So
the pack(s) not kept in-core that we're looking at here contain all reachable
objects (e.g., like you would get with `git repack -A`).

> > +	Packs unreachable objects into a separate "cruft" pack, denoted
> > +	by the existence of a `.mtimes` file. Pack names provided over
> > +	stdin indicate which packs will remain after a `git repack`.
> > +	Pack names prefixed with a `-` indicate those which will be
> > +	removed. (...)
>
> This description is too tied to 'git repack'. Can we describe the
> input using terms independent of the 'git repack' operation? I need
> to keep reading.
>
> > (...) The contents of the cruft pack are all objects not
> > +	contained in the surviving packs specified by `--keep-pack`)
>
> Now you use --keep-pack, which is a way of specifying a pack as
> "in-core keep" which was not in your commit message. Here, we also
> don't link the packs over stdin to the concept of keep packs.

The mention of `--keep-pack` is a mistake left over from a previous
version; thanks for spotting. Here's a version of the first paragraph
from this piece of documentation which is less tied to `git repack` and
hopefully a little clearer:

    --cruft::
            Packs unreachable objects into a separate "cruft" pack, denoted
            by the existence of a `.mtimes` file. Typically used by `git
            repack --cruft`. Callers provide a list of pack names and
            indicate which packs will remain in the repository, along with
            which packs will be deleted (indicated by the `-` prefix). The
            contents of the cruft pack are all objects not contained in the
            surviving packs which have not exceeded the grace period (see
            `--cruft-expiration` below), or which have exceeded the grace
            period, but are reachable from another object which hasn't.

> > +	which have not exceeded the grace period (see
> > +	`--cruft-expiration` below), or which have exceeded the grace
> > +	period, but are reachable from an other object which hasn't.
>
> And now we think about the grace period! There is so much going on
> that I need to break it down to understand.
>
>   An object is _excluded_ from the new cruft pack if
>
>   1. It is reachable from at least one reference.
>   2. It is in a pack from stdin prefixed with "-"
>   3. It is in a pack specified by `--keep-pack`
>   4. It is in an existing cruft pack and the .mtimes file states
>      that its mtime is at least as recent as the time specified by
>      the --cruft-expiration option.
>
> Breaking it down into a list like this helps me, at least. I'm not
> sure what the best way would look like.

Given some expiration T, cruft packs contain all unreachable objects
which are newer than T, along with any cruft objects (i.e., those not
directly reachable from any ref) which are older than T, but reachable
from another cruft object newer than T.
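To make that rule concrete, here is a small standalone sketch of the
selection. It is an illustration under simplifying assumptions, not the
traversal from the patch: the array holds only objects already known to
be unreachable from any ref, edges are array indices, "newer than T" is
taken to mean a strictly greater mtime, and all names
(`struct cruft_obj`, `mark_with_reachable()`, `select_cruft()`) are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EDGES 4

/* Hypothetical model of one unreachable object. */
struct cruft_obj {
	uint32_t mtime;           /* last known modification time */
	size_t nr_edges;          /* number of objects this one refers to */
	size_t edges[MAX_EDGES];  /* indices of those objects */
	int in_pack;              /* set once selected for the cruft pack */
};

/* Select object 'i' and everything reachable from it. */
static void mark_with_reachable(struct cruft_obj *objs, size_t nr, size_t i)
{
	size_t e;
	assert(i < nr);
	if (objs[i].in_pack)
		return; /* already selected; also guards against cycles */
	objs[i].in_pack = 1;
	for (e = 0; e < objs[i].nr_edges; e++)
		mark_with_reachable(objs, nr, objs[i].edges[e]);
}

/*
 * Select every object newer than 'expiration', plus any older object
 * reachable from a newer one. Returns the number of selected objects.
 */
static size_t select_cruft(struct cruft_obj *objs, size_t nr,
			   uint32_t expiration)
{
	size_t i, count = 0;
	for (i = 0; i < nr; i++)
		objs[i].in_pack = 0;
	for (i = 0; i < nr; i++)
		if (objs[i].mtime > expiration)
			mark_with_reachable(objs, nr, i);
	for (i = 0; i < nr; i++)
		count += objs[i].in_pack;
	return count;
}
```

For example, with T = 100, an object with mtime 200 that points at an
object with mtime 50 pulls the older one into the cruft pack, while an
unrelated object with mtime 50 is left out.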

Thanks,
Taylor

* [PATCH v2 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (17 preceding siblings ...)
  2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
@ 2022-03-02  0:57 ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (17 more replies)
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                   ` (2 subsequent siblings)
  21 siblings, 18 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:57 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Here is a reroll of my series to implement "cruft packs", a pack which
stores accumulated unreachable objects, along with a new ".mtimes" file
which tracks each object's last known modification time.

This was on the list towards the end of 2021[1], and I have been
accumulating small changes to it locally for a couple of months now.
Major changes since last time include:

  - Clearer documentation and commit message(s) to better illustrate how
    the feature works and is supposed to be used.

  - Some minor documentation updates to pack-format.txt, which make some
    ambiguous details more explicit.

  - Minor code movement / tweaks to make things easier to read, ensure
    that functions aren't introduced in patches before they are used /
    etc.

  - Moved the new test script to t5328 (instead of t5327, which happens
    to be taken up by a new MIDX bitmap-related test), and purged it of
    all "rm -fr .git/logs" (replacing them with "git reflog expire
    --all --expire=all" instead).

  - A new test which fixes a bug where loose objects which have copies
    that appear in a cruft pack would not get accumulated when doing a
    `--geometric` repack.

For convenience, a range-diff is below. Thanks in advance for taking
another look!

[1]: https://lore.kernel.org/git/cover.1638224692.git.me@ttaylorr.com/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  97 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 183 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 129 +++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5328-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1810 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5328-pack-objects-cruft.sh

Range-diff against v1:
 1:  a9f7c738e0 !  1:  784ee7e0ee Documentation/technical: add cruft-packs.txt
    @@ Documentation/technical/cruft-packs.txt (new)
     @@
     += Cruft packs
     +
    -+Cruft packs offer an alternative to Git's traditional mechanism of removing
    -+unreachable objects. This document provides an overview of Git's pruning
    -+mechanism, and how cruft packs can be used instead to accomplish the same.
    ++The cruft packs feature offers an alternative to Git's traditional mechanism of
    ++removing unreachable objects. This document provides an overview of Git's
    ++pruning mechanism, and how a cruft pack can be used instead to accomplish the
    ++same.
     +
     +== Background
     +
    @@ Documentation/technical/cruft-packs.txt (new)
     +
     +== Cruft packs
     +
    -+Cruft packs are designed to eliminate the need for storing unreachable objects
    -+in a loose state by including the per-object mtimes in a separate file alongside
    -+a single pack containing all loose objects.
    ++A cruft pack eliminates the need for storing unreachable objects in a loose
    ++state by including the per-object mtimes in a separate file alongside a single
    ++pack containing all loose objects.
     +
     +A cruft pack is written by `git repack --cruft` when generating a new pack.
     +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
    @@ Documentation/technical/cruft-packs.txt (new)
     +Notable alternatives to this design include:
     +
     +  - The location of the per-object mtime data, and
    -+  - Whether cruft packs should be incremental or not.
    ++  - Storing unreachable objects in multiple cruft packs.
     +
     +On the location of mtime data, a new auxiliary file tied to the pack was chosen
     +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
     +support for optional chunks of data, it may make sense to consolidate the
     +`.mtimes` format into the `.idx` itself.
     +
    -+Incremental cruft packs (i.e., where each time a repository is repacked a new
    -+cruft pack is generated containing only the unreachable objects introduced since
    -+the last time a cruft pack was written) are significantly more complicated to
    -+construct, and so aren't pursued here. The obvious drawback to the current
    -+implementation is that the entire cruft pack must be re-written from scratch.
    ++Storing unreachable objects among multiple cruft packs (e.g., creating a new
    ++cruft pack during each repacking operation including only unreachable objects
    ++which aren't already stored in an earlier cruft pack) is significantly more
    ++complicated to construct, and so isn't pursued here. The obvious drawback to
    ++the current implementation is that the entire cruft pack must be re-written from
    ++scratch.
 2:  7d4ae7bd3e !  2:  101b34660c pack-mtimes: support reading .mtimes files
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
     +
     +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of mtimes (one per packed object, num_objects in total, each
    -+    a 4-byte unsigned integer in network order), in the same order as
    -+    objects appear in the index file (e.g., the first entry in the mtime
    -+    table corresponds to the object with the lowest lexically-sorted
    -+    oid). The mtimes count standard epoch seconds.
    ++  - A table of 4-byte unsigned integers in network order. The ith
    ++    value is the modification time (mtime) of the ith object in the
    ++    corresponding pack by lexicographic (index) order. The mtimes
    ++    count standard epoch seconds.
     +
    -+  - A trailer, containing a:
    -+
    -+    checksum of the corresponding packfile, and
    -+
    -+    a checksum of all of the above.
    ++  - A trailer, containing a checksum of the corresponding packfile,
    ++    and a checksum of all of the above (each having length according
    ++    to the specified hash function).
     +
     +All 4-byte numbers are in network order.
     +
    @@ pack-mtimes.c (new)
     +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
     +}
     +
    -+int pack_has_mtimes(struct packed_git *p)
    -+{
    -+	struct stat st;
    -+	char *fname = pack_mtimes_filename(p);
    -+
    -+	if (stat(fname, &st) < 0) {
    -+		if (errno == ENOENT)
    -+			return 0;
    -+		die_errno(_("could not stat %s"), fname);
    -+	}
    -+
    -+	free(fname);
    -+	return 1;
    -+}
    -+
     +#define MTIMES_HEADER_SIZE (12)
     +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
     +
    @@ pack-mtimes.c (new)
     +	struct stat st;
     +	void *data = NULL;
     +	size_t mtimes_size;
    ++	struct mtimes_header header;
     +	uint32_t *hdr;
     +
     +	fd = git_open(mtimes_file);
    @@ pack-mtimes.c (new)
     +
     +	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
     +
    -+	if (ntohl(*hdr) != MTIMES_SIGNATURE) {
    ++	header.signature = ntohl(hdr[0]);
    ++	header.version = ntohl(hdr[1]);
    ++	header.hash_id = ntohl(hdr[2]);
    ++
    ++	if (header.signature != MTIMES_SIGNATURE) {
     +		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
     +		goto cleanup;
     +	}
     +
    -+	if (ntohl(*++hdr) != 1) {
    ++	if (header.version != 1) {
     +		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.version);
     +		goto cleanup;
     +	}
    -+	hdr++;
    -+	if (!(ntohl(*hdr) == 1 || ntohl(*hdr) == 2)) {
    ++
    ++	if (!(header.hash_id == 1 || header.hash_id == 2)) {
     +		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
    -+			    mtimes_file, ntohl(*hdr));
    ++			    mtimes_file, header.hash_id);
     +		goto cleanup;
     +	}
     +
    @@ pack-mtimes.h (new)
     +
     +struct packed_git;
     +
    -+int pack_has_mtimes(struct packed_git *p);
     +int load_pack_mtimes(struct packed_git *p);
     +
     +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
    @@ pack-mtimes.h (new)
     +#endif
     
      ## packfile.c ##
    -@@ packfile.c: void close_pack_revindex(struct packed_git *p) {
    +@@ packfile.c: static void close_pack_revindex(struct packed_git *p)
      	p->revindex_data = NULL;
      }
      
    -+void close_pack_mtimes(struct packed_git *p) {
    ++static void close_pack_mtimes(struct packed_git *p)
    ++{
     +	if (!p->mtimes_map)
     +		return;
     +
    @@ packfile.c: static void prepare_pack(const char *full_name, size_t full_name_len
      		string_list_append(data->garbage, full_name);
      	else
      		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
    -
    - ## packfile.h ##
    -@@ packfile.h: uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
    - unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
    - void close_pack_windows(struct packed_git *);
    - void close_pack_revindex(struct packed_git *);
    -+void close_pack_mtimes(struct packed_git *p);
    - void close_pack(struct packed_git *);
    - void close_object_store(struct raw_object_store *o);
    - void unuse_pack(struct pack_window **);
 3:  7f4612e859 =  3:  a94d7dfeb3 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 4:  ea245b7216 =  4:  1e0ed363ae chunk-format.h: extract oid_version()
 5:  deece9eb70 !  5:  5236490688 pack-mtimes: support writing pack .mtimes files
    @@ pack-objects.h: struct packing_data {
      	unsigned int *tree_depth;
      	unsigned char *layer;
     +
    -+	/* cruft packs */
    ++	/*
    ++	 * Used when writing cruft packs.
    ++	 *
    ++	 * Object mtimes are stored in pack order when writing, but
    ++	 * written out in lexicographic (index) order.
    ++	 */
     +	uint32_t *cruft_mtime;
      };
      
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	hashwrite_be32(f, oid_version(the_hash_algo));
     +}
     +
    ++/*
    ++ * Writes the object mtimes of "objects" for use in a .mtimes file.
    ++ * Note that objects must be in lexicographic (index) order, which is
    ++ * the expected ordering of these values in the .mtimes file.
    ++ */
     +static void write_mtimes_objects(struct hashfile *f,
     +				 struct packing_data *to_pack,
     +				 struct pack_idx_entry **objects,
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	write_mtimes_objects(f, to_pack, objects, nr_objects);
     +	write_mtimes_trailer(f, hash);
     +
    -+	if (mtimes_name && adjust_shared_perm(mtimes_name) < 0)
    ++	if (adjust_shared_perm(mtimes_name) < 0)
     +		die(_("failed to make %s readable"), mtimes_name);
     +
     +	finalize_hashfile(f, NULL,
    @@ pack-write.c: void stage_tmp_packfiles(struct strbuf *name_buffer,
     +		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
     +						    nr_written,
     +						    hash);
    -+		if (adjust_shared_perm(mtimes_tmp_name))
    -+			die_errno("unable to make temporary mtimes file readable");
     +	}
     +
      	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 6:  e0a7b3b310 !  6:  78313bc441 t/helper: add 'pack-mtimes' test-tool
    @@ t/helper/test-pack-mtimes.c (new)
     +#include "packfile.h"
     +#include "pack-mtimes.h"
     +
    -+static int dump_mtimes(struct packed_git *p)
    ++static void dump_mtimes(struct packed_git *p)
     +{
     +	uint32_t i;
     +	if (load_pack_mtimes(p) < 0)
    @@ t/helper/test-pack-mtimes.c (new)
     +		printf("%s %"PRIu32"\n",
     +		       oid_to_hex(&oid), nth_packed_mtime(p, i));
     +	}
    -+
    -+	return 0;
     +}
     +
     +static const char *pack_mtimes_usage = "\n"
    @@ t/helper/test-pack-mtimes.c (new)
     +
     +	strbuf_release(&buf);
     +
    -+	return p ? dump_mtimes(p) : 1;
    ++	if (!p)
    ++		die("could not find pack '%s'", argv[1]);
    ++
    ++	dump_mtimes(p);
    ++
    ++	return 0;
     +}
     
      ## t/helper/test-tool.c ##
 7:  5710933127 =  7:  142098668d builtin/pack-objects.c: return from create_object_entry()
 8:  66165917a4 !  8:  2517a6be3d builtin/pack-objects.c: --cruft without expiration
    @@ Commit message
             which packs are about to be removed.
     
           - All packs which are going to be removed (we'll call these the
    -        redundant ones) are marked as kept in-core, as well as any packs
    -        that `pack-objects` found but the caller did not specify.
    +        redundant ones) are marked as kept in-core.
     
    -        These packs are presumed to have entered the repository between
    -        the caller collecting packs and invoking `pack-objects`. Since we
    -        do not want to include objects in these packs (because we don't know
    -        which of their objects are or aren't reachable), these are also
    -        marked as kept in-core.
    +        Any packs the caller did not mention (but are known to the
    +        `pack-objects` process) are also marked as kept in-core. Packs not
    +        mentioned by the caller are assumed to be unknown to them, i.e.,
    +        they entered the repository after the caller decided which packs
    +        should be kept and which should be discarded.
    +
    +        Since we do not want to include objects in these "unknown" packs
    +        (because we don't know which of their objects are or aren't
    +        reachable), these are also marked as kept in-core.
     
           - Then, we enumerate all objects in the repository, and add them to
             our packing list if they do not appear in an in-core kept pack.
    @@ Documentation/git-pack-objects.txt: SYNOPSIS
      	[--local] [--incremental] [--window=<n>] [--depth=<n>]
      	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
     +	[--cruft] [--cruft-expiration=<time>]
    - 	[--stdout [--filter=<filter-spec>] | base-name]
    - 	[--shallow] [--keep-true-parents] [--[no-]sparse] < object-list
    + 	[--stdout [--filter=<filter-spec>] | <base-name>]
    + 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
      
     @@ Documentation/git-pack-objects.txt: base-name::
      Incompatible with `--revs`, or options that imply `--revs` (such as
    @@ Documentation/git-pack-objects.txt: base-name::
      
     +--cruft::
     +	Packs unreachable objects into a separate "cruft" pack, denoted
    -+	by the existence of a `.mtimes` file. Pack names provided over
    -+	stdin indicate which packs will remain after a `git repack`.
    -+	Pack names prefixed with a `-` indicate those which will be
    -+	removed. The contents of the cruft pack are all objects not
    -+	contained in the surviving packs specified by `--keep-pack`)
    -+	which have not exceeded the grace period (see
    ++	by the existence of a `.mtimes` file. Typically used by `git
    ++	repack --cruft`. Callers provide a list of pack names and
    ++	indicate which packs will remain in the repository, along with
    ++	which packs will be deleted (indicated by the `-` prefix). The
    ++	contents of the cruft pack are all objects not contained in the
    ++	surviving packs which have not exceeded the grace period (see
     +	`--cruft-expiration` below), or which have exceeded the grace
     +	period, but are reachable from an other object which hasn't.
     ++
    ++When the input lists a pack containing all reachable objects (and lists
    ++all other packs as pending deletion), the corresponding cruft pack will
    ++contain all unreachable objects (with mtime newer than the
    ++`--cruft-expiration`) along with any unreachable objects whose mtime is
    ++older than the `--cruft-expiration`, but are reachable from an
    ++unreachable object whose mtime is newer than the `--cruft-expiration`.
    +++
     +Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
     +`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
     +options which imply `--revs`. Also incompatible with `--max-pack-size`;
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
      	string_list_clear(&exclude_packs, 0);
      }
      
    -+static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
    -+				  struct packed_git *pack, off_t offset,
    -+				  const char *name, uint32_t mtime)
    ++static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
    ++				   struct packed_git *pack, off_t offset,
    ++				   const char *name, uint32_t mtime)
     +{
     +	struct object_entry *entry;
     +
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +	if (entry) {
     +		if (name) {
     +			entry->hash = pack_name_hash(name);
    -+			entry->no_try_delta = name && no_try_delta(name);
    ++			entry->no_try_delta = no_try_delta(name);
     +		}
     +	} else {
     +		if (!want_object_in_pack(oid, 0, &pack, &offset))
    -+			return 0;
    ++			return;
     +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
     +			/*
     +			 * If a traversed tree has a missing blob then we want
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +			 * limited to "ensure non-tip blobs which don't exist in
     +			 * packs do exist via loose objects". Confused?
     +			 */
    -+			return 0;
    ++			return;
     +		}
     +
     +		entry = create_object_entry(oid, type, pack_name_hash(name),
    @@ builtin/pack-objects.c: static void read_packs_list_from_stdin(void)
     +
     +	if (mtime > oe_cruft_mtime(&to_pack, entry))
     +		oe_set_cruft_mtime(&to_pack, entry, mtime);
    -+	return 1;
    ++	return;
     +}
     +
     +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
      		read_packs_list_from_stdin();
      		if (rev_list_unpacked)
      			add_unreachable_loose_objects();
    --	} else if (!use_internal_rev_list)
    -+	} else if (cruft)
    ++	} else if (cruft) {
     +		read_cruft_objects();
    -+	else if (!use_internal_rev_list)
    + 	} else if (!use_internal_rev_list) {
      		read_object_list_from_stdin();
    - 	else {
    - 		get_object_list(rp.nr, rp.v);
    + 	} else {
     
      ## object-file.c ##
     @@ object-file.c: int has_loose_object_nonlocal(const struct object_id *oid)
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
      /*
     
    - ## t/t5327-pack-objects-cruft.sh (new) ##
    + ## t/t5328-pack-objects-cruft.sh (new) ##
     @@
     +#!/bin/sh
     +
    @@ t/t5327-pack-objects-cruft.sh (new)
     +
     +			git reset --hard reachable &&
     +			git tag -d cruft &&
    -+			rm -fr .git/logs &&
    ++			git reflog expire --all --expire=all &&
     +
     +			# remove the unreachable tree, but leave the commit
    -+			# which has it as its root tree in-tact
    ++			# which has it as its root tree intact
     +			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
     +
     +			git repack -Ad &&
    @@ t/t5327-pack-objects-cruft.sh (new)
     +
     +			git reset --hard reachable &&
     +			git tag -d cruft &&
    -+			rm -fr .git/logs &&
    ++			git reflog expire --all --expire=all &&
     +
     +			# remove the unreachable blob, but leave the commit (and
    -+			# the root tree of that commit) in-tact
    ++			# the root tree of that commit) intact
     +			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
     +
     +			git repack -Ad &&
 9:  02f7fce788 =  9:  6f0e84273f reachable: add options to add_unseen_recent_objects_to_traversal
10:  52e9ac5710 = 10:  a8bde361f9 reachable: report precise timestamps from objects in cruft packs
11:  37fda94785 ! 11:  d68ce28132 builtin/pack-objects.c: --cruft with expiration
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## builtin/pack-objects.c ##
    -@@ builtin/pack-objects.c: static int add_cruft_object_entry(const struct object_id *oid, enum object_type
    - 	return 1;
    +@@ builtin/pack-objects.c: static void add_cruft_object_entry(const struct object_id *oid, enum object_type
    + 	return;
      }
      
     +static void show_cruft_object(struct object *obj, const char *name, void *data)
    @@ builtin/pack-objects.c: static void read_cruft_objects(void)
      		enumerate_cruft_objects();
      
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: basic_cruft_pack_tests () {
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: basic_cruft_pack_tests () {
      }
      
      basic_cruft_pack_tests never
12:  a05675ab83 ! 12:  e5317cd472 builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c: static int write_midx_included_packs(struct string_list *inclu
      {
      	struct child_process cmd = CHILD_PROCESS_INIT;
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 	int show_progress = isatty(2);
    + 	int show_progress;
      
      	/* variables to be filled by option parsing */
     -	int pack_everything = 0;
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
     +		OPT_BIT(0, "cruft", &pack_everything,
     +				N_("same as -a, pack unreachable cruft objects separately"),
    -+				   PACK_CRUFT | ALL_INTO_ONE),
    ++				   PACK_CRUFT),
     +		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
     +				N_("with -C, expire objects older than this")),
      		OPT_BOOL('d', NULL, &delete_redundant,
      				N_("remove redundant packs, and run git-prune-packed")),
      		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 	if (keep_unreachable &&
      	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
    - 		die(_("--keep-unreachable and -A are incompatible"));
    -+	if (pack_everything & PACK_CRUFT && delete_redundant) {
    + 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
    + 
    ++	if (pack_everything & PACK_CRUFT) {
    ++		pack_everything |= ALL_INTO_ONE;
    ++
     +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
    -+			die(_("--cruft and -A are incompatible"));
    ++			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
     +		if (keep_unreachable)
    -+			die(_("--cruft and -k are incompatible"));
    -+		if (!(pack_everything & ALL_INTO_ONE))
    -+			die(_("--cruft must be combined with all-into-one"));
    ++			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
     +	}
    - 
    ++
      	if (write_bitmaps < 0) {
      		if (!write_midx &&
    + 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      	if (pack_everything & ALL_INTO_ONE) {
      		repack_promisor_objects(&po_args, &names);
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      			for_each_string_list_item(item, &names) {
      				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
      					     packtmp_name, item->string);
    -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 		return ret;
    - 
    - 	if (geometry) {
    -+		struct packed_git *p;
    - 		FILE *in = xfdopen(cmd.in, "w");
    - 		/*
    - 		 * The resulting pack should contain all objects in packs that
    -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
    - 		for (i = geometry->split; i < geometry->pack_nr; i++)
    - 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
    -+
    -+		for (p = get_all_packs(the_repository); p; p = p->next) {
    -+			if (!p->is_cruft)
    -+				continue;
    -+			fprintf(in, "^%s\n", pack_basename(p));
    -+		}
    - 		fclose(in);
    - 	}
    - 
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      	if (!names.nr && !po_args.quiet)
      		printf_ln(_("Nothing new to pack."));
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
      	}
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git branch -D other &&
     +		git tag -d unreachable &&
     +		# objects are not cruft if they are contained in the reflogs
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git rev-list --objects --all --no-object-names >reachable.raw &&
     +		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git checkout main &&
     +		git branch -D other &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft -d &&
     +
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git checkout main &&
     +		git branch -D other &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft &&
     +
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		test_cmp before after
     +	)
     +'
    ++
    ++test_expect_success 'repack --geometric collects once-cruft objects' '
    ++	git init repo &&
    ++	test_when_finished "rm -fr repo" &&
    ++	(
    ++		cd repo &&
    ++
    ++		test_commit reachable &&
    ++		git repack -Ad &&
    ++		git branch -M main &&
    ++
    ++		git checkout --orphan other &&
    ++		git rm -rf . &&
    ++		test_commit --no-tag cruft &&
    ++		cruft="$(git rev-parse HEAD)" &&
    ++
    ++		git checkout main &&
    ++		git branch -D other &&
    ++		git reflog expire --all --expire=all &&
    ++
    ++		# Pack the objects created in the previous step into a cruft
    ++		# pack. Intentionally leave loose copies of those objects
    ++		# around so we can pick them up in a subsequent --geometric
    ++		# repack.
    ++		git repack --cruft &&
    ++
    ++		# Now make those objects reachable, and ensure that they are
    ++		# packed into the new pack created via a --geometric repack.
    ++		git update-ref refs/heads/other $cruft &&
    ++
    ++		# Without this object, the set of unpacked objects is exactly
    ++		# the set of objects already in the cruft pack. Tweak that set
    ++		# to ensure we do not overwrite the cruft pack entirely.
    ++		test_commit reachable2 &&
    ++
    ++		find $packdir -name "pack-*.idx" | sort >before &&
    ++		git repack --geometric=2 -d &&
    ++		find $packdir -name "pack-*.idx" | sort >after &&
    ++
    ++		{
    ++			git rev-list --objects --no-object-names $cruft &&
    ++			git rev-list --objects --no-object-names reachable..reachable2
    ++		} >want.raw &&
    ++		sort want.raw >want &&
    ++
    ++		pack=$(comm -13 before after) &&
    ++		git show-index <$pack >objects.raw &&
    ++
    ++		cut -d" " -f2 objects.raw | sort >got &&
    ++
    ++		test_cmp want got
    ++	)
    ++'
    ++
     +test_expect_success 'cruft repack with no reachable objects' '
     +	git init repo &&
     +	test_when_finished "rm -fr repo" &&
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +
     +		git for-each-ref --format="delete %(refname)" >in &&
     +		git update-ref --stdin <in &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +		rm -fr .git/index &&
     +
     +		git repack --cruft -d &&
13:  0d2dfaa062 ! 13:  b548dbbf80 builtin/repack.c: allow configuring cruft pack generation
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				       &existing_kept_packs);
      		if (ret)
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
      	)
      '
      
14:  fd50c39657 = 14:  e6eee7f15c builtin/repack.c: use named flags for existing_packs
15:  b2937ceda7 ! 15:  b09dbc9fe5 builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      		for_each_string_list_item(item, existing_nonkept_packs) {
      			if ((uintptr_t)item->util & DELETE_PACK)
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreacha
     +
     +		git reset --hard $unreachable^ &&
     +		git tag -d cruft &&
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git repack --cruft -d &&
     +
16:  394de0199f ! 16:  7a21ae1494 builtin/gc.c: conditionally avoid pruning objects via loose
    @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      			if (quiet)
      				strvec_push(&prune, "--no-progress");
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
      	)
      '
      
    @@ t/t5327-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert
     +		git branch -D other &&
     +		git tag -d unreachable &&
     +		# objects are not cruft if they are contained in the reflogs
    -+		rm -fr .git/logs &&
    ++		git reflog expire --all --expire=all &&
     +
     +		git rev-list --objects --all --no-object-names >reachable.raw &&
     +		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
17:  99aace8e16 ! 17:  b729b80963 sha1-file.c: don't freshen cruft packs
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
      		return 1;
      	if (!freshen_file(e.p->pack_name))
     
    - ## t/t5327-pack-objects-cruft.sh ##
    -@@ t/t5327-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
    + ## t/t5328-pack-objects-cruft.sh ##
    +@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
      	)
      '
      
-- 
2.35.1.73.gccc5557600


* [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
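For illustration only (not part of the patch): a minimal standalone sketch of
the tip-selection rule from step 1 of the document added in this patch. The
struct and function names are hypothetical, and treating "within the grace
period" as "mtime at or after now minus the grace period" is an assumption
made for the example, not something the document spells out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an object during enumeration. */
struct sketch_object {
	bool in_kept_pack; /* stored in a pack that survives the repack */
	uint32_t mtime;    /* last known modification time, epoch seconds */
};

/*
 * An object seeds the reachability traversal when it is (a) outside
 * every kept pack and (b) recent enough to still be inside the grace
 * period (assumed here to mean mtime >= now - grace_period).
 */
bool sketch_is_traversal_tip(const struct sketch_object *obj,
			     uint32_t now, uint32_t grace_period)
{
	/* Clamp at zero so a very long grace period cannot underflow. */
	uint32_t cutoff = now > grace_period ? now - grace_period : 0;

	return !obj->in_kept_pack && obj->mtime >= cutoff;
}
```

Everything reachable from such a tip is then added to the cruft pack in step
2, so an old object survives as long as something recent still points at it.
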
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..2c3c5d93f8
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,97 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack via
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which is (a) not contained in a
+     kept-pack, and (b) whose mtime is within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.35.1.73.gccc5557600



* [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:22     ` Derrick Stolee
  2022-03-02  0:58   ` [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (15 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
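For illustration only (not part of the patch): a minimal standalone sketch of
reading the Nth timestamp out of an in-memory `.mtimes` buffer with the layout
described above (a 12-byte header, one 4-byte network-order value per object,
then a trailer of two hashes). `read_be32()` and `mtimes_nth()` are
hypothetical names used only here; the reader this patch adds is
`nth_packed_mtime()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_MTIMES_HEADER_SIZE 12        /* signature, version, hash id */

/* Decode a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Store the mtime of the object at index position "pos" in *out and
 * return 0, or return -1 if the buffer is malformed or "pos" is out of
 * range. "hash_len" is the raw hash size (20 for SHA-1, 32 for SHA-256).
 */
int mtimes_nth(const unsigned char *buf, size_t len, size_t hash_len,
	       uint32_t pos, uint32_t *out)
{
	size_t min_size = SKETCH_MTIMES_HEADER_SIZE + 2 * hash_len;
	size_t nr;

	assert(buf && out);
	if (len < min_size || (len - min_size) % 4)
		return -1; /* too small, or table is not a multiple of 4 */
	if (read_be32(buf) != SKETCH_MTIMES_SIGNATURE ||
	    read_be32(buf + 4) != 1)
		return -1; /* wrong magic or unsupported version */

	nr = (len - min_size) / 4;
	if (pos >= nr)
		return -1;

	*out = read_be32(buf + SKETCH_MTIMES_HEADER_SIZE + 4 * (size_t)pos);
	return 0;
}
```

Unlike this sketch, the code in the patch validates the file once at load time
and treats an out-of-range position as a BUG() rather than an error return.
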
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 129 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 186 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 6f0b4b775f..1b186f4fd7 100644
--- a/Makefile
+++ b/Makefile
@@ -959,6 +959,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a75..f908f7d5dd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 6f89482df0..9b227661f2 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..50caa34381
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,129 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+	if (ret)
+		goto cleanup;
+
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 178e611f09..385970cb7b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.35.1.73.gccc5557600



* [PATCH v2 04/17] chunk-format.h: extract oid_version()
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
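For illustration only (not part of the patch): a standalone sketch of the
widening mentioned above, where a helper returning `uint8_t` feeds a writer
that takes a 4-byte value. `sketch_hash_algo`, `sketch_oid_version()` and
`put_be32()` are hypothetical stand-ins for `the_hash_algo`, `oid_version()`
and `hashwrite_be32()`.

```c
#include <stdint.h>

enum sketch_hash_algo { SKETCH_SHA1, SKETCH_SHA256 };

/* Mirrors the consolidated helper: 1 for SHA-1, 2 for SHA-256. */
uint8_t sketch_oid_version(enum sketch_hash_algo algo)
{
	switch (algo) {
	case SKETCH_SHA1:
		return 1;
	case SKETCH_SHA256:
		return 2;
	}
	return 0; /* the helper in the patch die()s here instead */
}

/* Store "value" at "out" as four bytes in network order. */
void put_be32(unsigned char *out, uint32_t value)
{
	out[0] = (unsigned char)(value >> 24);
	out[1] = (unsigned char)(value >> 16);
	out[2] = (unsigned char)(value >> 8);
	out[3] = (unsigned char)value;
}
```

A call such as `put_be32(buf, sketch_oid_version(SKETCH_SHA256))` converts the
`uint8_t` result to `uint32_t` implicitly and leaves the bytes `00 00 00 02`
in `buf`, which is why the .rev caller needs no cast.
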
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 265c010122..f678d2c4a1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1911,7 +1899,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 865170bad0..65e670c5e2 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.35.1.73.gccc5557600



* [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing a `struct packing_data` whose
`cruft_mtime` array holds the timestamp of each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
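For illustration only (not part of the patch): a standalone sketch of the
side-array pattern that `oe_cruft_mtime()` and `oe_set_cruft_mtime()` follow
in this patch. The timestamps live in a lazily allocated array that parallels
the object array and is indexed by pointer difference, so callers that never
set an mtime pay nothing for it. All `sketch_*` names are hypothetical, and
the sketch reports allocation failure instead of dying as CALLOC_ARRAY() does.

```c
#include <stdint.h>
#include <stdlib.h>

struct sketch_entry {
	int id;
};

struct sketch_pack {
	struct sketch_entry *objects; /* array of nr_alloc entries */
	size_t nr_alloc;
	uint32_t *mtime;              /* NULL until the first mtime is set */
};

/* Read an entry's mtime; 0 when no mtimes have been recorded at all. */
uint32_t sketch_get_mtime(const struct sketch_pack *pack,
			  const struct sketch_entry *e)
{
	if (!pack->mtime)
		return 0;
	return pack->mtime[e - pack->objects];
}

/* Record an entry's mtime, allocating the zeroed side array on first use. */
int sketch_set_mtime(struct sketch_pack *pack,
		     const struct sketch_entry *e, uint32_t mtime)
{
	if (!pack->mtime) {
		pack->mtime = calloc(pack->nr_alloc, sizeof(*pack->mtime));
		if (!pack->mtime)
			return -1;
	}
	pack->mtime[e - pack->objects] = mtime;
	return 0;
}
```

The side array has to grow whenever the object array does, which is what the
REALLOC_ARRAY() call added to packlist_alloc() takes care of in the patch.
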
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..270280c4df 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +546,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +559,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.35.1.73.gccc5557600



* [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.
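
As a rough illustration of what such a dump boils down to, here is a
minimal sketch. It assumes the mtimes table is an array of 4-byte
big-endian values, one per object, indexed by the object's position in
the pack index (the writer emits one hashwrite_be32() per object). The
helpers `read_be32()` and `mtime_at()` are hypothetical names used only
for this example; they are not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 4-byte big-endian (network order) value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the mtime of the object at index
 * position `pos`, given a pointer to the start of the mtimes table
 * (that is, just past the file header) and the number of entries.
 */
static uint32_t mtime_at(const unsigned char *table, uint32_t nr,
			 uint32_t pos)
{
	assert(table && pos < nr);
	return read_be32(table + (size_t)pos * 4);
}
```

The real tool goes through load_pack_mtimes() and nth_packed_mtime()
instead of touching the bytes directly; the sketch only shows the
position-to-timestamp mapping that the dump relies on.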

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index 1b186f4fd7..5c0ed1ade7 100644
--- a/Makefile
+++ b/Makefile
@@ -727,6 +727,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index e6ec69cf32..7d472b31fd 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -47,6 +47,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 20756eefdd..0ac4f32955 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.35.1.73.gccc5557600



* [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Rather than
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().
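
To illustrate the pattern with a toy example (not the real
packing_data code): the table, `create_entry()` and `find_entry()`
below are hypothetical stand-ins, and the linear search only stands in
for whatever lookup the caller would otherwise have to repeat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

struct entry {
	uint32_t id;
	uint32_t mtime;
};

static struct entry entries[MAX_ENTRIES];
static size_t nr_entries;

/* Append a new entry and hand it back so the caller can keep filling it in. */
static struct entry *create_entry(uint32_t id)
{
	struct entry *e;

	assert(nr_entries < MAX_ENTRIES);
	e = &entries[nr_entries++];
	e->id = id;
	e->mtime = 0;
	return e;
}

/* The lookup that a caller no longer needs right after creating an entry. */
static struct entry *find_entry(uint32_t id)
{
	size_t i;

	for (i = 0; i < nr_entries; i++)
		if (entries[i].id == id)
			return &entries[i];
	return NULL;
}
```

A caller can then write `create_entry(id)->mtime = t` in one step
rather than creating the entry and immediately searching for it again.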

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 385970cb7b..3f08a3c63a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.35.1.73.gccc5557600



* [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
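
The "noting and updating their mtimes" step amounts to keeping the
newest timestamp seen for each object, as add_cruft_object_entry() does
in the diff. A minimal sketch of that rule, where `struct cruft_entry`
and `note_mtime()` are hypothetical names for this example only:

```c
#include <assert.h>
#include <stdint.h>

struct cruft_entry {
	uint32_t mtime; /* most recent mtime seen so far; 0 if none yet */
};

/* Record one more copy of the object, keeping only the newest mtime. */
static void note_mtime(struct cruft_entry *e, uint32_t mtime)
{
	assert(e);
	if (mtime > e->mtime)
		e->mtime = mtime;
}
```

The order in which copies are encountered (loose, in a regular pack, or
in an existing cruft pack) therefore does not affect the recorded value.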

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain in the repository are marked as
    kept in-core. Packs which are going to be removed (we'll call these
    the redundant ones) are not.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
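
The classification performed by read_cruft_objects() in the diff can be
sketched as follows: packs listed without a prefix and packs not listed
at all end up kept in-core, while packs listed with a `-` prefix do
not, so only the latter contribute objects. This is a simplified
sketch; `enum pack_fate`, `classify_pack()` and `pack_is_kept()` are
hypothetical names, and a linear scan replaces the sorted string lists
used by the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum pack_fate {
	PACK_FRESH,   /* listed without a prefix: stays in the repository */
	PACK_DISCARD, /* listed with a '-' prefix: about to be removed */
	PACK_UNKNOWN  /* not listed at all: the caller did not know about it */
};

/* Look `pack_name` up in the caller's input lines; the first match wins. */
static enum pack_fate classify_pack(const char *const *lines, size_t nr,
				    const char *pack_name)
{
	size_t i;

	assert(pack_name);
	for (i = 0; i < nr; i++) {
		if (lines[i][0] == '-') {
			if (!strcmp(lines[i] + 1, pack_name))
				return PACK_DISCARD;
		} else if (!strcmp(lines[i], pack_name)) {
			return PACK_FRESH;
		}
	}
	return PACK_UNKNOWN;
}

/* Objects are only picked up from packs that are not kept in-core. */
static int pack_is_kept(enum pack_fate fate)
{
	return fate != PACK_DISCARD;
}
```

For the input lines "pack-a.pack" and "-pack-b.pack", only objects found
in pack-b.pack (plus loose objects) are candidates for the cruft pack.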

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5328-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5328-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3f08a3c63a..5ba4fc9c2c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4062,7 +4244,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4089,6 +4271,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4145,7 +4336,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4153,6 +4344,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else {
diff --git a/object-file.c b/object-file.c
index 8be57f48de..e80da1368d 100644
--- a/object-file.c
+++ b/object-file.c
@@ -996,7 +996,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 9b227661f2..6b025dc670 100644
--- a/object-store.h
+++ b/object-store.h
@@ -334,6 +334,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5328-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:19     ` Derrick Stolee
  2022-03-02  0:58   ` [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (8 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a couple of things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
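
As a toy model of the interface described above (not the code in
reachable.c), the sketch below walks a set of packs, optionally skips
those marked as kept in-core, and reports each object to a callback.
All names are hypothetical. Unlike report_recent_object_fn, the toy
callback takes a position instead of an offset and carries an extra
`data` pointer so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pack {
	int kept_in_core;
	uint32_t mtime;
	size_t nr_objects;
};

/* Called once per visited object: owning pack, position, and mtime. */
typedef void report_fn(const struct toy_pack *pack, size_t pos,
		       uint32_t mtime, void *data);

/*
 * Visit every object in every pack, optionally skipping kept packs.
 * Returns the number of objects visited; `cb` may be NULL.
 */
static size_t for_each_recent(const struct toy_pack *packs, size_t nr_packs,
			      int skip_kept, report_fn *cb, void *data)
{
	size_t i, pos, visited = 0;

	assert(packs || !nr_packs);
	for (i = 0; i < nr_packs; i++) {
		if (skip_kept && packs[i].kept_in_core)
			continue;
		for (pos = 0; pos < packs[i].nr_objects; pos++) {
			if (cb)
				cb(&packs[i], pos, packs[i].mtime, data);
			visited++;
		}
	}
	return visited;
}

/* Example callback: remember the newest mtime reported so far. */
static void track_newest(const struct toy_pack *pack, size_t pos,
			 uint32_t mtime, void *data)
{
	uint32_t *newest = data;

	(void)pack;
	(void)pos;
	if (mtime > *newest)
		*newest = mtime;
}
```

Passing a NULL callback and a zero skip flag reproduces the old
behavior, which is what the existing callers in this patch do.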

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ba4fc9c2c..1ef333717d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.35.1.73.gccc5557600



* [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
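
As a rough illustration only (this is not the code in the patch), the
lazy lookup described above can be sketched in standalone C. The
`toy_*` names and the pared-down struct are assumptions made for the
example; the real code works on `struct packed_git` and calls
load_pack_mtimes() and nth_packed_mtime().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, pared-down stand-in for 'struct packed_git'. */
struct toy_pack {
	uint32_t pack_mtime;            /* mtime of the pack file itself */
	int is_cruft;                   /* nonzero if a .mtimes file exists */
	uint32_t nr_objects;
	const uint32_t *mtimes_on_disk; /* simulates the .mtimes contents */
	const uint32_t *mtimes;         /* NULL until the table is loaded */
	int nr_loads;                   /* how many times the slow path ran */
};

/* Load the table once; later calls return early. */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (p->mtimes)
		return 0;
	if (!p->mtimes_on_disk)
		return -1; /* pretend the file could not be read */
	p->mtimes = p->mtimes_on_disk;
	p->nr_loads++;
	return 0;
}

/*
 * Store the mtime to report for the object at 'pos' in '*out': the
 * per-object value for cruft packs, the pack's own mtime otherwise.
 * Returns 0 on success, -1 if the table could not be loaded.
 */
static int toy_object_mtime(struct toy_pack *p, uint32_t pos, uint32_t *out)
{
	assert(pos < p->nr_objects);
	*out = p->pack_mtime;
	if (p->is_cruft) {
		if (toy_load_mtimes(p) < 0)
			return -1;
		*out = p->mtimes[pos];
	}
	return 0;
}
```

Calling toy_object_mtime() repeatedly on the same cruft pack runs the
loading path only once, which mirrors the "only the first call is slow"
point above.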

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.35.1.73.gccc5557600



* [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  7:42     ` Junio C Hamano
  2022-03-02  0:58   ` [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (6 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever some
other recent (i.e., non-prunable) unreachable object reaches them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of cases, any object we visit in this traversal will
already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they have aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
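
For readers who prefer code to prose, the three steps above can be
sketched on a toy object graph. This is a minimal illustration under
stated assumptions, not the implementation in this patch: `toy_object`,
`toy_walk()` and `toy_collect_cruft()` are hypothetical names, each
object reaches at most two others, and "recent" is taken to mean an
mtime strictly newer than the expiration cutoff.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 2

/* Hypothetical object node; real objects are commits, trees, blobs, tags. */
struct toy_object {
	long mtime;       /* last known modification time */
	int in_kept_pack; /* already stored in a pack that will survive */
	int seen;         /* visited by the traversal */
	int packed;       /* result: goes into the cruft pack */
	struct toy_object *links[TOY_MAX_LINKS]; /* objects this one reaches */
};

/* Steps 2 and 3: walk from a tip, stopping at objects in a kept pack. */
static void toy_walk(struct toy_object *obj)
{
	size_t i;

	if (!obj || obj->seen || obj->in_kept_pack)
		return;
	obj->seen = 1;
	obj->packed = 1; /* no-op for recent objects, a rescue for old ones */
	for (i = 0; i < TOY_MAX_LINKS; i++)
		toy_walk(obj->links[i]);
}

static void toy_collect_cruft(struct toy_object *objs, size_t nr,
			      long expiration)
{
	size_t i;

	assert(objs || !nr);

	/* Step 1: only objects newer than the cutoff enter the list. */
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > expiration)
			objs[i].packed = 1;

	/* Steps 2 and 3: the recent objects are the traversal tips. */
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > expiration)
			toy_walk(&objs[i]);
}
```

Note that the sketch never touches the mtime of a rescued object: the
object is picked up again by the traversal on the next run for as long
as something recent still reaches it.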

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1ef333717d..fcac0b5c91 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary `--cruft-expiration` option.
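
To make the hand-off to `git pack-objects --cruft` easier to follow,
here is a small standalone sketch of the line-oriented input that
write_cruft_pack() in the diff below feeds to the child's stdin:
just-written packs are listed with the temporary prefix, existing
non-kept packs get a leading `-`, and existing kept packs are listed
as-is. The `toy_*` helpers and the buffer-based interface are
assumptions for illustration; the patch itself writes to a FILE handle
with fprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append "<lead><sep><name>.pack\n" to buf; return -1 if it does not fit. */
static int toy_append_line(char *buf, size_t cap, size_t *len,
			   const char *lead, const char *sep,
			   const char *name)
{
	int n = snprintf(buf + *len, cap - *len, "%s%s%s.pack\n",
			 lead, sep, name);

	if (n < 0 || (size_t)n >= cap - *len)
		return -1;
	*len += (size_t)n;
	return 0;
}

/* Build the whole input in 'buf'. Returns 0 on success, -1 on overflow. */
static int toy_format_cruft_input(char *buf, size_t cap,
				  const char *pack_prefix,
				  const char **fresh, size_t nr_fresh,
				  const char **existing, size_t nr_existing,
				  const char **kept, size_t nr_kept)
{
	size_t i, len = 0;

	assert(buf && cap > 0);
	buf[0] = '\0';

	/* Packs written moments ago: "<prefix>-<name>.pack" */
	for (i = 0; i < nr_fresh; i++)
		if (toy_append_line(buf, cap, &len, pack_prefix, "-", fresh[i]) < 0)
			return -1;
	/* Existing non-kept packs: "-<name>.pack" */
	for (i = 0; i < nr_existing; i++)
		if (toy_append_line(buf, cap, &len, "-", "", existing[i]) < 0)
			return -1;
	/* Existing kept packs: "<name>.pack" */
	for (i = 0; i < nr_kept; i++)
		if (toy_append_line(buf, cap, &len, "", "", kept[i]) < 0)
			return -1;
	return 0;
}
```

With one pack in each list, the buffer ends up holding three lines, one
per pack, in the order fresh, existing, kept.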

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 106 +++++++++++-
 t/t5328-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 320 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index 2c3c5d93f8..f80e975a47 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index f908f7d5dd..f7fb88bcf1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -300,9 +307,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -339,6 +343,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -600,6 +606,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -616,7 +683,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -632,6 +698,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -684,6 +755,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -767,7 +847,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -829,6 +910,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 939cdc297a..06c550c958 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
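
As a simplified illustration of the key-to-field mapping (not the patch
itself), the new variables can be thought of as filling in a small set
of string parameters. The `toy_*` names are hypothetical, and unlike
git_config_string(), which the patch uses, this sketch borrows the
value pointer instead of copying it and does not reject a missing
value. Keys are compared in lowercase because that is the form in which
the diff below matches them.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical subset of the arguments passed on to pack-objects. */
struct toy_cruft_args {
	const char *window;
	const char *window_memory;
	const char *depth;
	const char *threads;
};

/*
 * Record 'value' if 'var' is one of the repack.cruft* keys.
 * Returns 1 when the key was recognized, 0 otherwise.
 */
static int toy_cruft_config(struct toy_cruft_args *args,
			    const char *var, const char *value)
{
	assert(args && var);

	if (!strcmp(var, "repack.cruftwindow"))
		args->window = value;
	else if (!strcmp(var, "repack.cruftwindowmemory"))
		args->window_memory = value;
	else if (!strcmp(var, "repack.cruftdepth"))
		args->depth = value;
	else if (!strcmp(var, "repack.cruftthreads"))
		args->threads = value;
	else
		return 0; /* not ours; a real callback would fall through */
	return 1;
}
```

Unrecognized keys such as `pack.window` leave the struct untouched, so
the regular `pack.*` settings keep applying to the non-cruft pack.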

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5328-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index f7fb88bcf1..d61c78e94e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -689,6 +698,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -743,7 +753,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -918,7 +928,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 06c550c958..e4744e4465 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
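
As a rough illustration of the idea (not the patch itself), flag bits
can be kept in a `void *` by round-tripping through `uintptr_t`. The
names `item_sketch`, `set_flag` and `has_flag` are hypothetical;
`DELETE_PACK` is the value defined below, and `CRUFT_PACK` is the second
state that the next patch adds.

```c
#include <assert.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2 /* introduced by the following patch */

/* Hypothetical stand-in for a string list item with a `util` pointer. */
struct item_sketch {
	void *util;
};

/* Set one flag bit while preserving any bits that are already set. */
static void set_flag(struct item_sketch *item, uintptr_t flag)
{
	assert(flag != 0);
	item->util = (void *)((uintptr_t)item->util | flag);
}

/* Test a single flag bit instead of the truthiness of the pointer. */
static int has_flag(const struct item_sketch *item, uintptr_t flag)
{
	return ((uintptr_t)item->util & flag) != 0;
}
```

Testing individual bits matters once more than one state exists: a plain
`if (item->util)` check would treat a pack that is only marked as cruft
as if it were marked for deletion.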

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index d61c78e94e..afa4d51a22 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -561,7 +563,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1000,7 +1002,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1024,7 +1027,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

This happens because we don't include cruft packs in the MIDX when
doing a geometric repack. Since making that object reachable doesn't
necessarily mean that we'll create a new copy of that object in one of
the packs that will get rolled up as part of a geometric repack, it's
possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
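
The selection step can be pictured with the following sketch. It is a
simplified stand-in rather than the code in this patch:
`pack_item_sketch` and `select_cruft_packs` are hypothetical names, and
the output is a plain array instead of a string list. Only packs that
carry the `CRUFT_PACK` bit are picked up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/* Hypothetical stand-in for an entry in `existing_nonkept_packs`. */
struct pack_item_sketch {
	const char *name;
	void *util;
};

/*
 * Copy the names of all packs flagged as cruft into `out` (which must
 * have room for `nr` entries) and return how many were selected.
 */
static size_t select_cruft_packs(const struct pack_item_sketch *items,
				 size_t nr, const char **out)
{
	size_t i, n = 0;

	assert(items || !nr);
	for (i = 0; i < nr; i++) {
		if (!((uintptr_t)items[i].util & CRUFT_PACK))
			continue; /* not a cruft pack, handled elsewhere */
		out[n++] = items[i].name;
	}
	return n;
}
```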

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5328-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index afa4d51a22..59b60cd309 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -561,6 +565,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index e4744e4465..13158e4ab7 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02  0:58   ` [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
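
The resulting choice of repack mode can be summarized with a small
sketch. The function name `repack_mode_sketch` is hypothetical, and the
sketch leaves out the expiration arguments (`--cruft-expiration` and
`--unpack-unreachable`) that accompany the last two modes when a prune
expiry is set.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the repack option that selects how unreachable objects are
 * handled: dropped outright ("-a"), collected into a cruft pack
 * ("--cruft"), or exploded as loose objects ("-A").
 */
static const char *repack_mode_sketch(const char *prune_expire,
				      int cruft_packs)
{
	assert(cruft_packs == 0 || cruft_packs == 1);
	if (prune_expire && !strcmp(prune_expire, "now"))
		return "-a";
	if (cruft_packs)
		return "--cruft";
	return "-A";
}
```

Note that pruning immediately takes priority: with `--prune=now` there
is nothing left to store, so no cruft pack is requested even when
`--cruft` is given.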

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5328-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index ffaf0daf5d..11f5150234 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -153,6 +154,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -332,7 +334,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -552,6 +558,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -671,6 +678,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 13158e4ab7..3910e186ef 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-03-02  0:58   ` Taylor Blau
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02  0:58 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file,
which may incur some wasted effort, especially if there are a lot of cruft
objects and they are freshened infrequently.

So make the freshening code skip its usual optimization of avoiding a
write when a packed copy exists, so that the object is written out loose
and picks up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
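
In sketch form, the decision looks something like the following. The
names `pack_sketch`, `freshen_packed_sketch` and `write_object_sketch`
are hypothetical, the loose-object freshening that happens first in the
real code is left out, and bumping a pack's mtime is reduced to setting
a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the pack that holds the object, if any. */
struct pack_sketch {
	int is_cruft;
	int freshened;
};

/* Return 1 if a packed copy was freshened, so no write is needed. */
static int freshen_packed_sketch(struct pack_sketch *p)
{
	if (!p)
		return 0; /* object is not in any pack */
	if (p->is_cruft)
		return 0; /* never freshen via a cruft pack */
	p->freshened = 1; /* stand-in for updating the pack's mtime */
	return 1;
}

/* Write the object loose unless an existing copy could be freshened. */
static void write_object_sketch(struct pack_sketch *p, int *loose_writes)
{
	assert(loose_writes);
	if (freshen_packed_sketch(p))
		return;
	(*loose_writes)++; /* stand-in for writing the loose object */
}
```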

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5328-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index e80da1368d..65b8df7fb6 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1989,6 +1989,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5328-pack-objects-cruft.sh b/t/t5328-pack-objects-cruft.sh
index 3910e186ef..4681558612 100755
--- a/t/t5328-pack-objects-cruft.sh
+++ b/t/t5328-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600

^ permalink raw reply related	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-02  7:42     ` Junio C Hamano
  2022-03-02 15:54       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-03-02  7:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

>  builtin/pack-objects.c        |  84 +++++++++++++++++++-
>  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
>  2 files changed, 226 insertions(+), 1 deletion(-)

I'd renumber this to 5329, as the latest iteration of generation
number v2 series took 5328, while queuing.



^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02  7:42     ` Junio C Hamano
@ 2022-03-02 15:54       ` Taylor Blau
  2022-03-02 19:57         ` Derrick Stolee
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-02 15:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, tytso, derrickstolee, larsxschneider

On Tue, Mar 01, 2022 at 11:42:57PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >  builtin/pack-objects.c        |  84 +++++++++++++++++++-
> >  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 226 insertions(+), 1 deletion(-)
>
> I'd renumber this to 5329, as the latest iteration of generation
> number v2 series took 5328, while queuing.

Oops. I had scanned that series, but glossed over the new test number.

Thanks for renaming (I'll do the same, in case we end up accumulating
more reroll-able bits).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-02 15:54       ` Taylor Blau
@ 2022-03-02 19:57         ` Derrick Stolee
  0 siblings, 0 replies; 201+ messages in thread
From: Derrick Stolee @ 2022-03-02 19:57 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano; +Cc: git, tytso, larsxschneider

On 3/2/2022 10:54 AM, Taylor Blau wrote:
> On Tue, Mar 01, 2022 at 11:42:57PM -0800, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>>>  builtin/pack-objects.c        |  84 +++++++++++++++++++-
>>>  t/t5328-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
>>>  2 files changed, 226 insertions(+), 1 deletion(-)
>>
>> I'd renumber this to 5329, as the latest iteration of generation
>> number v2 series took 5328, while queuing.
> 
> Oops. I had scanned that series, but glossed over the new test number.
> 
> Thanks for renaming (I'll do the same, in case we end up accumulating
> more reroll-able bits).

Sorry for the collision! Had I realized this was already used here,
I would have changed the number myself.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-02 20:19     ` Derrick Stolee
  2022-03-02 21:28       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:19 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:58 PM, Taylor Blau wrote:

> diff --git a/reachable.h b/reachable.h
> index 5df932ad8f..b776761baa 100644
> --- a/reachable.h
> +++ b/reachable.h
> @@ -1,11 +1,18 @@
>  #ifndef REACHEABLE_H
>  #define REACHEABLE_H
>  
> +#include "object.h"
> +

Nit: just realized this include could be replaced by a struct
declaration:

>  struct progress;
>  struct rev_info;

Like these. 'struct object;' should be enough for the typedef.
>  
> +typedef void report_recent_object_fn(const struct object *, struct packed_git *,
> +				     off_t, time_t);
> +
>  int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
> -					   timestamp_t timestamp);
> +					   timestamp_t timestamp,
> +					   report_recent_object_fn cb,
> +					   int ignore_in_core_kept_packs);

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-02 20:22     ` Derrick Stolee
  2022-03-02 21:33       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:22 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:58 PM, Taylor Blau wrote:
> To store the individual mtimes of objects in a cruft pack, introduce a
> new `.mtimes` format that can optionally accompany a single pack in the
> repository.
> 
> The format is defined in Documentation/technical/pack-format.txt, and
> stores a 4-byte network order timestamp for each object in name (index)
> order.
> 
> This patch prepares for cruft packs by defining the `.mtimes` format,
> and introducing a basic API that callers can use to read out individual
> mtimes.
...
> +int load_pack_mtimes(struct packed_git *p)
> +{
> +	char *mtimes_name = NULL;
> +	int ret = 0;
> +
> +	if (!p->is_cruft)
> +		return ret; /* not a cruft pack */
> +	if (p->mtimes_map)
> +		return ret; /* already loaded */
> +
> +	ret = open_pack_index(p);
> +	if (ret < 0)
> +		goto cleanup;
> +
> +	mtimes_name = pack_mtimes_filename(p);
> +	ret = load_pack_mtimes_file(mtimes_name,
> +				    p->num_objects,
> +				    &p->mtimes_map,
> +				    &p->mtimes_size);
> +	if (ret)
> +		goto cleanup;

This looked odd to me, so I supposed that you had some code
that would be inserted between this 'goto cleanup' and the
'cleanup:' label, but I did not find such an insertion in
the remaining patches. This 'if' can be deleted.

> +cleanup:
> +	free(mtimes_name);
> +	return ret;
> +}

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 00/17] cruft packs
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-03-02  0:58   ` [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-03-02 20:23   ` Derrick Stolee
  2022-03-02 21:36     ` Taylor Blau
  17 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-03-02 20:23 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/1/2022 7:57 PM, Taylor Blau wrote:
> Here is a reroll of my series to implement "cruft packs", a pack which
> stores accumulated unreachable objects, along with a new ".mtimes" file
> which tracks each object's last known modification time.
> 
> This was on the list towards the end of 2021[1], and I have been
> accumulating small changes to it locally for a couple of months now.
> Major changes since last time include:
> 
>   - Clearer documentation and commit message(s) to better illustrate how
>     the feature works and is supposed to be used.
> 
>   - Some minor documentation updates to pack-format.txt, which make some
>     ambiguous details more explicit.
> 
>   - Minor code movement / tweaks to make things easier to read, ensure
>     that functions aren't introduced in patches before they are used /
>     etc.
> 
>   - Moved the new test script to t5328 (instead of t5327, which happens
>     to be taken up by a new MIDX bitmap-related test), and purged it of
>     all "rm -fr .git/logs" (replacing them with "git reflog expire
>     --all --expire=all" instead).
> 
>   - A new test which fixes a bug where loose objects which have copies
>     that appear in a cruft pack would not get accumulated when doing a
>     `--geometric` repack.
> 
> For convenience, a range-diff is below. Thanks in advance for taking
> another look!

It had been a while since my last read, so I read the patches
in full one more time. I found a couple nitpicks, but otherwise
everything is looking good.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-02 20:19     ` Derrick Stolee
@ 2022-03-02 21:28       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02 21:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:19:57PM -0500, Derrick Stolee wrote:
> Nit: just realized this include could be replaced by a struct
> declaration:
>
> >  struct progress;
> >  struct rev_info;
>
> Like these. 'struct object;' should be enough for the typedef.

Good catch. We would need one for the packed_git struct, too. I don't
have a strong opinion about including object.h or not, though needing
two stubs pushes me slightly in the direction of leaving the include
alone.
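
For anyone following along, here is a minimal standalone sketch of why the
two stubs are enough. The typedef is the one from reachable.h; the callback
and driver (sketch_count_object, sketch_forward_decl_demo) are hypothetical
names made up for this illustration and are not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* off_t */
#include <time.h>	/* time_t */

/*
 * Incomplete types: sufficient as long as the header only ever uses
 * pointers to them, so neither full struct definition has to be included.
 */
struct object;
struct packed_git;

typedef void report_recent_object_fn(const struct object *, struct packed_git *,
				     off_t, time_t);

/* Hypothetical callback: counts how many times it was invoked. */
static int sketch_calls;

static void sketch_count_object(const struct object *obj,
				struct packed_git *pack,
				off_t offset, time_t mtime)
{
	(void)obj;
	(void)pack;
	(void)offset;
	(void)mtime;
	sketch_calls++;
}

/* Hypothetical driver: calls the callback twice through the typedef. */
static int sketch_forward_decl_demo(void)
{
	report_recent_object_fn *cb = sketch_count_object;

	assert(cb != NULL);
	sketch_calls = 0;
	cb(NULL, NULL, 0, 0);
	cb(NULL, NULL, 0, 0);
	return sketch_calls;
}
```

This compiles without ever seeing the definition of either struct, which is
all the typedef needs. Code that dereferences the pointers still has to
include the real headers.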

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 02/17] pack-mtimes: support reading .mtimes files
  2022-03-02 20:22     ` Derrick Stolee
@ 2022-03-02 21:33       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02 21:33 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:22:18PM -0500, Derrick Stolee wrote:
> > +	ret = load_pack_mtimes_file(mtimes_name,
> > +				    p->num_objects,
> > +				    &p->mtimes_map,
> > +				    &p->mtimes_size);
> > +	if (ret)
> > +		goto cleanup;
>
> This looked odd to me, so I supposed that you had some code
> that would be inserted between this 'goto cleanup' and the
> 'cleanup:' label, but I did not find such an insertion in
> the remaining patches. This 'if' can be deleted.

Thanks for spotting. My gut was that there must be something in the
range-diff between this and the previous round, but there isn't. So this
code has always been there.

It likely comes from load_pack_revindex_from_disk(), which assigns the
`revindex_data` member of `struct packed_git` after calling
load_revindex_from_disk(), but only if it returned zero.

We don't have to assign mtimes_data here (it doesn't exist), because
all of our reads into mtimes_map are offset by 3 to adjust for the width
of the header.

Anyway, we don't need this if statement here, so I'll drop it.
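
To spell out the "offset by 3" part, here is a rough standalone sketch,
assuming the map is viewed as an array of 4-byte words so that the 12-byte
header occupies the first three. The sketch_* names are mine and only mirror
the shape of nth_packed_mtime(); they are not code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The 12-byte header is three 4-byte words: magic, version, hash id. */
#define SKETCH_HEADER_WORDS 3

/* Read a big-endian (network order) 32-bit value from memory. */
static uint32_t sketch_get_be32(const void *ptr)
{
	const unsigned char *p = ptr;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Look up the mtime of the object at index position 'pos'. */
static uint32_t sketch_nth_mtime(const uint32_t *map, uint32_t num_objects,
				 uint32_t pos)
{
	assert(map != NULL);
	assert(pos < num_objects);
	/* Pointer arithmetic is in words, so +3 skips the whole header. */
	return sketch_get_be32(map + pos + SKETCH_HEADER_WORDS);
}

/* Hypothetical demo: a header followed by two timestamps (42 and 256). */
static uint32_t sketch_second_mtime_demo(void)
{
	static const unsigned char raw[] = {
		'M', 'T', 'M', 'E',	/* magic */
		0, 0, 0, 1,		/* version */
		0, 0, 0, 1,		/* hash id (SHA-1) */
		0, 0, 0, 42,		/* mtime of object 0 */
		0, 0, 1, 0,		/* mtime of object 1 */
	};
	uint32_t map[5];

	memcpy(map, raw, sizeof(raw));
	return sketch_nth_mtime(map, 2, 1);
}
```

Since every read already skips the header, a separate pointer to the start of
the table would carry no extra information.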

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v2 00/17] cruft packs
  2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
@ 2022-03-02 21:36     ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-02 21:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, tytso, gitster, larsxschneider

On Wed, Mar 02, 2022 at 03:23:05PM -0500, Derrick Stolee wrote:
> > For convenience, a range-diff is below. Thanks in advance for taking
> > another look!
>
> It had been a while since my last read, so I read the patches
> in full one more time. I found a couple nitpicks, but otherwise
> everything is looking good.

Thanks for reading! I took both of your suggestions (along with Junio's
to rename the test script to t5329 to avoid a clash with your series)
and will re-submit a tiny reroll shortly.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* [PATCH v3 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (18 preceding siblings ...)
  2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
@ 2022-03-03  0:20 ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (17 more replies)
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  21 siblings, 18 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Here is a small reroll of my series to implement "cruft packs", based on
Stolee's review.

The changes here are minor, and mostly are limited to removing a
redundant "if" statement, avoiding an unnecessary header include, and
moving the tests (again!) to t5329's territory.

As always, a range-diff is below. Thanks in advance for taking another
look!

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt |  97 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 183 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1807 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v2:
 -:  ---------- >  1:  784ee7e0ee Documentation/technical: add cruft-packs.txt
 1:  101b34660c !  2:  1ec754ad1b pack-mtimes: support reading .mtimes files
    @@ pack-mtimes.c (new)
     +				    p->num_objects,
     +				    &p->mtimes_map,
     +				    &p->mtimes_size);
    -+	if (ret)
    -+		goto cleanup;
    -+
     +cleanup:
     +	free(mtimes_name);
     +	return ret;
 2:  a94d7dfeb3 =  3:  0f5d6d6492 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 3:  1e0ed363ae =  4:  135a07276b chunk-format.h: extract oid_version()
 4:  5236490688 =  5:  0600503856 pack-mtimes: support writing pack .mtimes files
 5:  78313bc441 =  6:  4780c8437b t/helper: add 'pack-mtimes' test-tool
 6:  142098668d =  7:  33862a07c9 builtin/pack-objects.c: return from create_object_entry()
 7:  2517a6be3d !  8:  22705e4887 builtin/pack-objects.c: --cruft without expiration
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
      /*
     
    - ## t/t5328-pack-objects-cruft.sh (new) ##
    + ## t/t5329-pack-objects-cruft.sh (new) ##
     @@
     +#!/bin/sh
     +
 8:  6f0e84273f =  9:  cebb30b667 reachable: add options to add_unseen_recent_objects_to_traversal
 9:  a8bde361f9 = 10:  fa4de8859d reachable: report precise timestamps from objects in cruft packs
10:  d68ce28132 ! 11:  92318f8700 builtin/pack-objects.c: --cruft with expiration
    @@ builtin/pack-objects.c: static void read_cruft_objects(void)
      		enumerate_cruft_objects();
      
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: basic_cruft_pack_tests () {
    + ## reachable.h ##
    +@@
    + #ifndef REACHEABLE_H
    + #define REACHEABLE_H
    + 
    +-#include "object.h"
    +-
    + struct progress;
    + struct rev_info;
    ++struct object;
    ++struct packed_git;
    + 
    + typedef void report_recent_object_fn(const struct object *, struct packed_git *,
    + 				     off_t, time_t);
    +
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: basic_cruft_pack_tests () {
      }
      
      basic_cruft_pack_tests never
11:  e5317cd472 ! 12:  1e94b33cb4 builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
      	}
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned' '
      	)
      '
      
12:  b548dbbf80 ! 13:  9cfcd123bd builtin/repack.c: allow configuring cruft pack generation
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				       &existing_kept_packs);
      		if (ret)
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft repack ignores pack.packSizeLimit' '
      	)
      '
      
13:  e6eee7f15c = 14:  1a58807df0 builtin/repack.c: use named flags for existing_packs
14:  b09dbc9fe5 ! 15:  ed05cf536b builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      		for_each_string_list_item(item, existing_nonkept_packs) {
      			if ((uintptr_t)item->util & DELETE_PACK)
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'cruft --local drops unreachable objects' '
      	)
      '
      
15:  7a21ae1494 ! 16:  1d5f334138 builtin/gc.c: conditionally avoid pruning objects via loose
    @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      			if (quiet)
      				strvec_push(&prune, "--no-progress");
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'loose objects mtimes upsert others' '
      	)
      '
      
16:  b729b80963 ! 17:  f74b425872 sha1-file.c: don't freshen cruft packs
    @@ object-file.c: static int freshen_packed_object(const struct object_id *oid)
      		return 1;
      	if (!freshen_file(e.p->pack_name))
     
    - ## t/t5328-pack-objects-cruft.sh ##
    -@@ t/t5328-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
    + ## t/t5329-pack-objects-cruft.sh ##
    +@@ t/t5329-pack-objects-cruft.sh: test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
      	)
      '
      
-- 
2.35.1.73.gccc5557600

^ permalink raw reply	[flat|nested] 201+ messages in thread

* [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-07 18:03     ` Jonathan Nieder
  2022-03-03  0:20   ` [PATCH v3 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (16 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..2c3c5d93f8
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,97 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, using
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which is (a) not contained in a
+     kept-pack, and (b) whose mtime is within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v3 02/17] pack-mtimes: support reading .mtimes files
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.
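
As a rough illustration of the layout (a sketch only, not the code in this
patch), the expected file size and the header checks can be written as
below. The sketch_* names are hypothetical, and the raw hash sizes (20 bytes
for SHA-1, 32 for SHA-256) are filled in by me rather than taken from
the_hash_algo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_HEADER_SIZE 12u /* magic + version + hash id */

/*
 * Expected size: header, one 4-byte mtime per object, then two trailing
 * checksums (of the pack, and of the .mtimes contents before it).
 * The real loader guards the multiplication with st_mult(); this sketch
 * assumes it cannot overflow.
 */
static size_t sketch_mtimes_file_size(uint32_t num_objects, size_t hash_rawsz)
{
	assert(hash_rawsz == 20 || hash_rawsz == 32);
	return SKETCH_HEADER_SIZE + (size_t)num_objects * 4 + 2 * hash_rawsz;
}

/* Header words are assumed to be converted to host order already. */
static int sketch_check_header(uint32_t signature, uint32_t version,
			       uint32_t hash_id)
{
	if (signature != SKETCH_MTIMES_SIGNATURE)
		return -1; /* unknown signature */
	if (version != 1)
		return -1; /* unsupported version */
	if (hash_id != 1 && hash_id != 2)
		return -1; /* neither SHA-1 nor SHA-256 */
	return 0;
}
```

For example, a cruft pack with three objects in a SHA-1 repository would have
a 12 + 3*4 + 2*20 = 64 byte .mtimes file.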

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 6f0b4b775f..1b186f4fd7 100644
--- a/Makefile
+++ b/Makefile
@@ -959,6 +959,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a75..f908f7d5dd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -212,6 +212,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 6f89482df0..9b227661f2 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..46ad584af1
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,126 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.
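
To illustrate the trade-off (a simplified sketch with made-up names, not the
actual packing_data layout): keeping the mtimes in a lazily allocated side
array means the per-object struct does not grow, and callers that never
write a cruft pack never allocate the array at all.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-object entry; note: no mtime field. */
struct sketch_entry {
	uint32_t size;
};

/* Hypothetical stand-in for the pack-wide structure. */
struct sketch_packing_data {
	struct sketch_entry *objects;
	uint32_t nr_objects;
	uint32_t *cruft_mtime; /* NULL unless a cruft pack is being written */
};

/* Record an mtime, allocating the side array on first use. */
static int sketch_set_mtime(struct sketch_packing_data *pack,
			    uint32_t pos, uint32_t mtime)
{
	assert(pack != NULL);
	if (pos >= pack->nr_objects)
		return -1;
	if (!pack->cruft_mtime) {
		pack->cruft_mtime = calloc(pack->nr_objects, sizeof(uint32_t));
		if (!pack->cruft_mtime)
			return -1;
	}
	pack->cruft_mtime[pos] = mtime;
	return 0;
}

/* Tolerates a NULL pack, like the caller that has no packing data. */
static uint32_t sketch_get_mtime(const struct sketch_packing_data *pack,
				 uint32_t pos)
{
	if (!pack || !pack->cruft_mtime || pos >= pack->nr_objects)
		return 0;
	return pack->cruft_mtime[pos];
}

/* Hypothetical demo: reads 0 before the array exists, 1000 afterwards. */
static uint32_t sketch_side_array_demo(void)
{
	struct sketch_entry entries[3] = { { 10 }, { 20 }, { 30 } };
	struct sketch_packing_data pack = { entries, 3, NULL };
	uint32_t before, after;

	before = sketch_get_mtime(&pack, 1);
	sketch_set_mtime(&pack, 1, 1000);
	after = sketch_get_mtime(&pack, 1);
	free(pack.cruft_mtime);
	return before + after;
}
```

The NULL check in the getter plays the same role as passing NULL from
bulk-checkin.c: no packing data simply means there are no mtimes to write.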

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 178e611f09..385970cb7b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1254,7 +1254,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 8785b2ac80..99f7596c4e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index a5846f3a34..d594e3008e 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -483,6 +483,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.35.1.73.gccc5557600


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (13 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).
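The widening mentioned above can be illustrated with a small standalone
sketch. It does not use Git's headers: the `sketch_` names and the enum
are hypothetical stand-ins for `oid_version()`, `hashwrite_be32()` and
the hash algorithm identifiers, assumed only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for Git's hash algorithm identifiers. */
enum sketch_hash_algo {
	SKETCH_HASH_SHA1 = 1,
	SKETCH_HASH_SHA256 = 2
};

/* Same shape as the consolidated helper: a single byte is enough. */
static uint8_t sketch_oid_version(enum sketch_hash_algo algo)
{
	assert(algo == SKETCH_HASH_SHA1 || algo == SKETCH_HASH_SHA256);
	return algo == SKETCH_HASH_SHA1 ? 1 : 2;
}

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_put_be32(uint8_t out[4], uint32_t value)
{
	out[0] = (uint8_t)(value >> 24);
	out[1] = (uint8_t)(value >> 16);
	out[2] = (uint8_t)(value >> 8);
	out[3] = (uint8_t)value;
}

/*
 * A .rev-style caller wants a 4-byte field. The uint8_t return value is
 * converted to uint32_t implicitly at the call site.
 */
static void sketch_write_version_field(uint8_t out[4],
				       enum sketch_hash_algo algo)
{
	sketch_put_be32(out, sketch_oid_version(algo));
}
```

Callers that write a single byte can use the return value directly, so
no caller needs a cast.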

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 265c010122..f678d2c4a1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1911,7 +1899,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 865170bad0..65e670c5e2 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index d594e3008e..ff305b404c 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.35.1.73.gccc5557600



* [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
  2022-03-03  0:20   ` [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (12 subsequent siblings)
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag (`WRITE_MTIMES`) and passing a `packing_data`
structure whose `cruft_mtime` array contains the timestamps
corresponding to each object in the pack.
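As a rough illustration of the layout this patch writes (three 4-byte
header fields, one 4-byte mtime per object in index order, then the
pack's hash), here is a standalone sketch that serializes into a memory
buffer. The `sketch_` names are hypothetical, the signature and version
values are assumptions (the real constants live in pack-mtimes.h, which
is not part of this patch), and the trailing file checksum that
finalize_hashfile() appends is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for illustration; see pack-mtimes.h for the real ones. */
#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u
#define SKETCH_MTIMES_VERSION 1u

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_put_be32(uint8_t *out, uint32_t value)
{
	out[0] = (uint8_t)(value >> 24);
	out[1] = (uint8_t)(value >> 16);
	out[2] = (uint8_t)(value >> 8);
	out[3] = (uint8_t)value;
}

/*
 * Serialize the header, the per-object mtimes (already in index order)
 * and the pack hash trailer into "out". Returns the number of bytes
 * written, or 0 if "cap" is too small.
 */
static size_t sketch_write_mtimes(uint8_t *out, size_t cap,
				  uint32_t oid_version,
				  const uint32_t *mtimes, uint32_t nr,
				  const uint8_t *pack_hash, size_t hash_len)
{
	size_t need = 12 + (size_t)nr * 4 + hash_len;
	size_t pos = 0;
	uint32_t i;

	assert(out && pack_hash);
	assert(mtimes || !nr);
	if (cap < need)
		return 0;

	sketch_put_be32(out + pos, SKETCH_MTIMES_SIGNATURE);
	pos += 4;
	sketch_put_be32(out + pos, SKETCH_MTIMES_VERSION);
	pos += 4;
	sketch_put_be32(out + pos, oid_version);
	pos += 4;

	for (i = 0; i < nr; i++) {
		sketch_put_be32(out + pos, mtimes[i]);
		pos += 4;
	}

	memcpy(out + pos, pack_hash, hash_len);
	pos += hash_len;
	return pos;
}
```

Because every record has a fixed width, a reader can locate the mtime of
the object at index position `i` at byte offset `12 + 4 * i` without
parsing anything else.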

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index ff305b404c..270280c4df 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -276,6 +280,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -478,6 +546,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -490,9 +559,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.35.1.73.gccc5557600



* [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-03  0:20   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:20 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.
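The helper locates the pack by comparing its argument against each
pack's basename with `.pack` replaced by `.mtimes`. A standalone sketch
of that name derivation follows. It uses only the C library instead of
strbuf, and `sketch_mtimes_name` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Derive "<name>.mtimes" from a pack path: take the final path
 * component and swap a trailing ".pack" for ".mtimes". Returns 0 on
 * success, or -1 if the result does not fit in "out".
 */
static int sketch_mtimes_name(const char *pack_path, char *out, size_t cap)
{
	const char *slash, *base;
	size_t len;
	int n;

	assert(pack_path && out);

	slash = strrchr(pack_path, '/');
	base = slash ? slash + 1 : pack_path;

	len = strlen(base);
	if (len >= 5 && !strcmp(base + len - 5, ".pack"))
		len -= 5;

	n = snprintf(out, cap, "%.*s.mtimes", (int)len, base);
	return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

The tool's argument is therefore a bare file name such as
`pack-<hash>.mtimes`, not a path into the object directory.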

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index 1b186f4fd7..5c0ed1ade7 100644
--- a/Makefile
+++ b/Makefile
@@ -727,6 +727,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index e6ec69cf32..7d472b31fd 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -47,6 +47,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 20756eefdd..0ac4f32955 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -37,6 +37,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.35.1.73.gccc5557600



* [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-03-03  0:20   ` [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().
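The pattern can be shown in isolation with a toy list. Everything here
(`sketch_entry`, `sketch_list` and both functions) is hypothetical and
only mirrors the shape of the change, not Git's packing_data.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ENTRIES 8

struct sketch_entry {
	int id;
	unsigned mtime;
};

struct sketch_list {
	struct sketch_entry entries[SKETCH_MAX_ENTRIES];
	size_t nr;
};

/* Append an entry and hand it back, so the caller need not search. */
static struct sketch_entry *sketch_create_entry(struct sketch_list *list,
						int id)
{
	struct sketch_entry *e;

	assert(list);
	if (list->nr == SKETCH_MAX_ENTRIES)
		return NULL;

	e = &list->entries[list->nr++];
	e->id = id;
	e->mtime = 0;
	return e;
}

/* The linear lookup a caller would need if nothing were returned. */
static struct sketch_entry *sketch_find_entry(struct sketch_list *list,
					      int id)
{
	size_t i;

	assert(list);
	for (i = 0; i < list->nr; i++)
		if (list->entries[i].id == id)
			return &list->entries[i];
	return NULL;
}
```

A caller that wants to set a field right after creation can write
through the returned pointer instead of calling the lookup function.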

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 385970cb7b..3f08a3c63a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1508,13 +1508,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1531,6 +1531,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.35.1.73.gccc5557600



* [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".
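The same object may be seen more than once while accumulating (for
example as a loose object, in an ordinary pack, and in an existing cruft
pack), each time with a different candidate mtime. The patch keeps the
newest one. A minimal standalone sketch of that rule, using the
hypothetical name `sketch_note_mtime`:

```c
#include <assert.h>
#include <stdint.h>

/* Record an observed mtime, keeping only the newest one seen so far. */
static void sketch_note_mtime(uint32_t *recorded, uint32_t seen)
{
	assert(recorded);
	if (seen > *recorded)
		*recorded = seen;
}
```

Starting from a recorded value of 0, the order in which copies of the
object are visited does not affect the final result.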

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain in the repository are marked as
    kept in-core. The packs which are going to be removed (we'll call
    these the redundant ones) are not.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
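The way input lines map onto the in-core kept bit can be sketched as
follows. This follows read_cruft_objects() in the diff below, where
packs listed without a prefix are marked kept and packs listed with a
`-` prefix are not. The `sketch_` names are hypothetical, and the sorted
string_list lookups are replaced by plain linear scans for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_pack_class {
	SKETCH_PACK_FRESH,	/* listed without prefix: kept in-core */
	SKETCH_PACK_DISCARD,	/* listed with '-': not kept */
	SKETCH_PACK_UNKNOWN	/* not listed at all: kept in-core */
};

/* Classify one pack name against the lines read from stdin. */
static enum sketch_pack_class sketch_classify(const char *pack_name,
					      const char *const *lines,
					      size_t nr)
{
	size_t i;

	assert(pack_name && (lines || !nr));

	/* The "fresh" list is consulted first, as in the patch. */
	for (i = 0; i < nr; i++)
		if (*lines[i] && *lines[i] != '-' &&
		    !strcmp(lines[i], pack_name))
			return SKETCH_PACK_FRESH;

	for (i = 0; i < nr; i++)
		if (*lines[i] == '-' && !strcmp(lines[i] + 1, pack_name))
			return SKETCH_PACK_DISCARD;

	return SKETCH_PACK_UNKNOWN;
}

/* Only objects that appear in no kept pack are cruft candidates. */
static int sketch_pack_is_kept(enum sketch_pack_class c)
{
	return c != SKETCH_PACK_DISCARD;
}
```

Treating unlisted packs as kept is the conservative choice: their
objects are left alone because their reachability is not known.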

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5329-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5329-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3f08a3c63a..5ba4fc9c2c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1252,6 +1255,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4062,7 +4244,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4089,6 +4271,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4145,7 +4336,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4153,6 +4344,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else {
diff --git a/object-file.c b/object-file.c
index 8be57f48de..e80da1368d 100644
--- a/object-file.c
+++ b/object-file.c
@@ -996,7 +996,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 9b227661f2..6b025dc670 100644
--- a/object-store.h
+++ b/object-store.h
@@ -334,6 +334,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 void assert_oid_type(const struct object_id *oid, enum object_type expect);
 
 /*
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5329-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.35.1.73.gccc5557600


* [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
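
As a rough illustration of the callback shape described above, here is a
standalone sketch that does not use any of Git's real types. Everything
prefixed with `toy_` is hypothetical and exists only for this example;
in particular, the real iterator walks loose and packed objects
separately, and the strictness of the cutoff comparison here is an
assumption.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for 'struct packed_git' and 'struct object'. */
struct toy_pack {
	const char *name;
	int kept_in_core;	/* analogous to an in-core "kept" mark */
};

struct toy_object {
	const char *id;
	struct toy_pack *pack;	/* NULL for a loose object */
	long offset;		/* offset within 'pack'; unused when loose */
	time_t mtime;
};

/* Callback type: receives the object, its (pack, offset) pair, and its mtime. */
typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack, long offset, time_t mtime);

/* Example callback state: remembers the newest mtime reported so far. */
time_t toy_newest;

void toy_track_newest(const struct toy_object *obj, struct toy_pack *pack,
		      long offset, time_t mtime)
{
	(void)obj;
	(void)pack;
	(void)offset;
	if (mtime > toy_newest)
		toy_newest = mtime;
}

/*
 * Visit every object newer than 'cutoff', optionally skipping objects
 * stored in kept packs, and report each one through 'cb' (which may be
 * NULL). Returns the number of recent objects visited.
 */
int toy_add_recent(const struct toy_object *objs, size_t nr, time_t cutoff,
		   toy_report_fn *cb, int ignore_kept)
{
	int recent = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct toy_object *o = &objs[i];

		if (ignore_kept && o->pack && o->pack->kept_in_core)
			continue;	/* object lives in a kept pack */
		if (o->mtime <= cutoff)
			continue;	/* not recent enough */
		if (cb)
			cb(o, o->pack, o->offset, o->mtime);
		recent++;
	}
	return recent;
}
```

Passing a NULL callback and a zero flag gives the old behavior, which is
how the existing callers are updated in this patch.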

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ba4fc9c2c..1ef333717d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3951,7 +3951,7 @@ static void get_object_list(int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(&revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(&revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index 84e3d0d75e..0eb9909f47 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.35.1.73.gccc5557600


* [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
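
The lazy-loading pattern from the first point can be sketched roughly as
follows. This is a standalone toy, not the pack-mtimes implementation:
`struct lazy_pack`, `lazy_load_mtimes()` and `lazy_object_mtime()` are
hypothetical names, and an in-memory array stands in for the on-disk
.mtimes file.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant bits of 'struct packed_git'. */
struct lazy_pack {
	int is_cruft;
	uint32_t pack_mtime;		/* mtime of the pack file itself */
	const uint32_t *on_disk;	/* stands in for the .mtimes file */
	uint32_t nr;			/* number of objects in the pack */
	uint32_t *mtimes;		/* in-memory table, NULL until first use */
	int nr_loads;			/* how often we really loaded the table */
};

/* Load the table once; later calls return early. */
int lazy_load_mtimes(struct lazy_pack *p)
{
	if (p->mtimes)
		return 0;	/* already opened and parsed */
	if (!p->on_disk)
		return -1;
	p->mtimes = malloc((p->nr ? p->nr : 1) * sizeof(*p->mtimes));
	if (!p->mtimes)
		return -1;
	memcpy(p->mtimes, p->on_disk, p->nr * sizeof(*p->mtimes));
	p->nr_loads++;
	return 0;
}

/*
 * Report the mtime for the object at position 'pos': the per-object
 * value for a cruft pack, or the pack's own mtime otherwise.
 * Returns 0 on success and -1 on failure.
 */
int lazy_object_mtime(struct lazy_pack *p, uint32_t pos, uint32_t *out)
{
	if (!p->is_cruft) {
		*out = p->pack_mtime;
		return 0;
	}
	if (lazy_load_mtimes(p) < 0 || pos >= p->nr)
		return -1;
	*out = p->mtimes[pos];
	return 0;
}
```

Only the first lookup against a cruft pack pays for the load; non-cruft
packs never touch the table at all.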

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 0eb9909f47..9ec8e6bd5b 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.35.1.73.gccc5557600


* [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever
another recent (i.e., non-prunable) unreachable object reaches them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
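
To make the rescue rule concrete, here is a toy model of the steps
above on a tiny in-memory object graph. It is only a sketch under
simplifying assumptions: `struct rescue_obj`, `rescue_walk()` and
`rescue_collect()` are hypothetical names, objects are array indices,
and a recursive walk stands in for the real revision traversal.

```c
#define RESCUE_MAX_LINKS 4

/* Hypothetical object: 'links' holds indices of objects it references. */
struct rescue_obj {
	long mtime;
	int in_kept_pack;
	int nr_links;
	int links[RESCUE_MAX_LINKS];
	int packed;	/* output: 1 if the object goes into the cruft pack */
	int seen;	/* traversal bookkeeping */
};

/* Add 'idx' and everything it reaches, halting at kept objects. */
void rescue_walk(struct rescue_obj *objs, int idx)
{
	struct rescue_obj *o = &objs[idx];
	int i;

	if (o->in_kept_pack || o->seen)
		return;
	o->seen = 1;
	o->packed = 1;	/* old objects reached from here are rescued */
	for (i = 0; i < o->nr_links; i++)
		rescue_walk(objs, o->links[i]);
}

/*
 * Use each recent object outside the kept packs as a traversal tip.
 * Returns the number of objects that end up in the packing list.
 */
int rescue_collect(struct rescue_obj *objs, int nr, long cutoff)
{
	int i, count = 0;

	for (i = 0; i < nr; i++)
		objs[i].packed = objs[i].seen = 0;
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime > cutoff)
			rescue_walk(objs, i);
	for (i = 0; i < nr; i++)
		count += objs[i].packed;
	return count;
}
```

Note that the rescued objects keep their old mtime in this model, so a
later run with the same cutoff would again skip them as tips and only
rediscover them through the walk, as described above.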

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 reachable.h                   |   4 +-
 t/t5329-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1ef333717d..fcac0b5c91 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3439,6 +3439,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3464,6 +3502,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3515,7 +3597,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/reachable.h b/reachable.h
index b776761baa..020a887b99 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,10 +1,10 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-#include "object.h"
-
 struct progress;
 struct rev_info;
+struct object;
+struct packed_git;
 
 typedef void report_recent_object_fn(const struct object *, struct packed_git *,
 				     off_t, time_t);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.35.1.73.gccc5557600


* [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.
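
For context, repack drives `pack-objects --cruft` by writing a list of
pack names to its stdin: packs to treat as kept are listed as-is, and
packs whose objects may move into the cruft pack are prefixed with '-'.
The following standalone sketch shows that input format as inferred
from write_cruft_pack() in this patch and from the t5329 tests. The
helper name `cruft_input_format()` and its parameters are hypothetical,
and existing kept packs (which are also listed unprefixed) are left out
for brevity.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format the pack list into 'buf'. 'fresh' holds the hashes of the
 * just-written packs; 'discard' holds basenames (without ".pack") of
 * existing packs. Returns the number of bytes written, or 0 if 'buf'
 * is too small.
 */
size_t cruft_input_format(char *buf, size_t len, const char *pack_prefix,
			  const char *const *fresh, size_t nr_fresh,
			  const char *const *discard, size_t nr_discard)
{
	size_t used = 0, i;
	int n;

	if (len)
		buf[0] = '\0';
	for (i = 0; i < nr_fresh; i++) {
		/* just-written packs are kept: no prefix */
		n = snprintf(buf + used, len - used, "%s-%s.pack\n",
			     pack_prefix, fresh[i]);
		if (n < 0 || (size_t)n >= len - used)
			return 0;
		used += (size_t)n;
	}
	for (i = 0; i < nr_discard; i++) {
		/* existing packs are candidates for the cruft pack */
		n = snprintf(buf + used, len - used, "-%s.pack\n", discard[i]);
		if (n < 0 || (size_t)n >= len - used)
			return 0;
		used += (size_t)n;
	}
	return used;
}
```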

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 106 +++++++++++-
 t/t5329-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 320 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index 2c3c5d93f8..f80e975a47 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index f908f7d5dd..f7fb88bcf1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@ static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -300,9 +307,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -339,6 +343,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -600,6 +606,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -616,7 +683,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -632,6 +698,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -684,6 +755,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -767,7 +847,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -829,6 +910,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 939cdc297a..06c550c958 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
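
As a rough illustration of the fallback this patch wires into
cmd_repack(), here is a minimal, standalone sketch. The struct and
helper names (`po_args_sketch`, `pick_arg`, `inherit_unset_args`) are
hypothetical stand-ins, not identifiers from builtin/repack.c, and only
two of the four parameters are modelled:

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for repack's pack_objects_args. */
struct po_args_sketch {
	const char *window;
	const char *depth;
};

/* Use the cruft-specific value when one was configured, else the general one. */
static const char *pick_arg(const char *cruft_value, const char *general_value)
{
	return cruft_value ? cruft_value : general_value;
}

/* Fill in every cruft parameter that repack.cruft* left unset. */
static void inherit_unset_args(struct po_args_sketch *cruft,
			       const struct po_args_sketch *general)
{
	cruft->window = pick_arg(cruft->window, general->window);
	cruft->depth = pick_arg(cruft->depth, general->depth);
}
```

With `cruft = { "2", NULL }` and `general = { "3", "50" }`, the cruft
pack would be written with a window of "2" and a depth of "50".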

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 50 ++++++++++++++------
 t/t5329-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 9c413e177e..fd18d1fb89 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -25,3 +25,12 @@ repack.writeBitmaps::
 	space and extra time spent on the initial repack.  This has
 	no effect if multiple packfiles are created.
 	Defaults to true on bare repos, false otherwise.
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index f7fb88bcf1..d61c78e94e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -40,9 +40,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -61,6 +73,15 @@ static int repack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
+
 	return git_default_config(var, value, cb);
 }
 
@@ -153,18 +174,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -689,6 +698,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -743,7 +753,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -918,7 +928,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 06c550c958..e4744e4465 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
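
To make the idea concrete, here is a small standalone sketch of storing
flag bits in a `void *` such as `util`. The helper names
(`util_set_flag`, `util_has_flag`) are hypothetical and do not appear in
builtin/repack.c; `DELETE_PACK` matches the #define added here, and
`CRUFT_PACK` is the second bit that the next patch introduces:

```c
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2	/* introduced by the next patch in the series */

/* Return `util` with `flag` set, leaving the other bits untouched. */
static void *util_set_flag(void *util, uintptr_t flag)
{
	return (void *)((uintptr_t)util | flag);
}

/* Return non-zero if `flag` is set in the bits stored in `util`. */
static int util_has_flag(const void *util, uintptr_t flag)
{
	return ((uintptr_t)util & flag) != 0;
}
```

Testing a single bit, rather than the truthiness of the whole pointer,
is what keeps a pack marked only as cruft from being mistaken for one
that is scheduled for deletion.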

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index d61c78e94e..afa4d51a22 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -561,7 +563,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1000,7 +1002,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1024,7 +1027,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.35.1.73.gccc5557600



* [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when
doing a geometric repack. Since the "make that object reachable" step
doesn't necessarily mean that we'll create a new copy of that object in
one of the packs that will get rolled up as part of a geometric repack,
it's possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
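
The resulting rule for the pre-existing, non-kept packs can be sketched
as a standalone predicate. The function name
`midx_includes_existing_pack` is hypothetical; the two flag values
mirror the #defines in builtin/repack.c after this patch. Packs chosen
by the geometric progression itself are added by a separate loop and
are not modelled here:

```c
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/*
 * Decide whether a pre-existing, non-kept pack is listed in the MIDX.
 * `flags` stands for the bits stored in the item's util pointer.
 */
static int midx_includes_existing_pack(uintptr_t flags, int geometric)
{
	if (geometric)
		/* nothing is marked DELETE_PACK here: only cruft packs are added */
		return (flags & CRUFT_PACK) != 0;
	/* all-into-one: keep every pack that is not about to be deleted */
	return !(flags & DELETE_PACK);
}
```

Before this patch, the geometric case would have returned 0 for every
such pack, which is how the cruft pack holding the now-reachable object
fell out of the MIDX.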

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5329-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index afa4d51a22..59b60cd309 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -158,8 +159,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -561,6 +565,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index e4744e4465..13158e4ab7 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600



* [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  0:21   ` [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
  2022-03-03  1:29   ` [PATCH v3 00/17] " Derrick Stolee
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
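
The choice that `git gc` makes between the repack modes can be
summarized with a small standalone sketch. The function name
`gc_repack_mode` is hypothetical; the branch order follows
add_repack_all_option() as modified below, and the expiration arguments
that accompany `--cruft` and `-A` are left out:

```c
#include <string.h>

/*
 * Return the repack mode flag that `git gc` would pass along, given
 * the prune expiration (may be NULL) and whether cruft packs are on.
 */
static const char *gc_repack_mode(const char *prune_expire, int cruft_packs)
{
	if (prune_expire && !strcmp(prune_expire, "now"))
		return "-a";		/* nothing survives, so no cruft pack is needed */
	if (cruft_packs)
		return "--cruft";	/* keep unreachable objects in a cruft pack */
	return "-A";			/* keep unreachable objects as loose ones */
}
```

Note that an expiration of "now" takes priority over the cruft setting,
so `git gc --cruft --prune=now` ends up running a plain `-a` repack.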

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5329-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index ffaf0daf5d..11f5150234 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -153,6 +154,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -332,7 +334,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -552,6 +558,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -671,6 +678,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 13158e4ab7..3910e186ef 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.35.1.73.gccc5557600


* [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-03-03  0:21   ` Taylor Blau
  2022-03-03  1:29   ` [PATCH v3 00/17] " Derrick Stolee
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-03  0:21 UTC (permalink / raw)
  To: git; +Cc: tytso, derrickstolee, gitster, larsxschneider

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file
which may incur some wasted effort, especially if there are a lot of
cruft objects and they are freshened infrequently.

Instead, force the freshening code to avoid an optimizing write by
writing out the object loose and letting it pick up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
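
As a rough illustration of that decision, here is a self-contained
sketch. It is not the code in object-file.c: `struct pack_stub` and
`freshen_packed_stub()` are hypothetical stand-ins for the real pack
structure and for freshen_packed_object(), and setting a flag stands in
for actually updating the pack's mtime on disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of a pack that matter here. */
struct pack_stub {
	bool is_cruft;   /* pack is accompanied by a .mtimes file */
	bool freshened;  /* pack mtime was already bumped by this process */
};

/*
 * Returns true when the packed copy counts as freshened, and false when
 * the caller should fall back to writing the object loose so that the
 * loose copy picks up a current mtime.
 */
static bool freshen_packed_stub(struct pack_stub *p)
{
	assert(p != NULL);

	if (p->is_cruft)
		return false;	/* never bump a cruft pack; go loose instead */
	if (p->freshened)
		return true;	/* nothing left to do */
	p->freshened = true;	/* stands in for updating the pack's mtime */
	return true;
}
```

A false return for a cruft pack is what pushes the caller down the
"write it loose" path described above.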

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5329-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index e80da1368d..65b8df7fb6 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1989,6 +1989,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 3910e186ef..4681558612 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.35.1.73.gccc5557600

* Re: [PATCH v3 00/17] cruft packs
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-03-03  0:21   ` [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-03-03  1:29   ` Derrick Stolee
  17 siblings, 0 replies; 201+ messages in thread
From: Derrick Stolee @ 2022-03-03  1:29 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: tytso, gitster, larsxschneider

On 3/2/2022 7:20 PM, Taylor Blau wrote:
> Here is a small reroll of my series to implement "cruft packs", based on
> Stolee's review.
> 
> The changes here are minor, and mostly are limited to removing a
> redundant "if" statement, avoiding an unnecessary header include, and
> moving the tests (again!) to t5329's territory.
> 
> As always, a range-diff is below. Thanks in advance for taking another
> look!

This range-diff satisfies my comments. Thanks!
-Stolee

* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
  2022-03-03 23:32       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-03 16:30 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Wed, Mar 02 2022, Taylor Blau wrote:

> Consolidate these into a single definition in chunk-format.h. It's not
> clear that this is the best header to define this function in, but it
> should do for now.
> [...]
> +
> +uint8_t oid_version(const struct git_hash_algo *algop)
> +{
> +	switch (hash_algo_by_ptr(algop)) {
> +	case GIT_HASH_SHA1:
> +		return 1;
> +	case GIT_HASH_SHA256:
> +		return 2;

Not a new issue, but I wonder why these don't return hash_algo_by_ptr
aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
straightforward & obvious code that avoids re-hardcoding the magic
constants:

	const int algo = hash_algo_by_ptr(algop);

	switch (algo) {
	case GIT_HASH_SHA1:
	case GIT_HASH_SHA256:
		return algo;
	default:
		[...]
	}

Probably best left as a later cleanup. FWIW I came up with this on top
of my designated init series:

diff --git a/hash.h b/hash.h
index 5d40368f18a..fd710ec6ae8 100644
--- a/hash.h
+++ b/hash.h
@@ -86,14 +86,18 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
  * field for being non-zero.  Use the name field for user-visible situations and
  * the format_id field for fixed-length fields on disk.
  */
-/* An unknown hash function. */
-#define GIT_HASH_UNKNOWN 0
-/* SHA-1 */
-#define GIT_HASH_SHA1 1
-/* SHA-256  */
-#define GIT_HASH_SHA256 2
-/* Number of algorithms supported (including unknown). */
-#define GIT_HASH_NALGOS (GIT_HASH_SHA256 + 1)
+enum git_hash_algo_name {
+	/* An unknown hash function. */
+	GIT_HASH_UNKNOWN,
+	/* SHA-1 */
+	GIT_HASH_SHA1,
+	GIT_HASH_SHA256,
+	/*
+	 * Number of algorithms supported (including unknown). This
+	 * must be kept last!
+	 */
+	GIT_HASH_NALGOS,
+};
 
 /* "sha1", big-endian */
 #define GIT_SHA1_FORMAT_ID 0x73686131
diff --git a/object-file.c b/object-file.c
index 5074471b471..f2d54a86969 100644
--- a/object-file.c
+++ b/object-file.c
@@ -166,7 +166,7 @@ static void git_hash_unknown_final_oid(struct object_id *oid, git_hash_ctx *ctx)
 }
 
 const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
-	{
+	[GIT_HASH_UNKNOWN] = {
 		.name = NULL,
 		.format_id = 0x00000000,
 		.rawsz = 0,
@@ -181,7 +181,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.empty_blob = NULL,
 		.null_oid = NULL,
 	},
-	{
+	[GIT_HASH_SHA1] = {
 		.name = "sha1",
 		.format_id = GIT_SHA1_FORMAT_ID,
 		.rawsz = GIT_SHA1_RAWSZ,
@@ -196,7 +196,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
 		.empty_blob = &empty_blob_oid,
 		.null_oid = &null_oid_sha1,
 	},
-	{
+	[GIT_HASH_SHA256] = {
 		.name = "sha256",
 		.format_id = GIT_SHA256_FORMAT_ID,
 		.rawsz = GIT_SHA256_RAWSZ,

* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
  2022-03-03 23:35       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-03 16:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Wed, Mar 02 2022, Taylor Blau wrote:

> Now that the `.mtimes` format is defined, supplement the pack-write API
> to be able to conditionally write an `.mtimes` file along with a pack by
> setting an additional flag and passing an oidmap that contains the
> timestamps corresponding to each object in the pack.
> [...]
>  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
> diff --git a/pack.h b/pack.h
> index fd27cfdfd7..01d385903a 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -44,6 +44,7 @@ struct pack_idx_option {
>  #define WRITE_IDX_STRICT 02
>  #define WRITE_REV 04
>  #define WRITE_REV_VERIFY 010
> +#define WRITE_MTIMES 020
>  
>  	uint32_t version;
>  	uint32_t off32_limit;

Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
as 8|2, but there's no 8 there; ditto this new 020 that's the same as
1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.

* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
@ 2022-03-03 23:32       ` Taylor Blau
  2022-03-04  0:16         ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-03 23:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Thu, Mar 03, 2022 at 05:30:44PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Mar 02 2022, Taylor Blau wrote:
>
> > Consolidate these into a single definition in chunk-format.h. It's not
> > clear that this is the best header to define this function in, but it
> > should do for now.
> > [...]
> > +
> > +uint8_t oid_version(const struct git_hash_algo *algop)
> > +{
> > +	switch (hash_algo_by_ptr(algop)) {
> > +	case GIT_HASH_SHA1:
> > +		return 1;
> > +	case GIT_HASH_SHA256:
> > +		return 2;
>
> Not a new issue, but I wonder why these don't return hash_algo_by_ptr
> aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
> straightforward & obvious code that avoids re-hardcoding the magic
> constants:

Hmm. Certainly the value returned by hash_algo_by_ptr() works for SHA-1
and SHA-256, but writes may want to use a different value for future
hashes. Not that this couldn't be changed then, but my feeling is that
the existing code is clearer since it avoids the reader having to jump
to hash_algo_by_ptr()'s implementation to figure out what it returns.

Thanks,
Taylor

* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
@ 2022-03-03 23:35       ` Taylor Blau
  2022-03-04 10:40         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-03 23:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Thu, Mar 03, 2022 at 05:45:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Mar 02 2022, Taylor Blau wrote:
>
> > Now that the `.mtimes` format is defined, supplement the pack-write API
> > to be able to conditionally write an `.mtimes` file along with a pack by
> > setting an additional flag and passing an oidmap that contains the
> > timestamps corresponding to each object in the pack.
> > [...]
> >  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
> > diff --git a/pack.h b/pack.h
> > index fd27cfdfd7..01d385903a 100644
> > --- a/pack.h
> > +++ b/pack.h
> > @@ -44,6 +44,7 @@ struct pack_idx_option {
> >  #define WRITE_IDX_STRICT 02
> >  #define WRITE_REV 04
> >  #define WRITE_REV_VERIFY 010
> > +#define WRITE_MTIMES 020
> >
> >  	uint32_t version;
> >  	uint32_t off32_limit;
>
> Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
> prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
> as 8|2, but there's no 8 there; ditto this new 020 that's the same as
> 1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.

I'm not sure I understand. These are octals, so octal "20" (or decimal
16) just gives us bit 5 -- the next available -- by itself.

Thanks,
Taylor

* Re: [PATCH v3 04/17] chunk-format.h: extract oid_version()
  2022-03-03 23:32       ` Taylor Blau
@ 2022-03-04  0:16         ` Junio C Hamano
  0 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-03-04  0:16 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, tytso, derrickstolee,
	larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> On Thu, Mar 03, 2022 at 05:30:44PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Mar 02 2022, Taylor Blau wrote:
>>
>> > Consolidate these into a single definition in chunk-format.h. It's not
>> > clear that this is the best header to define this function in, but it
>> > should do for now.
>> > [...]
>> > +
>> > +uint8_t oid_version(const struct git_hash_algo *algop)
>> > +{
>> > +	switch (hash_algo_by_ptr(algop)) {
>> > +	case GIT_HASH_SHA1:
>> > +		return 1;
>> > +	case GIT_HASH_SHA256:
>> > +		return 2;
>>
>> Not a new issue, but I wonder why these don't return hash_algo_by_ptr
>> aka GIT_HASH_WHATEVER here. I.e. this is the same as this more
>> straightforward & obvious code that avoids re-hardcoding the magic
>> constants:
>
> Hmm. Certainly the value returned by hash_algo_by_ptr() works for SHA-1
> and SHA-256, but writes may want to use a different value for future
> hashes. Not that this couldn't be changed then, but my feeling is that
> the existing code is clearer since it avoids the reader having to jump
> to hash_algo_by_ptr()'s implementation to figure out what it returns.

If we promise that everywhere in file formats where we identify what
hash is used, we write "1" for SHA1 and "2" for SHA256, it would be
natural to define GIT_HASH_SHA1 to "1" and GIT_HASH_SHA256 to "2".

And readers do not have to "figure out", if that is a clearly
written guideline to represent the hash used in file formats.  As
written, the readers who -assume- such a guideline is there must
figure out from hash.h that GIT_HASH_SHA1 is 1 and GIT_HASH_SHA256
is 2 to be convinced that the above code is correct.

Now, hash.h says GIT_HASH_SHA1 is 1 and GIT_HASH_SHA256 is 2.  So

	int oidv = hash_algo_by_ptr(algop);
	switch (oidv) {
	case GIT_HASH_SHA1:
	case GIT_HASH_SHA256:
		return oidv;
	default:
		die();
	}

should work already.  To put it differently, if this didn't work, we
should renumber GIT_HASH_SHA1 and GIT_HASH_SHA256 to make it work, I
would think.  If not, we have a huge mess on our hands, as constants
used in on-disk file formats are hard (almost impossible) to change.

An overly generic function name oid_version() cannot be justified
unless the same constants are used everywhere.  I see hits from 'git
grep oid_version' in

    chunk-format.c (obviously)
    commit-graph.c
    midx.c
    pack-write.c

so presumably these types of files are using the "canonical"
numbering.

And when we introduce GIT_HASH_SHA3 or whatever, we should give it a
number that this function can return (i.e. from the range 3..255).
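
To make the equivalence concrete, here is a minimal standalone sketch.
It is only an illustration: the `GIT_HASH_*` values are the ones quoted
from hash.h earlier in this thread, while `oid_version_explicit()`,
`oid_version_direct()` and `check_versions_agree()` are hypothetical
names. Both functions take the algorithm number directly instead of a
`struct git_hash_algo *`, and return 0 for an unknown algorithm where
the snippets above would die().

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted from hash.h in this thread. */
#define GIT_HASH_UNKNOWN 0
#define GIT_HASH_SHA1 1
#define GIT_HASH_SHA256 2

/* Shape of the patch under review: constants spelled out per case. */
static uint8_t oid_version_explicit(int algo)
{
	switch (algo) {
	case GIT_HASH_SHA1:
		return 1;
	case GIT_HASH_SHA256:
		return 2;
	default:
		return 0;	/* the real code dies here */
	}
}

/* Shape suggested in review: return the algorithm number itself. */
static uint8_t oid_version_direct(int algo)
{
	switch (algo) {
	case GIT_HASH_SHA1:
	case GIT_HASH_SHA256:
		return (uint8_t)algo;
	default:
		return 0;	/* the real code dies here */
	}
}

/* The two agree only while the in-core and on-disk numbering coincide. */
static void check_versions_agree(void)
{
	int algo;

	for (algo = GIT_HASH_UNKNOWN; algo <= GIT_HASH_SHA256; algo++)
		assert(oid_version_explicit(algo) == oid_version_direct(algo));
}
```

If a future hash were ever given an on-disk number different from its
in-core one, the two variants would diverge, which is the trade-off
being debated here.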

* Re: [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files
  2022-03-03 23:35       ` Taylor Blau
@ 2022-03-04 10:40         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-04 10:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider


On Thu, Mar 03 2022, Taylor Blau wrote:

> On Thu, Mar 03, 2022 at 05:45:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Mar 02 2022, Taylor Blau wrote:
>>
>> > Now that the `.mtimes` format is defined, supplement the pack-write API
>> > to be able to conditionally write an `.mtimes` file along with a pack by
>> > setting an additional flag and passing an oidmap that contains the
>> > timestamps corresponding to each object in the pack.
>> > [...]
>> >  void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
>> > diff --git a/pack.h b/pack.h
>> > index fd27cfdfd7..01d385903a 100644
>> > --- a/pack.h
>> > +++ b/pack.h
>> > @@ -44,6 +44,7 @@ struct pack_idx_option {
>> >  #define WRITE_IDX_STRICT 02
>> >  #define WRITE_REV 04
>> >  #define WRITE_REV_VERIFY 010
>> > +#define WRITE_MTIMES 020
>> >
>> >  	uint32_t version;
>> >  	uint32_t off32_limit;
>>
>> Why the hardcoding? The 010 was added in your 8ef50d9958f (pack-write.c:
>> prepare to write 'pack-*.rev' files, 2021-01-25). That would be the same
>> as 8|2, but there's no 8 there; ditto this new 020 that's the same as
>> 1<<4 | 1<<2, but there's no "16", just WRITE_REV=4.
>
> I'm not sure I understand. These are octals, so octal "20" (or decimal
> 16) just gives us bit 5 -- the next available -- by itself.

Urgh, tired/rushed eyes yesterday. I managed to read these as decimals,
sorry.

I see from:

    git grep 'define[^0-9]*(\b020\b|\b16\b|1.*<<.*\b4\b)[^0-9]*$'

That I managed to patch what seems to be one of two other places in the
codebase using it recently (that goes >=020) in 245b9488150 (cat-file:
use GET_OID_ONLY_TO_DIE in --(textconv|filters), 2021-12-28).

Anyway, I think nothing needs to be done here. If you ever feel like
some churn here I think converting it to the almost ubiquitous "1 << N"
style we use almost everywhere else would be an improvement :)
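
For reference, a sketch of what that conversion might look like. This
is an illustration only, not a proposed patch: the flag names are the
ones from the quoted hunk, bit 0 is left for the flag defined just above
the hunk (not shown here), and `check_flag_values()` and
`is_single_bit()` are hypothetical helpers.

```c
#include <assert.h>

/* The flags from the quoted hunk, respelled in "1 << N" form. */
#define WRITE_IDX_STRICT (1 << 1)
#define WRITE_REV        (1 << 2)
#define WRITE_REV_VERIFY (1 << 3)
#define WRITE_MTIMES     (1 << 4)

/* Each shift names the same value as the octal literal it replaces. */
static void check_flag_values(void)
{
	assert(WRITE_IDX_STRICT == 02);
	assert(WRITE_REV == 04);
	assert(WRITE_REV_VERIFY == 010);	/* octal 10 is decimal 8 */
	assert(WRITE_MTIMES == 020);		/* octal 20 is decimal 16 */
}

/* Returns nonzero when exactly one bit is set in `flag`. */
static int is_single_bit(unsigned int flag)
{
	return flag && !(flag & (flag - 1));
}
```

The shift form makes it harder to misread a flag as a decimal number,
since the bit position is written out explicitly.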

Sorry!

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-03-07 18:03     ` Jonathan Nieder
  2022-03-22  1:16       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-03-07 18:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:

> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.

Sorry for the very slow review!  I've mentioned a few times that this
overlaps in interesting ways with the gc mechanism described in
hash-function-transition.txt, so I'd like to compare and see how they
interact.

[...]
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,97 @@
[...]
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose objects and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes.

It's worth noting that this behavior is already racy.  That is because
when an unreachable object becomes newly reachable, we do not update
its mtime and the mtimes of every object reachable from it, so if it
then becomes transiently unreachable again then it can be wrongly
collected.

[...]
>                                these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.

Yes!  I'm happy we're making progress on this.

> +
> +== Cruft packs
> +
> +A cruft pack eliminates the need for storing unreachable objects in a loose
> +state by including the per-object mtimes in a separate file alongside a single
> +pack containing all loose objects.

Can this doc say a little about how "git prune" handles these files?
In particular, does a non cruft pack aware copy of Git (or JGit,
libgit2, etc) do the right thing or does it fight with this mechanism?
If the latter, do we have a repository extension (extensions.*) to
prevent that?

[...]
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.

As a point of comparison, the design in hash-function-transition uses
a single timestamp for the whole pack.  During read operations, objects
in a cruft pack are considered present; during writes, they are
considered _not present_ so that if we want to make a cruft object
newly present then we put a copy of it in a new pack.

Advantage of the mtimes file approach:
- less duplication of storage: a revived object is only stored once,
  in a cruft pack, and then the next gc can "graduate" it out of the
  cruft pack and shrink the cruft pack
- less effect on non-gc Git code: writes don't need to know that any
  cruft objects referenced need to be copied into a new pack

Advantages of the mtime per cruft pack approach:
- easy expiration: once a cruft pack has reached its expiration date,
  it can be deleted as a whole
- less I/O churn: a cruft pack stays as-is until combined into another
  cruft pack or deleted.  There is no frequently-modified mtimes file
  associated to it
- informs the storage layer about what is likely to be accessed: cruft
  packs can get filesystem attributes to put them in less-optimized
  storage since they are likely to be less frequently read
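
To make the difference in expiration granularity concrete, here is a
minimal sketch. It is not taken from either implementation: the
function names are hypothetical, and "at or before the cutoff" is an
assumption about how the expiration boundary is drawn.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Per-object mtimes: count the objects whose own mtime is at or before
 * the cutoff. Only those would be dropped; the rest are carried over
 * into the next cruft pack.
 */
static size_t expired_per_object(const time_t *mtimes, size_t nr,
				 time_t cutoff)
{
	size_t i, expired = 0;

	assert(mtimes != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (mtimes[i] <= cutoff)
			expired++;
	return expired;
}

/*
 * One mtime per pack: expiration is all-or-nothing, so either every
 * object in the pack goes away or none does.
 */
static size_t expired_per_pack(time_t pack_mtime, size_t nr, time_t cutoff)
{
	return pack_mtime <= cutoff ? nr : 0;
}
```

With object mtimes of 10, 20 and 30 and a cutoff of 20, the per-object
scheme expires two objects, while the per-pack scheme expires either
none or all three depending on the single timestamp kept for the pack.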

[...]
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Storing unreachable objects in multiple cruft packs.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Storing unreachable objects among multiple cruft packs (e.g., creating a new
> +cruft pack during each repacking operation including only unreachable objects
> +which aren't already stored in an earlier cruft pack) is significantly more
> +complicated to construct, and so isn't pursued here. The obvious drawback to
> +the current implementation is that the entire cruft pack must be re-written from
> +scratch.

This doesn't mention the approach described in
hash-function-transition.txt (and that's already implemented and has
been in use for many years in JGit's DfsRepository).  Does that mean
you aren't aware of it?

Thanks,
Jonathan

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-07 18:03     ` Jonathan Nieder
@ 2022-03-22  1:16       ` Taylor Blau
  2022-03-22 21:45         ` Jonathan Nieder
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-22  1:16 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
> Sorry for the very slow review!  I've mentioned a few times that this
> overlaps in interesting ways with the gc mechanism described in
> hash-function-transition.txt, so I'd like to compare and see how they
> interact.

Sorry for my equally-slow reply ;). I was on vacation last week and
wasn't following the list closely.

> > +Unreachable objects aren't removed immediately, since doing so could race with
> > +an incoming push which may reference an object which is about to be deleted.
> > +Instead, those unreachable objects are stored as loose objects and stay that way
> > +until they are older than the expiration window, at which point they are removed
> > +by linkgit:git-prune[1].
> > +
> > +Git must store these unreachable objects loose in order to keep track of their
> > +per-object mtimes.
>
> It's worth noting that this behavior is already racy.  That is because
> when an unreachable object becomes newly reachable, we do not update
> its mtime and the mtimes of every object reachable from it, so if it
> then becomes transiently unreachable again then it can be wrongly
> collected.

Just to be clear, the race here only happens if the object in question
becomes reachable _after_ a pruning GC determines its mtime. If that's
the case, then the object will indeed be wrongly collected. This is
consistent with the existing behavior (which is racy in the exact same
way).

(After re-reading what you wrote and my response, I think we are saying
the exact same thing, but it doesn't hurt to think aloud).

> > +
> > +== Cruft packs
> > +
> > +A cruft pack eliminates the need for storing unreachable objects in a loose
> > +state by including the per-object mtimes in a separate file alongside a single
> > +pack containing all loose objects.
>
> Can this doc say a little about how "git prune" handles these files?
> In particular, does a non cruft pack aware copy of Git (or JGit,
> libgit2, etc) do the right thing or does it fight with this mechanism?
> If the latter, do we have a repository extension (extensions.*) to
> prevent that?

I mentioned this in much more detail in [1], but the answer is that the
cruft pack looks like any other pack, it just happens to have another
metadata file (the .mtimes one) attached to it. So other implementations
of Git should treat it as they would any other pack. Like I mentioned in
[1], cruft packs were designed with the explicit goal of not requiring a
repository extension.

> > +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> > +     timestamps.
>
> As a point of comparison, the design in hash-function-transition uses
> a single timestamp for the whole pack.  During read operations, objects
> in a cruft pack are considered present; during writes, they are
> considered _not present_ so that if we want to make a cruft object
> newly present then we put a copy of it in a new pack.
>
> Advantage of the mtimes file approach:
> - less duplication of storage: a revived object is only stored once,
>   in a cruft pack, and then the next gc can "graduate" it out of the
>   cruft pack and shrink the cruft pack
> - less effect on non-gc Git code: writes don't need to know that any
>   cruft objects referenced need to be copied into a new pack
>
> Advantages of the mtime per cruft pack approach:
> - easy expiration: once a cruft pack has reached its expiration date,
>   it can be deleted as a whole
> - less I/O churn: a cruft pack stays as-is until combined into another
>   cruft pack or deleted.  There is no frequently-modified mtimes file
>   associated to it
> - informs the storage layer about what is likely to be accessed: cruft
>   packs can get filesystem attributes to put them in less-optimized
>   storage since they are likely to be less frequently read
>
> [...]

The key advantage of cruft packs is that you can expire unreachable
objects piecemeal while still retaining the benefit of being able to
de-duplicate cruft objects and store them packed against each other.

> > +Notable alternatives to this design include:
>
> This doesn't mention the approach described in
> hash-function-transition.txt (and that's already implemented and has
> been in use for many years in JGit's DfsRepository).  Does that mean
> you aren't aware of it?

Implementing the UNREACHABLE_GARBAGE concept from
hash-function-transition.txt in cruft-pack terms would be equivalent to
not writing the mtimes file at all. This follows from the fact that a
pre-cruft packs implementation of Git considers a packed object's mtime
to be the same as the pack it's contained in. (I'm deliberately
avoiding any details from the h-f-t document regarding re-writing
objects contained in a garbage pack here, since this is separate from
the pack structure itself (and could easily be implemented on top of
cruft packs)).

So I'm not sure what the alternative we'd list would be, since it
removes the key feature of the design of cruft packs.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YiZMhuI%2FDdpvQ%2FED@nand.local/

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22  1:16       ` Taylor Blau
@ 2022-03-22 21:45         ` Jonathan Nieder
  2022-03-22 22:02           ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-03-22 21:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:
> On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:

>> Sorry for the very slow review!  I've mentioned a few times that this
>> overlaps in interesting ways with the gc mechanism described in
>> hash-function-transition.txt, so I'd like to compare and see how they
>> interact.
>
> Sorry for my equally-slow reply ;). I was on vacation last week and
> wasn't following the list closely.

No problem --- thanks for getting back to me.

[...]
> (After re-reading what you wrote and my response, I think we are saying
> the exact same thing, but it doesn't hurt to think aloud).

Great.  Can the doc cover this?  I think it would be helpful to make
that easy to find for others with similar questions.

If it's a matter of finding enough time to write some text, let me
know and I can try to find some time to help.

[...]
>> Can this doc say a little about how "git prune" handles these files?
>> In particular, does a non cruft pack aware copy of Git (or JGit,
>> libgit2, etc) do the right thing or does it fight with this mechanism?
>> If the latter, do we have a repository extension (extensions.*) to
>> prevent that?
>
> I mentioned this in much more detail in [1], but the answer is that the
> cruft pack looks like any other pack, it just happens to have another
> metadata file (the .mtimes one) attached to it. So other implementations
> of Git should treat it as they would any other pack. Like I mentioned in
> [1], cruft packs were designed with the explicit goal of not requiring a
> repository extension.

Sorry, the above seems like it's answering a different question than I
asked.  The doc in Documentation/technical/ seems like a natural place
to describe what semantics the new .mtimes file has, and I didn't find
that there.  Is there a different piece of documentation I should have
been looking at?

Can you tell me a little more about why we would want _not_ to have a
repository format extension?  To me, it seems like a fairly simple
addition that would drastically reduce the cognitive overload for
people considering making use of this feature.

[...]
> The key advantage of cruft packs is that you can expire unreachable
> objects piecemeal while still retaining the benefit of being able to
> de-duplicate cruft objects and store them packed against each other.

Can you say a little more about this?  My experience with the similar
feature in JGit is that it has been helpful to be able to expire a
cruft pack altogether; since objects that became reachable around the
same time get packed at the same time, it's not obvious to me what
benefit this extra piecemeal capability brings.

That doesn't mean the benefit doesn't exist, just that it seems like
there's a piece of context I'm still missing.

>>> +Notable alternatives to this design include:
>>
>> This doesn't mention the approach described in
>> hash-function-transition.txt (and that's already implemented and has
>> been in use for many years in JGit's DfsRepository).  Does that mean
>> you aren't aware of it?
>
> Implementing the UNREACHABLE_GARBAGE concept from
> hash-function-transition.txt in cruft-pack terms would be equivalent to
> not writing the mtimes file at all. This follows from the fact that a
> pre-cruft packs implementation of Git considers a packed object's mtime
> to be the same as the pack it's contained in. (I'm deliberately
> avoiding any details from the h-f-t document regarding re-writing
> objects contained in a garbage pack here, since this is separate from
> the pack structure itself (and could easily be implemented on top of
> cruft packs)).
>
> So I'm not sure what the alternative we'd list would be, since it
> removes the key feature of the design of cruft packs.

Sorry, I don't understand this answer either.  Do you mean to say that
JGit's DfsRepository does not in fact have a cruft packs like feature
that is live in the wild?  Or that that feature is equivalent to not
having such a feature?  Or something else?

To be clear, I'm not trying to say that that's superior to what you've
proposed here --- only that documenting the comparison would be
useful.

Puzzled,
Jonathan

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 21:45         ` Jonathan Nieder
@ 2022-03-22 22:02           ` Taylor Blau
  2022-03-22 23:04             ` Jonathan Nieder
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-22 22:02 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
>
> >> Sorry for the very slow review!  I've mentioned a few times that this
> >> overlaps in interesting ways with the gc mechanism described in
> >> hash-function-transition.txt, so I'd like to compare and see how they
> >> interact.
> >
> > Sorry for my equally-slow reply ;). I was on vacation last week and
> > wasn't following the list closely.
>
> No problem --- thanks for getting back to me.
>
> [...]
> > (After re-reading what you wrote and my response, I think we are saying
> > the exact same thing, but it doesn't hurt to think aloud).
>
> Great.  Can the doc cover this?  I think it would be helpful to make
> that easy to find for others with similar questions.

I believe the doc covers this already, see the paragraph beginning with
"Unreachable objects aren't removed immediately...".

> >> Can this doc say a little about how "git prune" handles these files?
> >> In particular, does a non cruft pack aware copy of Git (or JGit,
> >> libgit2, etc) do the right thing or does it fight with this mechanism?
> >> If the latter, do we have a repository extension (extensions.*) to
> >> prevent that?
> >
> > I mentioned this in much more detail in [1], but the answer is that the
> > cruft pack looks like any other pack, it just happens to have another
> > metadata file (the .mtimes one) attached to it. So other implementations
> > of Git should treat it as they would any other pack. Like I mentioned in
> > [1], cruft packs were designed with the explicit goal of not requiring a
> > repository extension.
>
> Sorry, the above seems like it's answering a different question than I
> asked.  The doc in Documentation/technical/ seems like a natural place
> to describe what semantics the new .mtimes file has, and I didn't find
> that there.  Is there a different piece of documentation I should have
> been looking at?

Are you looking for a technical description of the mtimes file? If so,
there is a section in Documentation/technical/pack-format.txt (added in
"pack-mtimes: support reading .mtimes files") that explains this.

> Can you tell me a little more about why we would want _not_ to have a
> repository format extension?  To me, it seems like a fairly simple
> addition that would drastically reduce the cognitive overload for
> people considering making use of this feature.

There is no reason to prevent a pre-cruft packs version of Git from
reading/writing a repository that uses cruft packs, since the two
versions will still function as normal. Since there's no need to prevent
the old version from interacting with a repository that has cruft packs,
we wouldn't want to enforce an unnecessary boundary with an extension.

> [...]
> > The key advantage of cruft packs is that you can expire unreachable
> > objects in piecemeal while still retaining the benefit of being able to
> > de-duplicate cruft objects and store them packed against each other.
>
> Can you say a little more about this?  My experience with the similar
> feature in JGit is that it has been helpful to be able to expire a
> cruft pack altogether; since objects that became reachable around the
> same time get packed at the same time, it's not obvious to me what
> benefit this extra piecemeal capability brings.
>
> That doesn't mean the benefit doesn't exist, just that it seems like
> there's a piece of context I'm still missing.

Expiring objects in piecemeal is somewhat interesting, but I think I was
reaching a little too far when I said it was the "key benefit". It does
have some nice properties, like being able to store cruft objects as
deltas against other cruft objects which might get pruned at a different
time (though, of course, you'll need to re-delta them in the case you do
prune an object which is the base of another cruft object).

But the issue with having multiple cruft packs is that the semantics get
significantly more complicated. E.g., if you have an object represented
in multiple cruft packs, which mtime do you use? If you want to prune
it, you suddenly may have many packs you need to update and keep track
of.

> >>> +Notable alternatives to this design include:
> >>
> >> This doesn't mention the approach described in
> >> hash-function-transition.txt (and that's already implemented and has
> >> been in use for many years in JGit's DfsRepository).  Does that mean
> >> you aren't aware of it?
> >
> > Implementing the UNREACHABLE_GARBAGE concept from
> > hash-function-transition.txt in cruft pack-terms would be equivalent to
> > not writing the mtimes file at all. This follows from the fact that a
> > pre-cruft packs implementation of Git considers a packed object's mtime
> > to be the same as the pack it's contained in. (I'm deliberately
> > avoiding any details from the h-f-t document regarding re-writing
> > objects contained in a garbage pack here, since this is separate from
> > the pack structure itself (and could easily be implemented on top of
> > cruft packs)).
> >
> > So I'm not sure what the alternative we'd list would be, since it
> > removes the key feature of the design of cruft packs.
>
> Sorry, I don't understand this answer either.  Do you mean to say that
> JGit's DfsRepository does not in fact have a cruft packs like feature
> that is live in the wild?  Or that that feature is equivalent to not
> having such a feature?  Or something else?
>
> To be clear, I'm not trying to say that that's superior to what you've
> proposed here --- only that documenting the comparison would be
> useful.

I'm not familiar enough with JGit (or its DfsRepository class) to know
how to answer this. I was comparing cruft packs to the
UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
doc, and noting the differences there.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 22:02           ` Taylor Blau
@ 2022-03-22 23:04             ` Jonathan Nieder
  2022-03-23  1:01               ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-03-22 23:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

Hi,

Taylor Blau wrote:
> On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:

>> Great.  Can the doc cover this?  I think it would be helpful to make
>> that easy to find for others with similar questions.
>
> I believe the doc covers this already, see the paragraph beginning with
> "Unreachable objects aren't removed immediately...".

Thanks.  I just reread that section and it didn't say anything obvious
about the race that continues to exist and whether cruft packs address
it.

[...]
>> Sorry, the above seems like it's answering a different question than I
>> asked.  The doc in Documentation/technical/ seems like a natural place
>> to describe what semantics the new .mtimes file has, and I didn't find
>> that there.  Is there a different piece of documentation I should have
>> been looking at?
>
> Are you looking for a technical description of the mtimes file? If so,
> there is a section in Documentation/technical/pack-format.txt (added in
> "pack-mtimes: support reading .mtimes files") that explains this.

I see --- is the idea that cruft-packs.txt means to refer to
pack-format.txt for the details, and cruft-packs is an overview of some
other, non-detail aspects?

I just checked pack-format.txt and it didn't describe the semantics
(what a Git implementation is expected to do when it sees an mtimes
file).  For example, in Documentation/technical/cruft-packs.txt, the
kind of thing I'd expect to see is

- what does an mtime value in the mtimes file represent?  When is it
  meant to be updated?
- what guarantees are present about when an object is safe to be
  pruned?

[...]
>> Can you tell me a little more about why we would want _not_ to have a
>> repository format extension?  To me, it seems like a fairly simple
>> addition that would drastically reduce the cognitive overload for
>> people considering making use of this feature.
>
> There is no reason to prevent a pre-cruft packs version of Git from
> reading/writing a repository that uses cruft packs, since the two
> versions will still function as normal. Since there's no need to prevent
> the old version from interacting with a repository that has cruft packs,
> we wouldn't want to enforce an unnecessary boundary with an extension.

Does "function as normal" include repository maintenance operations
like "git maintenance", "git gc", and "git prune"?  If so, this seems
like something very useful to describe in the cruft-packs.txt
document, since what happens when we bounce back and forth between old
and new versions of Git operating on the same NFS mounted repository
would not be obvious without such a discussion.

I'm still interested in the _downsides_ of using a repository format
extension.  "There is no reason" is not a downside, unless you mean
that it requires adding a line of code. :)  The main downside I can
imagine is that it prevents accessing the repository _that has enabled
this feature_ with an older version of Git, but I (perhaps due to a
failure of imagination) haven't put two and two together yet about
when I would want to do so.

[...]
> Expiring objects in piecemeal is somewhat interesting, but I think I was
> reaching a little too far when I said it was the "key benefit". It does
> have some nice properties, like being able to store cruft objects as
> deltas against other cruft objects which might get pruned at a different
> time (though, of course, you'll need to re-delta them in the case you do
> prune an object which is the base of another cruft object).
>
> But the issue with having multiple cruft packs is that the semantics get
> significantly more complicated. E.g., if you have an object represented
> in multiple cruft packs, which mtime do you use? If you want to prune
> it, you suddenly may have many packs you need to update and keep track
> of.

Thanks for this explanation.  In hash-function-transition.txt, I see

	"git gc" currently expels any unreachable objects it encounters in
	pack files to loose objects in an attempt to prevent a race when
	pruning them (in case another process is simultaneously writing a new
	object that refers to the about-to-be-deleted object). This leads to
	an explosion in the number of loose objects present and disk space
	usage due to the objects in delta form being replaced with independent
	loose objects.  Worse, the race is still present for loose objects.

	Instead, "git gc" will need to move unreachable objects to a new
	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
	below). To avoid the race when writing new objects referring to an
	about-to-be-deleted object, code paths that write new objects will
	need to copy any objects from UNREACHABLE_GARBAGE packs that they
	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
	indicated by the file's mtime) is long enough ago.

	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
	combined under certain circumstances. [etc]

So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
pack refers to when that pack was written and governs when that pack
can be deleted.  If an object is present in multiple packs, then newer
packs with the object have a newer mtime and thus cause the object to
be kept around for longer.
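
As a rough sketch of that rule (the function name is hypothetical and this
is not code from Git or JGit): if an object appears in several
UNREACHABLE_GARBAGE packs, the newest pack mtime is the one that decides
how long the object survives.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical model: every garbage pack containing the object
 * contributes its own file mtime. The object can only go away once the
 * newest such pack has aged out, so the newest mtime is what matters.
 */
time_t effective_mtime(const time_t *pack_mtimes, size_t nr)
{
	time_t newest = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (pack_mtimes[i] > newest)
			newest = pack_mtimes[i];
	return newest;	/* 0 means "object is in no garbage pack" */
}
```

With a single cruft pack and a per-object .mtimes entry, no such
aggregation across packs is needed.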

[...]
>> Sorry, I don't understand this answer either.  Do you mean to say that
>> JGit's DfsRepository does not in fact have a cruft packs like feature
>> that is live in the wild?  Or that that feature is equivalent to not
>> having such a feature?  Or something else?
>>
>> To be clear, I'm not trying to say that that's superior to what you've
>> proposed here --- only that documenting the comparison would be
>> useful.
>
> I'm not familiar enough with JGit (or its DfsRepository class) to know
> how to answer this. I was comparing cruft packs to the
> UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
> doc, and noting the differences there.

Thanks.  I think there's some implied feedback about the documentation
of UNREACHABLE_GARBAGE there, because, if I understand correctly, you're
saying that it does not describe maintaining cruft packs.  Perhaps a
pointer to the particular sentence that led you to that conclusion
would help.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-22 23:04             ` Jonathan Nieder
@ 2022-03-23  1:01               ` Taylor Blau
  2022-03-28 18:46                 ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-23  1:01 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Taylor Blau, git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 04:04:53PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
>
> >> Great.  Can the doc cover this?  I think it would be helpful to make
> >> that easy to find for others with similar questions.
> >
> > I believe the doc covers this already, see the paragraph beginning with
> > "Unreachable objects aren't removed immediately...".
>
> Thanks.  I just reread that section and it didn't say anything obvious
> about the race that continues to exist and whether cruft packs address
> it.

Yeah, there isn't an explicit "and cruft packs address the loose
object explosion but does not address the race" sentence. I'm not
opposed to adding something like that to clarify (though TBH, I would
rather do it as a clean-up on top rather than send out a bazillion
mostly-unchanged patches).

> [...]
> >> Sorry, the above seems like it's answering a different question than I
> >> asked.  The doc in Documentation/technical/ seems like a natural place
> >> to describe what semantics the new .mtimes file has, and I didn't find
> >> that there.  Is there a different piece of documentation I should have
> >> been looking at?
> >
> > Are you looking for a technical description of the mtimes file? If so,
> > there is a section in Documentation/technical/pack-format.txt (added in
> > "pack-mtimes: support reading .mtimes files") that explains this.
>
> I see --- is the idea that cruft-packs.txt means to refer to
> pack-format.txt for the details, and cruft-packs is an overview of some
> other, non-detail aspects?
>
> I just checked pack-format.txt and it didn't describe the semantics
> (what a Git implementation is expected to do when it sees an mtimes
> file).  For example, in Documentation/technical/cruft-packs.txt, the
> kind of thing I'd expect to see is

Right; I have always considered the files in Documentation/technical to
primarily be about the file format itself.

> - what does an mtime value in the mtimes file represent?  When is it
>   meant to be updated?
> - what guarantees are present about when an object is safe to be
>   pruned?

The cruft-packs.txt document covers these, though I think somewhat
implicitly. Again, I'm not opposed to more clarification, but I again
would like to do so on top.

I think many of these are discussed within the threads above, but to
answer your questions in order:

  - The mtime of an object in a cruft pack represents the last time that
    object was known to be reachable, and it's updated when generating a
    cruft pack or pruning.

  - The same guarantees are made in the cruft pack case as in the
    non-cruft case (i.e., "none", and so a grace period is recommended).
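
The two answers above can be sketched as a single predicate. This is only
an illustration under stated assumptions: the helper name is made up, it is
not a function in Git, and it deliberately says nothing about the race
discussed earlier in the thread.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper, not a function in Git. 'mtime' is the value
 * recorded for the object in the .mtimes file, 'grace' is the grace
 * period in seconds.
 */
bool cruft_object_is_prunable(time_t mtime, time_t now, time_t grace)
{
	time_t cutoff = now - grace;

	/* only objects strictly older than the cutoff are candidates */
	return mtime < cutoff;
}
```

The grace period is what stands in for a guarantee here: the longer it is,
the less likely a concurrent writer still depends on an object that passes
this check.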

> [...]
> >> Can you tell me a little more about why we would want _not_ to have a
> >> repository format extension?  To me, it seems like a fairly simple
> >> addition that would drastically reduce the cognitive overload for
> >> people considering making use of this feature.
> >
> > There is no reason to prevent a pre-cruft packs version of Git from
> > reading/writing a repository that uses cruft packs, since the two
> > versions will still function as normal. Since there's no need to prevent
> > the old version from interacting with a repository that has cruft packs,
> > we wouldn't want to enforce an unnecessary boundary with an extension.
>
> Does "function as normal" include repository maintenance operations
> like "git maintenance", "git gc", and "git prune"?  If so, this seems
> like something very useful to describe in the cruft-packs.txt
> document, since what happens when we bounce back and forth between old
> and new versions of Git operating on the same NFS mounted repository
> would not be obvious without such a discussion.

Yes, all of those commands will simply ignore the .mtimes file and treat
the unreachable objects as normal (where "normal" means in the exact
same way as they currently do without cruft packs). I think adding a
section that summarizes our discussion would be useful.

> I'm still interested in the _downsides_ of using a repository format
> extension.  "There is no reason" is not a downside, unless you mean
> that it requires adding a line of code. :)  The main downside I can
> imagine is that it prevents accessing the repository _that has enabled
> this feature_ with an older version of Git, but I (perhaps due to a
> failure of imagination) haven't put two and two together yet about
> when I would want to do so.

Sorry for not being clear; I meant: "There is no reason [to prohibit
two versions of Git from interacting with each other when they are
compatible to do so]".

> [...]
> > Expiring objects in piecemeal is somewhat interesting, but I think I was
> > reaching a little too far when I said it was the "key benefit". It does
> > have some nice properties, like being able to store cruft objects as
> > deltas against other cruft objects which might get pruned at a different
> > time (though, of course, you'll need to re-delta them in the case you do
> > prune an object which is the base of another cruft object).
> >
> > But the issue with having multiple cruft packs is that the semantics get
> > significantly more complicated. E.g., if you have an object represented
> > in multiple cruft packs, which mtime do you use? If you want to prune
> > it, you suddenly may have many packs you need to update and keep track
> > of.
>
> Thanks for this explanation.  In hash-function-transition.txt, I see
>
> 	"git gc" currently expels any unreachable objects it encounters in
> 	pack files to loose objects in an attempt to prevent a race when
> 	pruning them (in case another process is simultaneously writing a new
> 	object that refers to the about-to-be-deleted object). This leads to
> 	an explosion in the number of loose objects present and disk space
> 	usage due to the objects in delta form being replaced with independent
> 	loose objects.  Worse, the race is still present for loose objects.
>
> 	Instead, "git gc" will need to move unreachable objects to a new
> 	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
> 	below). To avoid the race when writing new objects referring to an
> 	about-to-be-deleted object, code paths that write new objects will
> 	need to copy any objects from UNREACHABLE_GARBAGE packs that they
> 	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
> 	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
> 	indicated by the file's mtime) is long enough ago.
>
> 	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
> 	combined under certain circumstances. [etc]
>
> So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
> pack refers to when that pack was written and governs when that pack
> can be deleted.  If an object is present in multiple packs, then newer
> packs with the object have a newer mtime and thus cause the object to
> be kept around for longer.

That matches my understanding.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-23  1:01               ` Taylor Blau
@ 2022-03-28 18:46                 ` Taylor Blau
  2022-03-28 20:55                   ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-28 18:46 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, tytso, derrickstolee, gitster, larsxschneider

On Tue, Mar 22, 2022 at 09:01:43PM -0400, Taylor Blau wrote:
> > >> Can you tell me a little more about why we would want _not_ to have a
> > >> repository format extension?  To me, it seems like a fairly simple
> > >> addition that would drastically reduce the cognitive overload for
> > >> people considering making use of this feature.
> > >
> > > There is no reason to prevent a pre-cruft packs version of Git from
> > > reading/writing a repository that uses cruft packs, since the two
> > > versions will still function as normal. Since there's no need to prevent
> > > the old version from interacting with a repository that has cruft packs,
> > > we wouldn't want to enforce an unnecessary boundary with an extension.
> >
> > Does "function as normal" include repository maintenance operations
> > like "git maintenance", "git gc", and "git prune"?  If so, this seems
> > like something very useful to describe in the cruft-packs.txt
> > document, since what happens when we bounce back and forth between old
> > and new versions of Git operating on the same NFS mounted repository
> > would not be obvious without such a discussion.
>
> Yes, all of those commands will simply ignore the .mtimes file and treat
> the unreachable objects as normal (where "normal" means in the exact
> same way as they currently do without cruft packs). I think adding a
> section that summarizes our discussion would be useful.
>
> > I'm still interested in the _downsides_ of using a repository format
> > extension.  "There is no reason" is not a downside, unless you mean
> > that it requires adding a line of code. :)  The main downside I can
> > imagine is that it prevents accessing the repository _that has enabled
> > this feature_ with an older version of Git, but I (perhaps due to a
> > failure of imagination) haven't put two and two together yet about
> > when I would want to do so.
>
> Sorry for not being clear; I meant: "There is no reason [to prohibit
> two versions of Git from interacting with each other when they are
> compatible to do so]".

Jonathan, I, and others discussed this extensively in today's
standup.

To summarize Jonathan's point (as I think I severely misunderstood it
before): if two writers are repacking a repository with unreachable
objects, the following can happen:

  - $NEWGIT packs the repository and writes a cruft pack and .mtimes
    file.

  - $OLDGIT packs the repository, exploding unreachable objects from the
    cruft pack as loose, setting their mtimes to "now".

This causes the repository to lose information about the unreachable
objects' mtimes, which would cause the repository to never prune them
(except for when `--unpack-unreachable=now` is passed).
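
A toy model of that loop, with hypothetical names and a fixed schedule
chosen purely for illustration (real repack timing and Git's actual pruning
code are more involved):

```c
#include <stdbool.h>

/*
 * Toy model with hypothetical names. $NEWGIT repacks every 'interval'
 * seconds and prunes the object once its recorded mtime is older than
 * 'grace'. If 'old_git_interleaved' is set, $OLDGIT repacks halfway
 * between two $NEWGIT runs and resets the mtime to its own run time.
 */
bool object_gets_pruned(long interval, long grace, int rounds,
			bool old_git_interleaved)
{
	long mtime = 0;	/* when the object became unreachable */
	long now = 0;
	int i;

	for (i = 0; i < rounds; i++) {
		if (old_git_interleaved)
			mtime = now + interval / 2;	/* exploded loose, mtime = "now" */
		now += interval;			/* next $NEWGIT repack */
		if (mtime < now - grace)
			return true;
	}
	return false;
}
```

In this model, with daily repacks and a two-week grace period, the object
is pruned after about two weeks when only $NEWGIT runs, and never when
$OLDGIT keeps running in between.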

One approach (that Jonathan suggested) is to prevent the above situation
by introducing a format extension, so $OLDGIT could not touch the
repository. But this comes at a (in my view, significant) cost which is
that $OLDGIT can't touch the repository _at all_. An extension would be
desirable if cross-version interaction resulted in repository
corruption, but this scenario does not lead to corruption at all.

Another approach (courtesy Stolee, in an off-list discussion) is that we
could introduce an optional extension available as an opt-in to prevent
older versions of Git from interacting in a repository that contains
cruft packs, but is not required to write them.

A third approach (and probably my preferred direction) is to indicate
clearly via a combination of updates to
Documentation/technical/cruft-packs.txt and the release notes that say
something along the lines of:

    If you are repacking a repository using both a pre- and
    post-cruft packs version of Git, please be aware that you will lose
    information about the mtimes of unreachable objects.

I imagine that would probably be sufficient, but we could also introduce
the opt-in extension as an easy alternative to avoid forcing an upgrade
of Git.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 18:46                 ` Taylor Blau
@ 2022-03-28 20:55                   ` Junio C Hamano
  2022-03-28 21:21                     ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-03-28 20:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> To summarize Jonathan's point (as I think I severely misunderstood it
> before): if two writers are repacking a repository with unreachable
> objects, the following can happen:
>
>   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
>     file.
>
>   - $OLDGIT packs the repository, exploding unreachable objects from the
>     cruft pack as loose, setting their mtimes to "now".

And if these repeat, alternating new and old versions of Git, we
will keep refreshing the unreachable objects' mtimes forever.

But once you stop using old versions of Git, perhaps in 3 release
cycles or so, we'll eventually be able to purge them, right?

> One approach (that Jonathan suggested) is to prevent the above situation
> by introducing a format extension, so $OLDGIT could not touch the
> repository. But this comes at a (in my view, significant) cost which is
> that $OLDGIT can't touch the repository _at all_. An extension would be
> desirable if cross-version interaction resulted in repository
> corruption, but this scenario does not lead to corruption at all.

A repository may not be in a healthy state, when tons of unreachable
objects stay around forever, but it probably is a bit too harsh to
call it "corrupt".

> Another approach (courtesy Stolee, in an off-list discussion) is that we
> could introduce an optional extension available as an opt-in to prevent
> older versions of Git from interacting in a repository that contains
> cruft packs, but is not required to write them.

That smells too magic; let's not go there.

> A third approach (and probably my preferred direction) is to indicate
> clearly via a combination of updates to
> Documentation/technical/cruft-packs.txt and the release notes that say
> something along the lines of:
>
>     If you are repacking a repository using both a pre- and
>     post-cruft packs version of Git, please be aware that you will lose
>     information about the mtimes of unreachable objects.

I do not quite see how it helps.  After hearing "... will lose
information about the mtimes ...", what concrete action can a user
take?  Or a sys-admin?

It's not like use of cruft-pack is mandatory when you upgrade the
new version of Git, right?  Perhaps use of cruft-pack should be
guarded behind a configuration variable so that users who might want
to use mixed versions of Git will be protected against accidental
use of new version of Git that introduces the forever-renewing
unreachable objects problem?

Perhaps a configuration variable, repack.cruftPackEnabled, that is
by default disabled, can be used to protect people who do not want
to get into the "keep refreshing mtime" loop from using the cruft
packs by mistake?  repack.cruftPackEnabled can probably be part of
the "experimental" feature set, if we think it is the direction in
the future.





^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 20:55                   ` Junio C Hamano
@ 2022-03-28 21:21                     ` Taylor Blau
  2022-03-29 15:59                       ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-28 21:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee,
	larsxschneider

On Mon, Mar 28, 2022 at 01:55:43PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > To summarize Jonathan's point (as I think I severely misunderstood it
> > before): if two writers are repacking a repository with unreachable
> > objects, the following can happen:
> >
> >   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
> >     file.
> >
> >   - $OLDGIT packs the repository, exploding unreachable objects from the
> >     cruft pack as loose, setting their mtimes to "now".
>
> And if these repeat, alternating new and old versions of Git, we
> will keep refreshing the unreachable objects' mtimes forever.
>
> But once you stop using old versions of Git, perhaps in 3 release
> cycles or so, we'll eventually be able to purge them, right?

As soon as all of the repackers understand cruft packs, yes.

> > One approach (that Jonathan suggested) is to prevent the above situation
> > by introducing a format extension, so $OLDGIT could not touch the
> > repository. But this comes at a (in my view, significant) cost which is
> > that $OLDGIT can't touch the repository _at all_. An extension would be
> > desirable if cross-version interaction resulted in repository
> > corruption, but this scenario does not lead to corruption at all.
>
> A repository may not be in a healthy state, when tons of unreachable
> objects stay around forever, but it probably is a bit too harsh to
> call it "corrupt".

I agree, though I would note that this is no worse than the situation
today, where unreachable-but-recent objects are already exploded as
loose and can already cause the kinds of issues that this series is
designed to prevent.

> > Another approach (courtesy Stolee, in an off-list discussion) is that we
> > could introduce an optional extension available as an opt-in to prevent
> > older versions of Git from interacting in a repository that contains
> > cruft packs, but is not required to write them.
>
> That smells too magic; let's not go there.

I'm not sure... if we did:

--- 8< ---

diff --git a/setup.c b/setup.c
index 04ce33cdcd..fa54c9baa4 100644
--- a/setup.c
+++ b/setup.c
@@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "cruftpacks")) {
+		return EXTENSION_OK;
 	}

--- >8 ---

but nothing more, then a hypothetical `extensions.cruftPacks` could be
used to prevent older writers in a mixed version environment. But if you
don't have or care about older versions of Git, you can avoid setting it
altogether.

The key bit is that we don't have a check along the lines of "only allow
writing a cruft pack when extensions.cruftPacks is set", so it's opt-in
as far as the new code is concerned.
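
As a rough model of why that would lock out older writers: a version of
Git that does not recognize an `extensions.*` key refuses to operate on
a repository whose format version is 1 or higher. The sketch below is a
simplification under that assumption, not the logic in setup.c; the
can_operate() helper and the extension sets are illustrative only.

```python
# Illustrative sets of extension names each hypothetical version knows.
OLD_GIT_EXTENSIONS = {"noop", "preciousobjects", "partialclone", "worktreeconfig"}
NEW_GIT_EXTENSIONS = OLD_GIT_EXTENSIONS | {"cruftpacks"}


def can_operate(known_extensions, repo_config):
    """Return True if a Git that knows `known_extensions` would agree to
    work in a repository with the given (lowercased) config keys."""
    version = int(repo_config.get("core.repositoryformatversion", "0"))
    if version < 1:
        # Simplification: unknown extensions are not enforced for v0.
        return True
    prefix = "extensions."
    for key in repo_config:
        if key.startswith(prefix) and key[len(prefix):] not in known_extensions:
            return False
    return True


opted_in = {
    "core.repositoryformatversion": "1",
    "extensions.cruftpacks": "true",
}
```

With `opted_in`, can_operate(NEW_GIT_EXTENSIONS, opted_in) holds while
can_operate(OLD_GIT_EXTENSIONS, opted_in) does not; a repository that
never sets the extension stays usable by both.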

> > A third approach (and probably my preferred direction) is to indicate
> > clearly via a combination of updates to Documentation/cruft-packs.txt
> > and the release notes that say something along the lines of:
> >
> >     If you are repacking a repository using both a pre- and
> >     post-cruft packs version of Git, please be aware that you will lose
> >     information about the mtimes of unreachable objects.
>
> I do not quite see how it helps.  After hearing "... will lose
> information about the mtimes ...", what concrete action can a user
> take?  Or a sys-admin?
>
> It's not like use of cruft-pack is mandatory when you upgrade the
> new version of Git, right?  Perhaps use of cruft-pack should be
> guarded behind a configuration variable so that users who might want
> to use mixed versions of Git will be protected against accidental
> use of new version of Git that introduces the forever-renewing
> untracked objects problem?

I don't think we would have much to offer a user in that case; if the
mtimes are gone, then I couldn't think of anything to bring them back
outside of setting them manually.

But cruft packs are already guarded in two places:

  - `git repack` won't write a cruft pack unless given the `--cruft`
    flag (i.e., `git repack -A` doesn't suddenly start generating cruft
    packs upon upgrade).

  - `git gc` won't write cruft packs unless the `gc.cruftPacks`
    configuration is set, or `--cruft` is given as a flag.

I'd be curious what Jonathan and others think of that approach (which,
to be clear, is what this series already implements). We could make it
clear to say:

    If you have mixed versions of Git which both repack a repository
    (either manually or by auto-GC / background maintenance), consider
    leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
    command-line argument to `git repack` and `git gc`, since doing so
    can lead to [...]
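
The two guards can be summarized in a few lines. This is a deliberately
simplified model of the behavior described above, not the option parsing
in builtin/repack.c or builtin/gc.c; writes_cruft_pack() and the
lowercased config key lookup are assumptions for illustration.

```python
def writes_cruft_pack(command, argv, config):
    """Model of the opt-in guards: `git repack` needs --cruft, and
    `git gc` needs either --cruft or gc.cruftPacks set to true."""
    if command == "repack":
        return "--cruft" in argv
    if command == "gc":
        return "--cruft" in argv or config.get("gc.cruftpacks") == "true"
    return False
```

So an upgraded Git running `repack -A -d` or a plain `gc` with default
configuration behaves as before, and only an explicit opt-in produces a
cruft pack.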

> Perhaps a configuration variable, repack.cruftPackEnabled, that is
> by default disabled, can be used to protect people who do not want
> to get into the "keep refreshing mtime" loop from using the cruft
> packs by mistake?  repack.cruftPackEnabled can probably be part of
> the "experimental" feature set, if we think it is the direction in
> the future.

I'd probably want to leave `-A` separate from `--cruft`, since something
about setting `repack.cruftPackEnabled` having the effect of causing
`-A` to produce a cruft pack feels strange to me.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-28 21:21                     ` Taylor Blau
@ 2022-03-29 15:59                       ` Junio C Hamano
  2022-03-30  2:23                         ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-03-29 15:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> I'm not sure... if we did:
>
> --- 8< ---
>
> diff --git a/setup.c b/setup.c
> index 04ce33cdcd..fa54c9baa4 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
>  		return EXTENSION_OK;
> +	} else if (!strcmp(ext, "cruftpacks")) {
> +		return EXTENSION_OK;
>  	}
>
> --- >8 ---
>
> but nothing more, then a hypothetical `extensions.cruftPacks` could be
> used to prevent older writers in a mixed version environment. But if you
> don't have or care about older versions of Git, you can avoid setting it
> altogether.

Smells like "unsafe by default, but you can opt into safety", which
is backwards, isn't it?

>> I do not quite see how it helps.  After hearing "... will lose
>> information about the mtimes ...", what concrete action can a user
>> take?  Or a sys-admin?
>>
>> It's not like use of cruft-pack is mandatory when you upgrade the
>> new version of Git, right?  Perhaps use of cruft-pack should be
>> guarded behind a configuration variable so that users who might want
>> to use mixed versions of Git will be protected against accidental
>> use of new version of Git that introduces the forever-renewing
>> untracked objects problem?
>
> I don't think we would have much to offer a user in that case; if the
> mtimes are gone, then I couldn't think of anything to bring them back
> outside of setting them manually.

Yes, so rambling about losing mtimes in documentation or release
notes would not help users all that much.  Let's not do that.

> But cruft packs are already guarded in two places:
>
>   - `git repack` won't write a cruft pack unless given the `--cruft`
>     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
>     packs upon upgrade).
>
>   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
>     configuration is set, or `--cruft` is given as a flag.

Hmph, OK.  So individuals can sort-of protect from hurting
themselves by refraining from running these with --cruft or writing
--cruft in their maintenance scripts.  An organization that wants to
let the more adventurous types opt in early can prepare two
versions of the maintenance scripts they distribute to their users,
one with and the other without --cruft, and use the mechanism they
use for gradual rollouts to control the population.  Perhaps that
would make sufficient protection?  I dunno.

Jonathan, what do you think?

> I'd be curious what Jonathan and others think of that approach (which,
> to be clear, is what this series already implements). We could make it
> clear to say:
>
>     If you have mixed versions of Git which both repack a repository
>     (either manually or by auto-GC / background maintenance), consider
>     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
>     command-line argument to `git repack` and `git gc`, since doing so
>     can lead to [...]

That message is (depending on what comes in [...]) much more helpful
than just throwing a word "mtime" out and letting the reader figure
out the rest ;-)


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-29 15:59                       ` Junio C Hamano
@ 2022-03-30  2:23                         ` Taylor Blau
  2022-03-30 13:37                           ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-03-30  2:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee,
	larsxschneider

On Tue, Mar 29, 2022 at 08:59:24AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I'm not sure... if we did:
> >
> > --- 8< ---
> >
> > diff --git a/setup.c b/setup.c
> > index 04ce33cdcd..fa54c9baa4 100644
> > --- a/setup.c
> > +++ b/setup.c
> > @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
> >  		return EXTENSION_OK;
> > +	} else if (!strcmp(ext, "cruftpacks")) {
> > +		return EXTENSION_OK;
> >  	}
> >
> > --- >8 ---
> >
> > but nothing more, then a hypothetical `extensions.cruftPacks` could be
> > used to prevent older writers in a mixed version environment. But if you
> > don't have or care about older versions of Git, you can avoid setting it
> > altogether.
>
> Smells like "unsafe by default, but you can opt into safety", which
> is backwards, isn't it?

I see it a little differently. The default (not writing cruft packs at
all) is safe, even in a mixed-version environment. If a user (a) wants
to use cruft packs, and (b) has older versions of Git also gc'ing the
repository, and (c) can't get rid of them, _then_ an opt-in extension
would make it impossible for those older versions to interact with the
repository.

I still can't shake the feeling that this is a pretty fringe and
timing-dependent scenario, which at worst keeps too many unreachable
objects around.

But I think this in conjunction with the already opt-in nature of cruft
packs would be a nice way to create safeguards for the situation
Jonathan described. There may be a simpler way, but I'm not sure I see
it (i.e., if you control whether or not `--cruft` is passed when
doing maintenance with newer versions of Git, but not whether older
versions are running around doing their own maintenance, then an
extension would be necessary to lock the old versions out).

> > But cruft packs are already guarded in two places:
> >
> >   - `git repack` won't write a cruft pack unless given the `--cruft`
> >     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
> >     packs upon upgrade).
> >
> >   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
> >     configuration is set, or `--cruft` is given as a flag.
>
> Hmph, OK.  So individuals can sort-of protect from hurting
> themselves by refraining from running these with --cruft or writing
> --cruft in their maintenance scripts.  An organization that wants to
> let the more adventurous types opt in early can prepare two
> versions of the maintenance scripts they distribute to their users,
> one with and the other without --cruft, and use the mechanism they
> use for gradual rollouts to control the population.  Perhaps that
> would make sufficient protection?  I dunno.
>
> Jonathan, what do you think?

I'm confused: if newer versions of Git are writing cruft packs, then
having the older versions gc'ing in the same repository runs into the
same scenario Jonathan originally describes.

The thing I think Jonathan seeks to prevent is older versions of Git
gc'ing a repo that has cruft packs. I think I may need you to clarify a
little, sorry :-(.

> > I'd be curious what Jonathan and others think of that approach (which,
> > to be clear, is what this series already implements). We could make it
> > clear to say:
> >
> >     If you have mixed versions of Git which both repack a repository
> >     (either manually or by auto-GC / background maintenance), consider
> >     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
> >     command-line argument to `git repack` and `git gc`, since doing so
> >     can lead to [...]
>
> That message is (depending on what comes in [...]) much more helpful
> than just throwing a word "mtime" out and letting the reader figure
> out the rest ;-)

Yes, totally agreed.

Thanks,
Taylor


* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-30  2:23                         ` Taylor Blau
@ 2022-03-30 13:37                           ` Junio C Hamano
  2022-03-30 17:30                             ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-03-30 13:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jonathan Nieder, git, tytso, derrickstolee, larsxschneider

Taylor Blau <me@ttaylorr.com> writes:

> The thing I think Jonathan seeks to prevent is older versions of Git
> gc'ing a repo that has cruft packs. I think I may need you to clarify a
> little, sorry :-(.

By making controlled rollout of the use of "--cruft" option (and the
assumption here is that, in a large organization setting, people do not
manually say "gc --cruft", and they can ship their maintenance
scripts that may be run via cron or whatever with and without
"--cruft"), you can control the number of repositories with cruft
packs that can potentially see older versions of Git running gc on
them.  Those users, for whom it is not their turn to start using
"--cruft" enabled version of the script, will not have cruft packs,
so it does not matter if they keep an older version of Git somewhere
hidden in a hermetic build of an IDE that bundles Git and gc kicks
in for them.



* Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt
  2022-03-30 13:37                           ` Junio C Hamano
@ 2022-03-30 17:30                             ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-03-30 17:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Jonathan Nieder, git, tytso, derrickstolee,
	larsxschneider

On Wed, Mar 30, 2022 at 06:37:54AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > The thing I think Jonathan seeks to prevent is older versions of Git
> > gc'ing a repo that has cruft packs. I think I may need you to clarify a
> > little, sorry :-(.
>
> By making controlled rollout of the use of "--cruft" option (and the
> assumption here is that, in a large organization setting, people do not
> manually say "gc --cruft", and they can ship their maintenance
> scripts that may be run via cron or whatever with and without
> "--cruft"), you can control the number of repositories with cruft
> packs that can potentially see older versions of Git running gc on
> them.  Those users, for whom it is not their turn to start using
> "--cruft" enabled version of the script, will not have cruft packs,
> so it does not matter if they keep an older version of Git somewhere
> hidden in a hermetic build of an IDE that bundles Git and gc kicks
> in for them.

Ahh, OK. Thanks for explaining: this is what I was pretty sure you
meant, but I wanted to make sure before agreeing to it.

Yes, this solution amounts to: "if you have mixed versions of Git
mutually gc'ing a repository, then use the same rollout method used for
controlling Git itself to guard when to start creating cruft packs".

I would be very eager to hear if this works for Jonathan's case. It
should do the trick, I'd think.

Thanks,
Taylor


* [PATCH v4 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (19 preceding siblings ...)
  2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
@ 2022-05-18 23:10 ` Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (19 more replies)
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  21 siblings, 20 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Here is another reroll of my series to implement "cruft packs", which is based
on the v2.36 tree, and incorporates feedback from the discussion we had about
mixed-version GCs with cruft packs in [1].

The changes here are limited to:

  - a cautionary note in Documentation/technical/cruft-packs.txt
    describing the potential interaction between pruning GCs across pre-
    and post-cruft pack versions of Git, as discussed towards the bottom
    of [2]

  - updating the `finalize_hashfile()` calls for writing `.mtimes` files
    to indicate that they are `FSYNC_COMPONENT_PACK_METADATA`, since the
    original version of this series predates the fine-grained fsync
    configuration in 2.36.

As always, a range-diff is below. Thanks in advance for taking another
look!

[1]: https://lore.kernel.org/git/YiZI99yeijQe5Jaq@google.com/
[2]: https://lore.kernel.org/git/YkIm7lnQsUT0JnvS@nand.local/

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt | 123 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 181 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1831 insertions(+), 101 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v3:
 1:  784ee7e0ee !  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
    @@ Documentation/technical/cruft-packs.txt (new)
     +It is linkgit:git-gc[1] that is typically responsible for removing expired
     +unreachable objects.
     +
    ++== Caution for mixed-version environments
    ++
    ++Repositories that have cruft packs in them will continue to work with any older
    ++version of Git. Note, however, that previous versions of Git which do not
    ++understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
    ++all of the objects in it. In other words, do not expect older (pre-cruft pack)
    ++versions of Git to interpret or even read the contents of the `.mtimes` file.
    ++
    ++Note that having mixed versions of Git GC-ing the same repository can lead to
    ++unreachable objects never being completely pruned. This can happen under the
    ++following circumstances:
    ++
    ++  - An older version of Git running GC explodes the contents of an existing
    ++    cruft pack loose, using the cruft pack's mtime.
    ++  - A newer version running GC collects those loose objects into a cruft pack,
    ++    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
    ++    cruft pack mtime is "now".
    ++
    ++Repeating this process will lead to unreachable objects not getting pruned as a
    ++result of repeatedly resetting the objects' mtimes to the present time.
    ++
    ++If you are GC-ing repositories in a mixed version environment, consider omitting
    ++the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
    ++leaving the `gc.cruftPacks` configuration unset until all writers understand
    ++cruft packs.
    ++
     +== Alternatives
     +
     +Notable alternatives to this design include:
 2:  1ec754ad1b =  2:  8f9fd21be9 pack-mtimes: support reading .mtimes files
 3:  0f5d6d6492 =  3:  cdb21236e1 pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 4:  135a07276b =  4:  1d775f9850 chunk-format.h: extract oid_version()
 5:  0600503856 !  5:  6172861bd9 pack-mtimes: support writing pack .mtimes files
    @@ pack-write.c: const char *write_rev_file_order(const char *rev_name,
     +	if (adjust_shared_perm(mtimes_name) < 0)
     +		die(_("failed to make %s readable"), mtimes_name);
     +
    -+	finalize_hashfile(f, NULL,
    ++	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
     +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
     +
     +	return mtimes_name;
 6:  4780c8437b =  6:  5f9a9a5b7b t/helper: add 'pack-mtimes' test-tool
 7:  33862a07c9 =  7:  b8a38fe2e4 builtin/pack-objects.c: return from create_object_entry()
 8:  22705e4887 !  8:  94fe03cc65 builtin/pack-objects.c: --cruft without expiration
    @@ builtin/pack-objects.c: static int option_parse_unpack_unreachable(const struct
     +	return 0;
     +}
     +
    - int cmd_pack_objects(int argc, const char **argv, const char *prefix)
    - {
    - 	int use_internal_rev_list = 0;
    + struct po_filter_data {
    + 	unsigned have_revs:1;
    + 	struct rev_info revs;
     @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const char *prefix)
      		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
      		  N_("unpack unreachable objects newer than <time>"),
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
     +		read_cruft_objects();
      	} else if (!use_internal_rev_list) {
      		read_object_list_from_stdin();
    - 	} else {
    + 	} else if (pfd.have_revs) {
     
      ## object-file.c ##
     @@ object-file.c: int has_loose_object_nonlocal(const struct object_id *oid)
    @@ object-store.h: int repo_has_object_file_with_flags(struct repository *r,
      
     +int has_loose_object(const struct object_id *);
     +
    - void assert_oid_type(const struct object_id *oid, enum object_type expect);
    - 
    - /*
    + /**
    +  * format_object_header() is a thin wrapper around s xsnprintf() that
    +  * writes the initial "<type> <obj-len>" part of the loose object
     
      ## t/t5329-pack-objects-cruft.sh (new) ##
     @@
 9:  cebb30b667 !  9:  da7273f41f reachable: add options to add_unseen_recent_objects_to_traversal
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## builtin/pack-objects.c ##
    -@@ builtin/pack-objects.c: static void get_object_list(int ac, const char **av)
    +@@ builtin/pack-objects.c: static void get_object_list(struct rev_info *revs, int ac, const char **av)
      	if (unpack_unreachable_expiration) {
    - 		revs.ignore_missing_links = 1;
    - 		if (add_unseen_recent_objects_to_traversal(&revs,
    + 		revs->ignore_missing_links = 1;
    + 		if (add_unseen_recent_objects_to_traversal(revs,
     -				unpack_unreachable_expiration))
     +				unpack_unreachable_expiration, NULL, 0))
      			die(_("unable to add recent objects"));
    - 		if (prepare_revision_walk(&revs))
    + 		if (prepare_revision_walk(revs))
      			die(_("revision walk setup failed"));
     
      ## reachable.c ##
10:  fa4de8859d = 10:  58fecd1747 reachable: report precise timestamps from objects in cruft packs
11:  92318f8700 = 11:  1740b8ef01 builtin/pack-objects.c: --cruft with expiration
12:  1e94b33cb4 ! 12:  5992a72cbf builtin/repack.c: support generating a cruft pack
    @@ builtin/repack.c
      static int pack_kept_objects = -1;
      static int write_bitmaps = -1;
      static int use_delta_islands;
    + static int run_update_server_info = 1;
      static char *packdir, *packtmp_name, *packtmp;
     +static char *cruft_expiration;
      
      static const char *const git_repack_usage[] = {
      	N_("git repack [<options>]"),
    -@@ builtin/repack.c: static int repack_config(const char *var, const char *value, void *cb)
    - 		use_delta_islands = git_config_bool(var, value);
    - 		return 0;
    - 	}
    -+
    - 	return git_default_config(var, value, cb);
    - }
    - 
     @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_args *args,
      		die(_("could not finish pack-objects to repack promisor objects"));
      }
13:  9cfcd123bd ! 13:  1b241f8f91 builtin/repack.c: allow configuring cruft pack generation
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## Documentation/config/repack.txt ##
    -@@ Documentation/config/repack.txt: repack.writeBitmaps::
    - 	space and extra time spent on the initial repack.  This has
    - 	no effect if multiple packfiles are created.
    - 	Defaults to true on bare repos, false otherwise.
    +@@ Documentation/config/repack.txt: repack.updateServerInfo::
    + 	If set to false, linkgit:git-repack[1] will not run
    + 	linkgit:git-update-server-info[1]. Defaults to true. Can be overridden
    + 	when true by the `-n` option of linkgit:git-repack[1].
     +
     +repack.cruftWindow::
     +repack.cruftWindowMemory::
    @@ builtin/repack.c: static const char incremental_bitmap_conflict_error[] = N_(
      		delta_base_offset = git_config_bool(var, value);
      		return 0;
     @@ builtin/repack.c: static int repack_config(const char *var, const char *value, void *cb)
    + 		run_update_server_info = git_config_bool(var, value);
      		return 0;
      	}
    - 
     +	if (!strcmp(var, "repack.cruftwindow"))
     +		return git_config_string(&cruft_po_args->window, var, value);
     +	if (!strcmp(var, "repack.cruftwindowmemory"))
    @@ builtin/repack.c: static int repack_config(const char *var, const char *value, v
     +		return git_config_string(&cruft_po_args->depth, var, value);
     +	if (!strcmp(var, "repack.cruftthreads"))
     +		return git_config_string(&cruft_po_args->threads, var, value);
    -+
      	return git_default_config(var, value, cb);
      }
      
    @@ builtin/repack.c: static void remove_redundant_pack(const char *dir_name, const
      				 const struct pack_objects_args *args)
      {
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    + 	int keep_unreachable = 0;
      	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
    - 	int no_update_server_info = 0;
      	struct pack_objects_args po_args = {NULL};
     +	struct pack_objects_args cruft_po_args = {NULL};
      	int geometric_factor = 0;
14:  1a58807df0 = 14:  ffae78852c builtin/repack.c: use named flags for existing_packs
15:  ed05cf536b = 15:  0743e373ba builtin/repack.c: add cruft packs to MIDX during geometric repack
16:  1d5f334138 = 16:  9f7e0acac6 builtin/gc.c: conditionally avoid pruning objects via loose
17:  f74b425872 = 17:  07fa9d4b47 sha1-file.c: don't freshen cruft packs
-- 
2.36.1.94.gb0d54bedca


* [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-19 14:04     ` Junio C Hamano
  2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (18 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |   1 +
 Documentation/technical/cruft-packs.txt | 123 ++++++++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index adb2f1b50a..2faffb52ab 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -94,6 +94,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..c0f583cd48
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,123 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, via
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking as a traversal tip any object (a) that is
+     not contained in a kept-pack, and (b) whose mtime is within the grace
+     period.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Caution for mixed-version environments
+
+Repositories that have cruft packs in them will continue to work with any older
+version of Git. Note, however, that previous versions of Git which do not
+understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
+all of the objects in it. In other words, do not expect older (pre-cruft pack)
+versions of Git to interpret or even read the contents of the `.mtimes` file.
+
+Note that having mixed versions of Git GC-ing the same repository can lead to
+unreachable objects never being completely pruned. This can happen under the
+following circumstances:
+
+  - An older version of Git running GC explodes the contents of an existing
+    cruft pack loose, using the cruft pack's mtime.
+  - A newer version running GC collects those loose objects into a cruft pack,
+    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
+    cruft pack mtime is "now".
+
+Repeating this process will lead to unreachable objects not getting pruned as a
+result of repeatedly resetting the objects' mtimes to the present time.
+
+If you are GC-ing repositories in a mixed version environment, consider omitting
+the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
+leaving the `gc.cruftPacks` configuration unset until all writers understand
+cruft packs.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.36.1.94.gb0d54bedca
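
The mixed-version caution in cruft-packs.txt comes down to one mtime per
object versus one mtime for the whole pack. The sketch below is only an
illustration of that difference, not code from this series: the function
names are made up, and treating "strictly older than the cutoff" as expired
is an assumption about the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the objects that have aged out, given one mtime per object (as a
 * .mtimes table provides) and a cutoff timestamp in epoch seconds.
 */
size_t sketch_count_expired(const uint32_t *mtimes, size_t nr, uint32_t cutoff)
{
	size_t i, expired = 0;

	assert(mtimes || !nr);
	for (i = 0; i < nr; i++)
		if (mtimes[i] < cutoff)
			expired++;
	return expired;
}

/*
 * What a Git that ignores .mtimes effectively sees: every object inherits
 * the pack's own mtime, so either all of them expire or none do.
 */
size_t sketch_count_expired_legacy(size_t nr, uint32_t pack_mtime,
				   uint32_t cutoff)
{
	return pack_mtime < cutoff ? nr : 0;
}
```

With a freshly written cruft pack the second function always returns 0,
which is the "never leaves the expiration window" behavior the document
warns about.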


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-19 10:40     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:10   ` [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (17 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
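
For readers who want to poke at the on-disk layout outside of Git, here is
a rough standalone sketch of the same checks the reader below performs,
working on an in-memory image instead of an mmap. Everything prefixed
"sketch_" or "SKETCH_" is hypothetical and not part of this patch; the
trailer checksums are only accounted for by size, not verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_MTIMES_HEADER_SIZE 12u       /* signature, version, hash id */

/* Read a 4-byte big-endian (network order) integer. */
uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Check the size and header of an in-memory .mtimes image. "hash_len" is
 * the raw hash size (20 for SHA-1, 32 for SHA-256); the trailer holds two
 * such hashes. Returns 0 if the image looks plausible, -1 otherwise.
 */
int sketch_mtimes_check(const unsigned char *buf, size_t len,
			uint32_t num_objects, size_t hash_len)
{
	size_t min_size = SKETCH_MTIMES_HEADER_SIZE + 2 * hash_len;

	if (len < min_size)
		return -1; /* too small for header plus trailer */
	if ((len - min_size) % 4 || (len - min_size) / 4 != num_objects)
		return -1; /* table must hold exactly one entry per object */
	if (sketch_get_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (sketch_get_be32(buf + 4) != 1)
		return -1; /* only version 1 is defined */
	if (sketch_get_be32(buf + 8) != 1 && sketch_get_be32(buf + 8) != 2)
		return -1; /* 1 = SHA-1, 2 = SHA-256 */
	return 0;
}

/* Return the mtime of the object at index position "pos". */
uint32_t sketch_mtimes_nth(const unsigned char *buf, uint32_t pos)
{
	assert(buf);
	return sketch_get_be32(buf + SKETCH_MTIMES_HEADER_SIZE +
			       4 * (size_t)pos);
}
```

The caller is expected to run the check before asking for any entry, in the
same way that nth_packed_mtime() relies on load_pack_mtimes() having
succeeded.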

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..c443dbb526 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers in network order. The ith
+    value is the modification time (mtime) of the ith object in the
+    corresponding pack by lexicographic (index) order. The mtimes
+    count standard epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 61aadf3ce8..a299580b7c 100644
--- a/Makefile
+++ b/Makefile
@@ -993,6 +993,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index d1a563d5b6..e7a3920c6d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -217,6 +217,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 53996018c1..2c4671ed7a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..46ad584af1
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,126 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		/* nothing was opened or mapped, so skip close() in cleanup */
+		return -1;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-18 23:10   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (16 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:10 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)
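
To illustrate why the mtimes live in an array on the packing data rather
than on each entry, here is a small standalone sketch of that "side array"
pattern. The struct and function names are invented for the example and
are not the ones used in Git; the point is only that callers which never
set an mtime never pay for the array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_entry {
	int id; /* stand-in for the real per-object fields */
};

struct sketch_packing_data {
	struct sketch_entry *objects; /* array of nr_alloc entries */
	size_t nr_alloc;
	uint32_t *cruft_mtime;        /* NULL until the first mtime is set */
};

/* Look up an entry's mtime; 0 if no side array was ever allocated. */
uint32_t sketch_cruft_mtime(const struct sketch_packing_data *pd,
			    const struct sketch_entry *e)
{
	if (!pd->cruft_mtime)
		return 0;
	return pd->cruft_mtime[e - pd->objects];
}

/* Record an entry's mtime, allocating the zeroed side array on first use. */
void sketch_set_cruft_mtime(struct sketch_packing_data *pd,
			    const struct sketch_entry *e, uint32_t mtime)
{
	if (!pd->cruft_mtime) {
		pd->cruft_mtime = calloc(pd->nr_alloc,
					 sizeof(*pd->cruft_mtime));
		assert(pd->cruft_mtime); /* sketch: allocation failure is fatal */
	}
	pd->cruft_mtime[e - pd->objects] = mtime;
}
```

The entry's position in the objects array doubles as its index into the
side array, so the entry struct itself does not grow.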

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 014dcd4bc9..6ac927047c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1262,7 +1262,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c..e988a388b6 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index 51812cb129..a2adc565f4 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -484,6 +484,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 04/17] chunk-format.h: extract oid_version()
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-05-18 23:10   ` [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 11:44     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:11   ` [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (15 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)
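
On the note about widths: returning a single byte and letting the caller
widen it is safe because the values 1 and 2 fit in either type. A minimal
standalone sketch of that, with a made-up enum and helper names standing
in for Git's hash-algorithm lookup and hashwrite_be32():

```c
#include <assert.h>
#include <stdint.h>

enum sketch_hash { SKETCH_SHA1, SKETCH_SHA256 };

/* Map a hash algorithm to its on-disk identifier; 0 means "unknown". */
uint8_t sketch_oid_version(enum sketch_hash h)
{
	switch (h) {
	case SKETCH_SHA1:
		return 1;
	case SKETCH_SHA256:
		return 2;
	}
	return 0;
}

/* Store a 32-bit value in network (big-endian) byte order. */
void sketch_put_be32(unsigned char *out, uint32_t v)
{
	assert(out);
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}
```

Passing the uint8_t result straight to the 32-bit writer promotes it
implicitly, which is what the .rev header caller relies on after this
change.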

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 06107beedc..066d82ed6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1924,7 +1912,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 3db0e47735..c617c51cd0 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index a2adc565f4..27b171e440 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (14 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing a `struct packing_data` that holds
the timestamps corresponding to each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)
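
As a rough picture of what ends up on disk, the sketch below serializes
the header and the mtime table into a caller-supplied buffer. It is a
simplification under stated assumptions: the names are hypothetical, the
index order is passed in as an explicit permutation, and the trailer
(pack checksum plus file checksum) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_store_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Write a .mtimes header and table into "out", which must have room for
 * 12 + 4 * nr bytes. "mtime_by_slot" is indexed by each object's slot in
 * the packing list; "slot_in_index_order[i]" names the slot of the i-th
 * object in lexicographic (index) order. Returns the bytes written.
 */
size_t sketch_write_mtimes(unsigned char *out, uint32_t hash_id,
			   const uint32_t *mtime_by_slot,
			   const uint32_t *slot_in_index_order, uint32_t nr)
{
	size_t off = 0;
	uint32_t i;

	assert(out && (nr == 0 || (mtime_by_slot && slot_in_index_order)));
	sketch_store_be32(out + off, 0x4d544d45u); /* "MTME" */
	off += 4;
	sketch_store_be32(out + off, 1);           /* format version */
	off += 4;
	sketch_store_be32(out + off, hash_id);     /* 1 = SHA-1, 2 = SHA-256 */
	off += 4;
	for (i = 0; i < nr; i++) {
		sketch_store_be32(out + off,
				  mtime_by_slot[slot_in_index_order[i]]);
		off += 4;
	}
	return off;
}
```

The indirection through the permutation mirrors the comment added to
pack-objects.h: mtimes are kept per packing-list slot while writing, but
emitted in index order.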

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index 27b171e440..23c0342018 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -277,6 +281,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -479,6 +547,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -491,9 +560,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (13 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In an upcoming patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c
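
The helper finds its pack by turning each pack's basename into the
matching ".mtimes" name and comparing that against its argument. A
standalone sketch of that name mapping, using only libc and a made-up
function name (the real code uses strbuf helpers instead):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the ".mtimes" name matching "pack_name" into "out".
 * Returns 0 on success, or -1 if pack_name does not end in ".pack" or
 * the result does not fit in out_len bytes.
 */
int sketch_mtimes_name(const char *pack_name, char *out, size_t out_len)
{
	static const char suffix[] = ".pack";
	size_t len, stem;
	int n;

	assert(pack_name && out);
	len = strlen(pack_name);
	if (len < sizeof(suffix) - 1 ||
	    strcmp(pack_name + len - (sizeof(suffix) - 1), suffix))
		return -1;
	stem = len - (sizeof(suffix) - 1);
	n = snprintf(out, out_len, "%.*s.mtimes", (int)stem, pack_name);
	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

So an argument such as "pack-1234.mtimes" selects the pack whose file is
named "pack-1234.pack"; the hex part here is just a placeholder.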

diff --git a/Makefile b/Makefile
index a299580b7c..0b6eab0453 100644
--- a/Makefile
+++ b/Makefile
@@ -738,6 +738,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 0424f7adf5..d2eacd302d 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -48,6 +48,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c876e8246f..960cc27ef7 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -38,6 +38,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (12 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry().
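
As a rough illustration of the pattern (not the code in this patch), the
sketch below shows a creation function that hands back the entry it just
made, so that a caller can modify it without a second lookup. All names
here (toy_entry, toy_create_entry(), toy_add_with_mtime(), and the fixed
size array) are hypothetical stand-ins and are not Git's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct object_entry'; not Git's definition. */
struct toy_entry {
	uint32_t hash;
	unsigned no_try_delta : 1;
	uint32_t cruft_mtime;
};

#define TOY_MAX_ENTRIES 16
struct toy_entry toy_entries[TOY_MAX_ENTRIES];
size_t toy_nr;

/* Create an entry and return it so the caller need not search for it. */
struct toy_entry *toy_create_entry(uint32_t hash, int no_try_delta)
{
	struct toy_entry *entry;

	if (toy_nr == TOY_MAX_ENTRIES)
		return NULL; /* toy table is full */
	entry = &toy_entries[toy_nr++];
	entry->hash = hash;
	entry->no_try_delta = !!no_try_delta;
	entry->cruft_mtime = 0;
	return entry;
}

/* A caller which immediately modifies the entry it just created. */
struct toy_entry *toy_add_with_mtime(uint32_t hash, uint32_t mtime)
{
	struct toy_entry *entry = toy_create_entry(hash, 0);

	if (entry && mtime > entry->cruft_mtime)
		entry->cruft_mtime = mtime;
	return entry;
}
```

Had toy_create_entry() returned void, toy_add_with_mtime() would have to
search the table for the entry it had only just inserted.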

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ac927047c..c6d16872ee 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1516,13 +1516,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1539,6 +1539,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 10:04     ` Junio C Hamano
  2022-05-18 23:11   ` [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (11 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to remain (i.e., those the caller did not
    mark for removal) are marked as kept in-core.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
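
The classification of packs can be sketched roughly as follows. This is
a simplified illustration, assuming (as read_cruft_objects() in the diff
below does) that input lines with a `-` prefix name packs pending
deletion, that other non-empty lines name surviving packs, and that only
the packs pending deletion are left unmarked. The toy_* names are
hypothetical and the linear scan stands in for the sorted string lists
used by the real code.

```c
#include <stddef.h>
#include <string.h>

enum toy_pack_fate {
	TOY_PACK_FRESH,   /* listed without a prefix: survives */
	TOY_PACK_DISCARD, /* listed with a '-' prefix: about to be removed */
	TOY_PACK_UNKNOWN  /* not listed at all: the caller never saw it */
};

/* Decide the fate of 'pack_name' given the caller's input lines. */
enum toy_pack_fate toy_classify_pack(const char *pack_name,
				     const char **lines, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const char *line = lines[i];

		if (!*line)
			continue; /* blank lines are skipped */
		if (*line == '-') {
			if (!strcmp(line + 1, pack_name))
				return TOY_PACK_DISCARD;
		} else if (!strcmp(line, pack_name)) {
			return TOY_PACK_FRESH;
		}
	}
	return TOY_PACK_UNKNOWN;
}

/*
 * Only objects in packs which are not kept in-core are candidates for
 * the cruft pack; surviving and unknown packs are both treated as kept.
 */
int toy_pack_kept_in_core(enum toy_pack_fate fate)
{
	return fate != TOY_PACK_DISCARD;
}
```

Treating unknown packs as kept errs on the side of leaving their objects
alone, since nothing is known about which of them are reachable.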

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5329-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5329-pack-objects-cruft.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c6d16872ee..9cf89be673 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1260,6 +1263,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3397,6 +3403,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3529,7 +3664,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3553,7 +3705,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -3959,6 +4137,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4085,7 +4267,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4112,6 +4294,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4168,7 +4359,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4176,6 +4367,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else if (pfd.have_revs) {
diff --git a/object-file.c b/object-file.c
index 5ffbf3d4fd..ff0cffe68e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -997,7 +997,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 2c4671ed7a..c41609e8db 100644
--- a/object-store.h
+++ b/object-store.h
@@ -330,6 +330,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 /**
  * format_object_header() is a thin wrapper around s xsnprintf() that
  * writes the initial "<type> <obj-len>" part of the loose object
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5329-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (10 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a few things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
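
To make the shape of that interface concrete, here is a minimal,
self-contained sketch. It is an illustration only: the toy_* types and
functions are hypothetical, a plain array stands in for the object
store, and the cutoff comparison is an assumption of the sketch rather
than a statement about reachable.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; Git's real types are much richer. */
struct toy_pack {
	int kept_in_core;
};

struct toy_object {
	uint32_t mtime;
	struct toy_pack *pack; /* NULL for a loose object */
	long offset;           /* meaningful only when 'pack' is set */
};

typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack, long offset,
			   uint32_t mtime);

/*
 * Visit every object newer than 'cutoff'. When
 * 'ignore_in_core_kept_packs' is set, objects in kept packs are skipped.
 * Each visited object is reported through 'cb', which may be NULL.
 * Returns the number of objects visited.
 */
size_t toy_add_recent(const struct toy_object *objs, size_t nr,
		      uint32_t cutoff, toy_report_fn *cb,
		      int ignore_in_core_kept_packs)
{
	size_t i, visited = 0;

	for (i = 0; i < nr; i++) {
		const struct toy_object *obj = &objs[i];

		if (ignore_in_core_kept_packs && obj->pack &&
		    obj->pack->kept_in_core)
			continue;
		if (obj->mtime <= cutoff)
			continue; /* too old to count as recent */
		visited++;
		if (cb)
			cb(obj, obj->pack, obj->offset, obj->mtime);
	}
	return visited;
}

/* Example callback: remember the newest mtime reported so far. */
uint32_t toy_newest_seen;

void toy_track_newest(const struct toy_object *obj, struct toy_pack *pack,
		      long offset, uint32_t mtime)
{
	(void)obj;
	(void)pack;
	(void)offset;
	if (mtime > toy_newest_seen)
		toy_newest_seen = mtime;
}
```

Existing callers which do not care about timestamps can pass a NULL
callback and 0 for the flag, which keeps their behavior unchanged.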

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9cf89be673..3b8bf6a3dd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3957,7 +3957,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index b9f4ad886e..d4507c4270 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (9 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here.
    load_pack_mtimes() exits early if the .mtimes file has already
    been opened and parsed, so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
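
As a rough illustration of the lookup described above, here is a minimal,
self-contained sketch. It does not use Git's real data structures: the
`toy_pack` struct and the `toy_load_mtimes()` and `toy_object_mtime()`
helpers are hypothetical stand-ins for 'struct packed_git',
load_pack_mtimes() and the logic in add_recent_packed(), and the "on-disk"
table is just an in-memory array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few 'struct packed_git' fields used here. */
struct toy_pack {
	uint32_t pack_mtime;         /* mtime of the pack file itself */
	int is_cruft;                /* set when the pack has a .mtimes file */
	const uint32_t *mtimes_disk; /* pretend on-disk .mtimes table */
	size_t nr_objects;
	const uint32_t *mtimes;      /* NULL until the table is loaded */
	int nr_loads;                /* how many times the slow path ran */
};

/*
 * Load the per-object table once. Later calls return early, so only
 * the first call pays the cost. Returns 0 on success, -1 on failure.
 */
static int toy_load_mtimes(struct toy_pack *p)
{
	if (p->mtimes)
		return 0;
	if (!p->mtimes_disk)
		return -1;
	p->mtimes = p->mtimes_disk;
	p->nr_loads++;
	return 0;
}

/*
 * Store the mtime to report for the object at position 'pos' in 'out'.
 * Objects in a cruft pack use their own entry from the table, objects
 * in any other pack use the mtime of the pack itself.
 * Returns 0 on success, -1 on a bad position or a failed load.
 */
static int toy_object_mtime(struct toy_pack *p, size_t pos, uint32_t *out)
{
	if (pos >= p->nr_objects)
		return -1;
	if (!p->is_cruft) {
		*out = p->pack_mtime;
		return 0;
	}
	if (toy_load_mtimes(p) < 0)
		return -1;
	*out = p->mtimes[pos];
	return 0;
}
```

Calling toy_object_mtime() repeatedly on the same cruft pack only bumps
`nr_loads` once, which mirrors the "only the first call is slow" point.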

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index d4507c4270..aba63ebeb3 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (8 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save an
unreachable object which would otherwise have been pruned whenever some
other recent (i.e., non-prunable) unreachable object reaches it. We'll
call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file that records a value older than the expiration window
for that object, the object will continue to survive cruft repacks so
long as any objects which reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
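
To make the selection rule above concrete, here is a toy model of it. This
is only a sketch under simplifying assumptions, not the pack-objects
implementation: `toy_object`, `toy_visit()` and `toy_select_cruft()` are
hypothetical names, the object graph is a small acyclic array of indices,
and treating an mtime equal to the expiration as "recent" is a choice made
for the sketch.

```c
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical object: an mtime, a kept-pack flag, and outgoing links. */
struct toy_object {
	long mtime;
	int in_kept_pack;
	int children[TOY_MAX_CHILDREN]; /* indices of objects this one reaches */
	int nr_children;
};

/*
 * Walk from objs[idx], adding every object visited to the packing list.
 * The walk stops at objects in a kept pack and at objects already added.
 * Assumes the graph is acyclic, as an object graph is.
 */
static void toy_visit(const struct toy_object *objs, int idx, int *packed)
{
	int i;

	if (objs[idx].in_kept_pack || packed[idx])
		return;
	packed[idx] = 1;
	for (i = 0; i < objs[idx].nr_children; i++)
		toy_visit(objs, objs[idx].children[i], packed);
}

/*
 * Fill packed[] with 1 for each object that belongs in the cruft pack
 * and return how many there are. The tips of the walk are the objects
 * outside the kept packs whose mtime is within the grace period.
 * Older objects are only added if the walk reaches them. No mtime is
 * ever modified.
 */
static int toy_select_cruft(const struct toy_object *objs, int nr,
			    long expiration, int *packed)
{
	int i, count = 0;

	for (i = 0; i < nr; i++)
		packed[i] = 0;
	for (i = 0; i < nr; i++)
		if (!objs[i].in_kept_pack && objs[i].mtime >= expiration)
			toy_visit(objs, i, packed);
	for (i = 0; i < nr; i++)
		count += packed[i];
	return count;
}
```

Because toy_select_cruft() never touches `mtime`, running it again with the
same expiration yields the same set: an old object is excluded from the
tips and then rediscovered through the recent object that reaches it.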

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 reachable.h                   |   4 +-
 t/t5329-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3b8bf6a3dd..8decc9dc0c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3447,6 +3447,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3472,6 +3510,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3523,7 +3605,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/reachable.h b/reachable.h
index b776761baa..020a887b99 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,10 +1,10 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-#include "object.h"
-
 struct progress;
 struct rev_info;
+struct object;
+struct packed_git;
 
 typedef void report_recent_object_fn(const struct object *, struct packed_git *,
 				     off_t, time_t);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 11:29     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:11   ` [PATCH v4 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (7 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 105 +++++++++++-
 t/t5329-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 319 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index c0f583cd48..d81f3a8982 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index e7a3920c6d..593c18d4e8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,12 +18,18 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -305,9 +311,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -344,6 +347,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -605,6 +610,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -621,7 +687,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -636,6 +701,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -688,6 +758,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -771,7 +850,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -833,6 +913,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 939cdc297a..06c550c958 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v4 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (6 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
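
The way these values are resolved in the diff below can be sketched in
isolation. This is a simplified model with hypothetical names
(`toy_po_args`, `toy_fill_cruft_args()`), not the code in builtin/repack.c:
each field set from a `repack.cruft*` variable is kept, and any field left
unset falls back to the value used for the main pack. A field that is still
NULL afterwards would simply not be passed to pack-objects.

```c
#include <stddef.h>

/* Hypothetical subset of the tunables handed to pack-objects. */
struct toy_po_args {
	const char *window;
	const char *window_memory;
	const char *depth;
	const char *threads;
};

/*
 * 'cruft' starts out holding whatever the repack.cruft* configuration
 * set (NULL where nothing was configured). Each unset field inherits
 * the corresponding value from 'main_args'. Configured fields are
 * left untouched.
 */
static void toy_fill_cruft_args(struct toy_po_args *cruft,
				const struct toy_po_args *main_args)
{
	if (!cruft->window)
		cruft->window = main_args->window;
	if (!cruft->window_memory)
		cruft->window_memory = main_args->window_memory;
	if (!cruft->depth)
		cruft->depth = main_args->depth;
	if (!cruft->threads)
		cruft->threads = main_args->threads;
}
```

With `window` configured as "2" for cruft packs and "3" given for the main
pack, the cruft pack keeps "2", which is the behavior the
'cruft repack respects repack.cruftWindow' test below checks for.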

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 49 +++++++++++++------
 t/t5329-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 41ac6953c8..c79af6d7b8 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -30,3 +30,12 @@ repack.updateServerInfo::
 	If set to false, linkgit:git-repack[1] will not run
 	linkgit:git-update-server-info[1]. Defaults to true. Can be overridden
 	when true by the `-n` option of linkgit:git-repack[1].
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index 593c18d4e8..b85483a148 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -41,9 +41,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -65,6 +77,14 @@ static int repack_config(const char *var, const char *value, void *cb)
 		run_update_server_info = git_config_bool(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
 	return git_default_config(var, value, cb);
 }
 
@@ -157,18 +177,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -692,6 +700,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int keep_unreachable = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -746,7 +755,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -921,7 +930,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 06c550c958..e4744e4465 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 14/17] builtin/repack.c: use named flags for existing_packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (5 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
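
As a rough illustration of the pattern (not code from this series), the
sketch below treats a `void *` payload as a small bitfield. `struct item`
is a hypothetical stand-in for `struct string_list_item`, and `OTHER_STATE`
is a made-up second bit; only `DELETE_PACK` comes from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1 /* name taken from the patch */
#define OTHER_STATE 2 /* hypothetical second bit, for illustration only */

/* Hypothetical stand-in for git's struct string_list_item. */
struct item {
	const char *string;
	void *util;
};

/* OR a flag into the pointer-sized payload without clobbering other bits. */
static void item_set_flag(struct item *item, uintptr_t flag)
{
	assert(item != NULL);
	item->util = (void *)((uintptr_t)item->util | flag);
}

/* Test a single bit instead of relying on the pointer being NULL or not. */
static int item_has_flag(const struct item *item, uintptr_t flag)
{
	assert(item != NULL);
	return ((uintptr_t)item->util & flag) != 0;
}
```

The point of testing a named bit is that a second state can be stored in
the same field later without changing the meaning of the existing checks.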

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index b85483a148..36d1f03671 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -564,7 +566,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1002,7 +1004,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)(uintptr_t)((size_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1026,7 +1029,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-19 11:32     ` Ævar Arnfjörð Bjarmason
  2022-05-18 23:11   ` [PATCH v4 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (4 subsequent siblings)
  19 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when doing a
geometric repack. Since the "make that object reachable" step doesn't
necessarily mean that we'll create a new copy of that object in one of
the packs that will get rolled up as part of a geometric repack, it's
possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
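
To make the resulting selection rule concrete, here is a minimal sketch of
how an existing non-kept pack could be classified after this change. The
helper `midx_wants_existing_pack()` is hypothetical and does not exist in
builtin/repack.c; it only condenses the two loops in
midx_included_packs() into one predicate over the flag bits.

```c
#include <assert.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/*
 * Hypothetical helper (not part of the patch): should an existing
 * non-kept pack with these flags be listed from the
 * existing_nonkept_packs loop?
 *
 * In geometric mode only cruft packs are picked up here; non-cruft
 * packs that survive are added from the geometry's own list instead.
 * Otherwise, every pack that is not marked for deletion is included.
 */
static int midx_wants_existing_pack(uintptr_t flags, int geometric)
{
	assert(!(flags & ~(uintptr_t)(DELETE_PACK | CRUFT_PACK)));
	if (geometric)
		return (flags & CRUFT_PACK) != 0;
	return !(flags & DELETE_PACK);
}
```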

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 19 +++++++++++++++++--
 t/t5329-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 36d1f03671..e9e3a2b4e3 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -161,8 +162,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		else {
+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -564,6 +568,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index e4744e4465..13158e4ab7 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:11   ` [PATCH v4 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
                     ` (3 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
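
As I read the add_repack_all_option() hunk below, the mode passed to
`git repack` is chosen roughly as in this sketch. `gc_repack_mode()` is a
hypothetical helper written for illustration; the real code pushes
arguments onto a strvec rather than returning a string.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical condensation of add_repack_all_option(): return the
 * main mode flag that would be handed to `git repack`.
 */
static const char *gc_repack_mode(const char *prune_expire, int cruft_packs)
{
	assert(cruft_packs == 0 || cruft_packs == 1);

	/* Pruning everything right away: nothing unreachable is kept. */
	if (prune_expire && !strcmp(prune_expire, "now"))
		return "-a";
	/* Opt-in: collect unreachable objects into a cruft pack. */
	if (cruft_packs)
		return "--cruft";
	/* Default: explode unreachable objects as loose. */
	return "-A";
}
```

Note that under this reading, `--prune=now` takes precedence over
`--cruft`, since there is nothing left to store in a cruft pack.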

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5329-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately into a
+	cruft pack instead of storing them as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index b335cffa33..4d995e85e9 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -42,6 +42,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -152,6 +153,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -331,7 +333,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -551,6 +557,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -670,6 +677,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 13158e4ab7..3910e186ef 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v4 17/17] sha1-file.c: don't freshen cruft packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-05-18 23:11   ` Taylor Blau
  2022-05-18 23:48   ` [PATCH v4 00/17] " Derrick Stolee
                     ` (2 subsequent siblings)
  19 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-18 23:11 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file
which may incur some wasted effort especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, force the freshening code to avoid an optimizing write by
writing out the object loose and letting it pick up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
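
The decision can be sketched as follows. This is a simplified model, not
the code in object-file.c: `struct fake_pack` is a hypothetical stand-in
for `struct packed_git`, a NULL pointer stands for "no pack contains the
object", and the sketch assumes that updating the pack's mtime always
succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields of struct packed_git used here. */
struct fake_pack {
	int is_cruft;
	int freshened;
};

/*
 * Return 1 if the packed copy was freshened (the caller may skip
 * writing the object), or 0 if the caller must write a loose copy,
 * which then picks up a current mtime.
 */
static int freshen_packed_sketch(struct fake_pack *p)
{
	if (!p)
		return 0; /* object is not in any pack */

	/* In this model a cruft pack is never marked as freshened. */
	assert(!p->is_cruft || !p->freshened);

	if (p->is_cruft)
		return 0; /* never freshen a cruft pack; fall back to loose */
	if (p->freshened)
		return 1; /* already done once in this process */
	p->freshened = 1; /* assumes the mtime update succeeded */
	return 1;
}
```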

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5329-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index ff0cffe68e..495a359200 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2035,6 +2035,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 3910e186ef..4681558612 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca

^ permalink raw reply related	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 00/17] cruft packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-05-18 23:11   ` [PATCH v4 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-05-18 23:48   ` Derrick Stolee
  2022-05-20 23:19     ` Junio C Hamano
  2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
  2022-05-19 11:54   ` [PATCH v4 00/17] cruft packs Ævar Arnfjörð Bjarmason
  19 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-05-18 23:48 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: avarab, gitster, jrnieder, larsxschneider, tytso

On 5/18/2022 7:10 PM, Taylor Blau wrote:
> Here is another reroll of my series to implement "cruft packs", which is based
> on the v2.36 tree, and incorporates feedback from the discussion we had about
> mixed-version GCs with cruft packs in [1].
> 
> The changes here are limited to:
> 
>   - a cautionary note in Documentation/technical/cruft-packs.txt
>     describing the potential interaction between pruning GCs across pre-
>     and post-cruft pack versions of Git, as discussed towards the bottom
>     of [2]

I think this documentation is sufficient to guard against this issue,
which is not so critical as to do something more involved. When users
opt-in to using cruft packs, they should know about their scenario
enough to know if they would stumble into this issue.

>   - updating the `finalize_hashfile()` calls for writing `.mtimes` files
>     to indicate that they are `FSYNC_COMPONENT_PACK_METADATA`, since the
>     original version of this series predates the fine-grained fsync
>     configuration in 2.36.

Good to have this update and not require it to be handled at merge
time by the maintainer.

> As always, a range-diff is below. Thanks in advance for taking another
> look!

Looking at the range-diff, I'm happy with this version.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-05-19 10:04     ` Junio C Hamano
  2022-05-19 15:16       ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 10:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, jrnieder, larsxschneider, tytso

Taylor Blau <me@ttaylorr.com> writes:

> @@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
>  	return 0;
>  }
>  
> +static int option_parse_cruft_expiration(const struct option *opt,
> +					 const char *arg, int unset)
> +{
> +	if (unset) {
> +		cruft = 0;
> +		cruft_expiration = 0;
> +	} else {
> +		cruft = 1;
> +		if (arg)
> +			cruft_expiration = approxidate(arg);
> +	}
> +	return 0;
> +}

It is somewhat sad that we have to invent this function, instead of
using parse_opt_expiry_date_cb().

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-19 10:40     ` Ævar Arnfjörð Bjarmason
  2022-05-19 15:21       ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 10:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Wed, May 18 2022, Taylor Blau wrote:

Nit:

> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

Here we let it suffice that later we'll say "All 4-byte numbers are in
network order".

> +  - A table of 4-byte unsigned integers in network order. The ith

But here we call out "network order" explicitly, shouldn't this just be
s/ in network order//?

> +    value is the modification time (mtime) of the ith object in the
> +    corresponding pack by lexicographic (index) order. The mtimes
> +    count standard epoch seconds.
> +
> +  - A trailer, containing a checksum of the corresponding packfile,
> +    and a checksum of all of the above (each having length according
> +    to the specified hash function).
> +
> +All 4-byte numbers are in network order.

I.e. this is sufficient.
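
For readers following along, a reader for the three header fields quoted
above might look like the sketch below. It is written from the quoted
description only; `parse_mtimes_header()`, `read_be32()` and
`struct mtimes_header` are hypothetical names and are not taken from
pack-mtimes.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTIMES_MAGIC 0x4d544d45u /* 'MTME', from the quoted format */
#define MTIMES_HEADER_SIZE 12    /* three 4-byte fields */

/* Hypothetical in-memory view of the header. */
struct mtimes_header {
	uint32_t version;
	uint32_t hash_id;
};

/* Decode a 4-byte unsigned integer stored in network (big-endian) order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return 0 on success, -1 if the buffer is not a recognized header. */
static int parse_mtimes_header(const unsigned char *buf, size_t len,
			       struct mtimes_header *out)
{
	assert(buf != NULL && out != NULL);

	if (len < MTIMES_HEADER_SIZE)
		return -1;
	if (read_be32(buf) != MTIMES_MAGIC)
		return -1;
	out->version = read_be32(buf + 4);
	out->hash_id = read_be32(buf + 8);
	if (out->version != 1)
		return -1;
	/* 1 for SHA-1, 2 for SHA-256 */
	if (out->hash_id != 1 && out->hash_id != 2)
		return -1;
	return 0;
}
```

Every field goes through the same big-endian decoder, which is why a single
"all 4-byte numbers are in network order" sentence covers the whole format.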

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack
  2022-05-18 23:11   ` [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-05-19 11:29     ` Ævar Arnfjörð Bjarmason
  2022-05-20 22:39       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Wed, May 18 2022, Taylor Blau wrote:

> +		tip="$(git rev-parse cruft)" &&

Here we don't hide the exit status of "git", as it'll be reflected in what's &&-chained.

> +		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&

But here we do, as we'll get the exit status of test_oid_to_path. But as
we just rev parsed it shouldn't this be $tip in any case?

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-05-18 23:11   ` [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-05-19 11:32     ` Ævar Arnfjörð Bjarmason
  2022-05-20 22:42       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Wed, May 18 2022, Taylor Blau wrote:

> When using cruft packs, the following race can occur when a geometric
> repack that writes a MIDX bitmap takes place afterwards:
>
>   - First, create an unreachable object and do an all-into-one cruft
>     repack which stores that object in the repository's cruft pack.
>   - Then make that object reachable.
>   - Finally, do a geometric repack and write a MIDX bitmap.
>
> Assuming that we are sufficiently unlucky as to select a commit from the
> MIDX which reaches that object for bitmapping, then the `git
> multi-pack-index` process will complain that that object is missing.
>
> The reason is because we don't include cruft packs in the MIDX when
> doing a geometric repack. Since the "make that object reachable" doesn't
> necessarily mean that we'll create a new copy of that object in one of
> the packs that will get rolled up as part of a geometric repack, it's
> possible that the MIDX won't see any copies of that now-reachable
> object.
>
> Of course, it's desirable to avoid including cruft packs in the MIDX
> because it causes the MIDX to store a bunch of objects which are likely
> to get thrown away. But excluding that pack does open us up to the above
> race.
>
> This patch demonstrates the bug, and resolves it by including cruft
> packs in the MIDX even when doing a geometric repack.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/repack.c              | 19 +++++++++++++++++--
>  t/t5329-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 36d1f03671..e9e3a2b4e3 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -23,6 +23,7 @@
>  #define PACK_CRUFT 4
>  
>  #define DELETE_PACK 1
> +#define CRUFT_PACK 2
>  
>  static int pack_everything;
>  static int delta_base_offset = 1;
> @@ -161,8 +162,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
>  		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
>  		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
>  			string_list_append_nodup(fname_kept_list, fname);
> -		else
> -			string_list_append_nodup(fname_nonkept_list, fname);
> +		else {
> +			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);

Nit: very long line, and we end up with {} just on the else, not the if.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* [RFC PATCH 0/2] Utility functions for duplicated pack(write) code
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (17 preceding siblings ...)
  2022-05-18 23:48   ` [PATCH v4 00/17] " Derrick Stolee
@ 2022-05-19 11:42   ` Ævar Arnfjörð Bjarmason
  2022-05-19 11:42     ` [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function Ævar Arnfjörð Bjarmason
                       ` (2 more replies)
  2022-05-19 11:54   ` [PATCH v4 00/17] cruft packs Ævar Arnfjörð Bjarmason
  19 siblings, 3 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Taylor Blau, derrickstolee, jrnieder,
	larsxschneider, tytso, Ævar Arnfjörð Bjarmason

Minor cleanups that would semantically & textually conflict with
Taylor's
https://lore.kernel.org/git/cover.1652915424.git.me@ttaylorr.com/; but
which I noted while reading through it.

The 2/2 here is something I wrote before spotting
https://lore.kernel.org/git/1d775f9850f00b0c3d1e9133669a6365c8d7bbba.1652915424.git.me@ttaylorr.com/;
which does pretty much the same thing, but IMO it's better to put this
in hash.h than chunk-format.h.

The 1/2 then fixes the minor NEEDSWORK in that series:
https://lore.kernel.org/git/8f9fd21be9fcdda5c73d800fc66d1087d61a6888.1652915424.git.me@ttaylorr.com/

All of this can be ignored for now, I can submit it after cruft packs
land (if I remember), or if Taylor's interested in picking it up in
some way...

But I figured it was useful to send it along in lieu of "maybe do it
this way" (2/2) or "can we just create a utility function for this?"
(1/2) comments on the series itself.

Ævar Arnfjörð Bjarmason (2):
  packfile API: add and use a pack_name_to_ext() utility function
  hash API: add and use a hash_short_id_by_algo() function

 commit-graph.c  | 18 +++---------------
 hash.h          | 26 ++++++++++++++++++++++++--
 midx.c          | 18 +++---------------
 pack-bitmap.c   |  6 +-----
 pack-revindex.c |  5 +----
 pack-write.c    | 12 +-----------
 packfile.c      | 14 ++++++++++----
 packfile.h      |  9 +++++++++
 8 files changed, 52 insertions(+), 56 deletions(-)

-- 
2.36.1.952.g6652f7f0e6b


^ permalink raw reply	[flat|nested] 201+ messages in thread

* [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function
  2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
@ 2022-05-19 11:42     ` Ævar Arnfjörð Bjarmason
  2022-05-19 15:40       ` Junio C Hamano
  2022-05-19 11:42     ` [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function Ævar Arnfjörð Bjarmason
  2022-05-19 15:31     ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Junio C Hamano
  2 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Taylor Blau, derrickstolee, jrnieder,
	larsxschneider, tytso, Ævar Arnfjörð Bjarmason

Add and use a pack_name_to_ext() utility function for the copy/pasted
cases of creating a FOO.ext file given a string like FOO.pack.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 pack-bitmap.c   |  6 +-----
 pack-revindex.c |  5 +----
 packfile.c      | 14 ++++++++++----
 packfile.h      |  9 +++++++++
 4 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 97909d48da3..0c3770d038d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -302,11 +302,7 @@ char *midx_bitmap_filename(struct multi_pack_index *midx)
 
 char *pack_bitmap_filename(struct packed_git *p)
 {
-	size_t len;
-
-	if (!strip_suffix(p->pack_name, ".pack", &len))
-		BUG("pack_name does not end in .pack");
-	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
+	return pack_name_to_ext(p->pack_name, "bitmap");
 }
 
 static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
diff --git a/pack-revindex.c b/pack-revindex.c
index 08dc1601679..69dc5688796 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -179,10 +179,7 @@ static int create_pack_revindex_in_memory(struct packed_git *p)
 
 static char *pack_revindex_filename(struct packed_git *p)
 {
-	size_t len;
-	if (!strip_suffix(p->pack_name, ".pack", &len))
-		BUG("pack_name does not end in .pack");
-	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
+	return pack_name_to_ext(p->pack_name, "rev");
 }
 
 #define RIDX_HEADER_SIZE (12)
diff --git a/packfile.c b/packfile.c
index 835b2d27164..bd6ad441bf5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -191,15 +191,12 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 int open_pack_index(struct packed_git *p)
 {
 	char *idx_name;
-	size_t len;
 	int ret;
 
 	if (p->index_data)
 		return 0;
 
-	if (!strip_suffix(p->pack_name, ".pack", &len))
-		BUG("pack_name does not end in .pack");
-	idx_name = xstrfmt("%.*s.idx", (int)len, p->pack_name);
+	idx_name = pack_name_to_ext(p->pack_name, "idx");
 	ret = check_packed_git_idx(idx_name, p);
 	free(idx_name);
 	return ret;
@@ -2266,3 +2263,12 @@ int is_promisor_object(const struct object_id *oid)
 	}
 	return oidset_contains(&promisor_objects, oid);
 }
+
+char *pack_name_to_ext(const char *pack_name, const char *ext)
+{
+	size_t len;
+
+	if (!strip_suffix(pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	return xstrfmt("%.*s.%s", (int)len, pack_name, ext);
+}
diff --git a/packfile.h b/packfile.h
index a3f6723857b..6890c57ebdb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -195,4 +195,13 @@ int is_promisor_object(const struct object_id *oid);
 int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 	     size_t idx_size, struct packed_git *p);
 
+/**
+ * Given a string like "foo.pack" and "ext" returns an xstrdup()'d
+ * "foo.ext" string. Used for creating e.g. PACK.{bitmap,rev,...}
+ * filenames from PACK.pack.
+ *
+ * Will BUG() if the expected string can't be created from the
+ * "pack_name" argument.
+ */
+char *pack_name_to_ext(const char *pack_name, const char *ext);
 #endif
-- 
2.36.1.952.g6652f7f0e6b


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function
  2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
  2022-05-19 11:42     ` [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function Ævar Arnfjörð Bjarmason
@ 2022-05-19 11:42     ` Ævar Arnfjörð Bjarmason
  2022-05-19 15:50       ` Junio C Hamano
  2022-05-19 15:31     ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Junio C Hamano
  2 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:42 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Taylor Blau, derrickstolee, jrnieder,
	larsxschneider, tytso, Ævar Arnfjörð Bjarmason

Add and use a hash_short_id_by_algo() function. As noted in the
comment being modified here (added in [1]) the intention wasn't to
have these end up in on-disk formats, but since [2], [3] and [3]
that's been the case, and there's an outstanding patch to add another
format that uses these[5].

So let's expose this functionality as a documented utility function,
instead of copy/pasting this code in various places.

Replacing the die() in the existing functions with a BUG() might be
overzealous, it's correct for the case of
e.g. write_commit_graph_file() and write_midx_header(), but we also
use this for parsing on-disk files, e.g. in parse_commit_graph().

We could add a "gently" version of this, but for now I think that
worrying about the distinction would be worrying too much. If we ever
end up parsing such files that'll almost certainly be a bug in our own
writing code, so the distinction would be rather academic, even though
such files could theoretically occur without a bug of ours.

1. f50e766b7b3 (Add structure representing hash algorithm, 2017-11-12)
2. 665d70ad033 (commit-graph: use the "hash version" byte, 2020-08-17)
3. d96075428a9 (multi-pack-index: use hash version byte, 2020-08-17)
4. 8ef50d9958f (pack-write.c: prepare to write 'pack-*.rev' files, 2021-01-25)
5. https://lore.kernel.org/git/1d775f9850f00b0c3d1e9133669a6365c8d7bbba.1652915424.git.me@ttaylorr.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 commit-graph.c | 18 +++---------------
 hash.h         | 26 ++++++++++++++++++++++++--
 midx.c         | 18 +++---------------
 pack-write.c   | 12 +-----------
 4 files changed, 31 insertions(+), 43 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 06107beedcb..157de4dd717 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != hash_short_id_by_algo()) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, hash_short_id_by_algo());
 		return NULL;
 	}
 
@@ -1924,7 +1912,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, hash_short_id_by_algo());
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/hash.h b/hash.h
index 5d40368f18a..31293401809 100644
--- a/hash.h
+++ b/hash.h
@@ -80,8 +80,7 @@ static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
 
 /*
  * Note that these constants are suitable for indexing the hash_algos array and
- * comparing against each other, but are otherwise arbitrary, so they should not
- * be exposed to the user or serialized to disk.  To know whether a
+ * comparing against each other, but are otherwise arbitrary. To know whether a
  * git_hash_algo struct points to some usable hash function, test the format_id
  * field for being non-zero.  Use the name field for user-visible situations and
  * the format_id field for fixed-length fields on disk.
@@ -337,4 +336,27 @@ static inline void oid_set_algo(struct object_id *oid, const struct git_hash_alg
 const char *empty_tree_oid_hex(void);
 const char *empty_blob_oid_hex(void);
 
+/**
+ * Convert GIT_HASH_SHA1 to 1, GIT_HASH_SHA256 to 2 etc.
+ *
+ * It's preferable to use GIT_{SHA1,SHA256}_FORMAT_ID instead for file
+ * formats. The original intention was not to make these short
+ * constants part of any file format.
+ * 
+ * But since that ship has sailed for various on-disk formats this
+ * utility function allows us to do that consistently in one place.
+ */
+static inline int hash_short_id_by_algo(void)
+{
+	int hash_algo = hash_algo_by_ptr(the_hash_algo);
+
+	switch (hash_algo) {
+	case GIT_HASH_SHA1:
+	case GIT_HASH_SHA256:
+		return hash_algo;
+	default:
+		BUG("invalid hash version");
+	}
+}
+
 #endif
diff --git a/midx.c b/midx.c
index 3db0e47735f..2e42afa5f00 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != hash_short_id_by_algo()) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, hash_short_id_by_algo());
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, hash_short_id_by_algo());
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index 51812cb1299..c1ce8f6df8f 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -181,17 +181,7 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
+	uint32_t oid_version = hash_short_id_by_algo();
 
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-- 
2.36.1.952.g6652f7f0e6b


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 04/17] chunk-format.h: extract oid_version()
  2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-05-19 11:44     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Wed, May 18 2022, Taylor Blau wrote:

> There are three definitions of an identical function which converts
> `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
> copy of this function for writing both the commit-graph and
> multi-pack-index file, and another inline definition used to write the
> .rev header.
>
> Consolidate these into a single definition in chunk-format.h. It's not
> clear that this is the best header to define this function in, but it
> should do for now.

Maybe hash.h? :)
https://lore.kernel.org/git/RFC-patch-2.2-051f0612ab9-20220519T113538Z-avarab@gmail.com/

> (Worth noting, the .rev caller expects a 4-byte unsigned, but the other
> two callers work with a single unsigned byte. The consolidated version
> uses the latter type, and lets the compiler widen it when required).

I just went for "int" and had the compiler similarly cast that, which
seems simpler & more obvious, no?

I.e. it seems to me that we really only need these more narrow types at
the time that we write this data, which we already have casts for.

> +uint8_t oid_version(const struct git_hash_algo *algop)
> +{
> +	switch (hash_algo_by_ptr(algop)) {
> +	case GIT_HASH_SHA1:
> +		return 1;
> +	case GIT_HASH_SHA256:
> +		return 2;
> +	default:
> +		die(_("invalid hash version"));
> +	}

As noted in the 2/2 I posted above we have some cases where we really
should have BUG here, and others (reading) which are arguably die(). I
think just going for BUG() makes sense in this case.

But if you're just unifying existing code we can also just keep it
as-is.

FWIW I struggled to come up with a name for this, and ended up with
hash_short_id_by_algo(). Somewhat bikesheddy, but I'd prefer if we fixed
that "oid_version" name while at it, since this really has nothing to do
with an "OID version" (whatever that is).

We only refer to hash versions elsewhere, which collectively describe
the versions of all OIDs we need to handle.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 00/17] cruft packs
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
                     ` (18 preceding siblings ...)
  2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
@ 2022-05-19 11:54   ` Ævar Arnfjörð Bjarmason
  19 siblings, 0 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 11:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Wed, May 18 2022, Taylor Blau wrote:

> Here is another reroll of my series to implement "cruft packs", which is based
> on the v2.36 tree, and incorporates feedback from the discussion we had about
> mixed-version GCs with cruft packs in [1].
>
> The changes here are limited to:
>
>   - a cautionary note in Documentation/technical/cruft-packs.txt
>     describing the potential interaction between pruning GCs across pre-
>     and post-cruft pack versions of Git, as discussed towards the bottom
>     of [2]
>
>   - updating the `finalize_hashfile()` calls for writing `.mtimes` files
>     to indicate that they are `FSYNC_COMPONENT_PACK_METADATA`, since the
>     original version of this series predates the fine-grained fsync
>     configuration in 2.36.
>
> As always, a range-diff is below. Thanks in advance for taking another
> look!

I left some minor & nit-y comments on this v4, but overall I think this
looks really good with not much to add.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt
  2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-05-19 14:04     ` Junio C Hamano
  0 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 14:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, jrnieder, larsxschneider, tytso

Taylor Blau <me@ttaylorr.com> writes:

> +== Caution for mixed-version environments
> +
> +Repositories that have cruft packs in them will continue to work with any older
> +version of Git. Note, however, that previous versions of Git which do not
> ...
> +cruft packs.

I've compared a rebase of the previous iteration on top of v2.36.0
with the result of application of this iteration on the same commit,
and the above additional documentation seems to be the only real
difference.

Will replace and queue.

Thanks.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-19 10:04     ` Junio C Hamano
@ 2022-05-19 15:16       ` Junio C Hamano
  2022-05-20 22:52         ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 15:16 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, jrnieder, larsxschneider, tytso

Junio C Hamano <gitster@pobox.com> writes:

> Taylor Blau <me@ttaylorr.com> writes:
>
>> @@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
>>  	return 0;
>>  }
>>  
>> +static int option_parse_cruft_expiration(const struct option *opt,
>> +					 const char *arg, int unset)
>> +{
>> +	if (unset) {
>> +		cruft = 0;
>> +		cruft_expiration = 0;
>> +	} else {
>> +		cruft = 1;
>> +		if (arg)
>> +			cruft_expiration = approxidate(arg);
>> +	}
>> +	return 0;
>> +}
>
> It is somewhat sad that we have to invent this function, instead of
> using parse_opt_expiry_date_cb().

I failed to mention that this one does more than the bog-standard
callback so the latter cannot be reused as-is, and that is what I
meant by "somewhat sad".  If we can find a way to reuse the
parse_opt_expiry_date_cb() for the purpose of the user of this
function that would be ideal, but only if we can do so without
making the caller too unnatural.  Having two separate values, "did
we get --cruft-expiration option?" and "what's the value of it?",
does benefit the current caller and we do not want to twist it just
for not adding a similar callback---that's a tail wagging a dog.
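
To make the "two separate values" point concrete, here is a minimal
stand-alone sketch of such a callback. It is not the code from the
series: the struct and function names are hypothetical, and strtoul()
stands in for approxidate() only so that the example compiles outside
of Git.

```c
#include <stdlib.h>

/* Hypothetical holder for the two values the caller wants to keep apart. */
struct cruft_opts {
	int cruft;                /* was --cruft-expiration (or --cruft) given? */
	unsigned long expiration; /* parsed value, 0 if none was given */
};

/*
 * Mirrors the shape of the quoted callback: a negated option clears
 * both values, a bare option only sets the flag, and an option with
 * an argument sets the flag and the value.
 * ASSUMPTION: strtoul() replaces approxidate() for illustration.
 */
static void parse_cruft_expiration(struct cruft_opts *o,
				   const char *arg, int unset)
{
	if (unset) {
		o->cruft = 0;
		o->expiration = 0;
		return;
	}
	o->cruft = 1;
	if (arg)
		o->expiration = strtoul(arg, NULL, 10);
}
```

A generic expiry-date callback would only fill in "expiration", which
is why it cannot be reused here as-is.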

Thanks.


^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-19 10:40     ` Ævar Arnfjörð Bjarmason
@ 2022-05-19 15:21       ` Junio C Hamano
  2022-05-20  7:32         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 15:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, derrickstolee, jrnieder, larsxschneider, tytso

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Wed, May 18 2022, Taylor Blau wrote:
>
> Nit:
>
>> +  - A 4-byte magic number '0x4d544d45' ('MTME').
>> +
>> +  - A 4-byte version identifier (= 1).
>> +
>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>
> Here we let it suffice that later we'll say "All 4-byte numbers are in
> network order".
>
>> +  - A table of 4-byte unsigned integers in network order. The ith
>
> But here we call out "network order" explicitly, shouldn't this just be
> s/ in network order//?
>
>> +    value is the modification time (mtime) of the ith object in the
>> +    corresponding pack by lexicographic (index) order. The mtimes
>> +    count standard epoch seconds.
>> +
>> +  - A trailer, containing a checksum of the corresponding packfile,
>> +    and a checksum of all of the above (each having length according
>> +    to the specified hash function).
>> +
>> +All 4-byte numbers are in network order.
>
> I.e. this is sufficient.

Very good eyes.  One explicit mention among several others can
indeed be misleading the readers.

When asked for "network order", all your search engines show are
entries about "network byte order", so let's use that longer form of
spelling.
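
As an illustration of what "network byte order" means for the three
header fields quoted above, a reader might decode them roughly like
this. This is only a sketch based on the quoted format description;
the names (mtimes_header, read_be32, parse_mtimes_header) are made up
for the example and are not the functions from the series.

```c
#include <stddef.h>
#include <stdint.h>

#define MTIMES_MAGIC 0x4d544d45u /* 'MTME' */

/* Hypothetical in-memory form of the header fields after the magic. */
struct mtimes_header {
	uint32_t version; /* expected to be 1 */
	uint32_t hash_id; /* 1 for SHA-1, 2 for SHA-256 */
};

/* Decode a 4-byte big-endian (network byte order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse the 12-byte header. Returns 0 on success and -1 on any
 * mismatch; the data comes from disk, so a bad value is reported to
 * the caller rather than treated as a programming error.
 */
static int parse_mtimes_header(const unsigned char *buf, size_t len,
			       struct mtimes_header *out)
{
	if (len < 12)
		return -1;
	if (read_be32(buf) != MTIMES_MAGIC)
		return -1;
	out->version = read_be32(buf + 4);
	out->hash_id = read_be32(buf + 8);
	if (out->version != 1)
		return -1;
	if (out->hash_id != 1 && out->hash_id != 2)
		return -1;
	return 0;
}
```

The same read_be32() would apply to each entry of the mtime table,
since the format says all 4-byte numbers share this byte order.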

Thanks.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [RFC PATCH 0/2] Utility functions for duplicated pack(write) code
  2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
  2022-05-19 11:42     ` [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function Ævar Arnfjörð Bjarmason
  2022-05-19 11:42     ` [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function Ævar Arnfjörð Bjarmason
@ 2022-05-19 15:31     ` Junio C Hamano
  2 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 15:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Taylor Blau, derrickstolee, jrnieder, larsxschneider, tytso

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Minor cleanups that would semantically & textually conflict with
> Taylor's
> https://lore.kernel.org/git/cover.1652915424.git.me@ttaylorr.com/; but
> which I noted while reading through it.

Very much appreciated that you marked this as RFC.

It is natural and easy to notice problems in the code that is in
flux, because it is inevitable for anybody working on the codebase
to see the changes in-flight and the original code as they review,
or as they make trial merges of their own work and see conflicts.

But making patches to address them immediately out of spinal reflex
would not help anybody.  Marking them as RFC and calling attention
by those involved in the "other topic" while the code being cleaned
up is still fresh in their mind makes it efficient to review the
clean-up while letting the "other topic" either proceed without
clean-up or with it rolled in.


^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function
  2022-05-19 11:42     ` [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function Ævar Arnfjörð Bjarmason
@ 2022-05-19 15:40       ` Junio C Hamano
  0 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 15:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Taylor Blau, derrickstolee, jrnieder, larsxschneider, tytso

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Add and use a pack_name_to_ext() utility function for the copy/pasted
> cases of creating a FOO.ext file given a string like FOO.pack.

I agree that "remove .pack extension and replace it with .foo
extension" is a common thing to do and it would be a welcome
simplification.

But pack_name_to_ext() sounds like taking pack-123456...9876.pack
and returning its extension (i.e. ".pack") that will become useful
when we introduce a drastically different naming convention and
start calling packfiles in newer format with different extensions
like ".pac4".

I wonder if this has easier-to-understand name that would be equally
(or slightly more) useful?

	char *replace_ext(const char *name, const char *src, const char *dst)
	{
		size_t len;

		if (!strip_suffix(name, src, &len))
			BUG("name '%s' does not end in suffix '%s'", name, src);
		return xstrfmt("%.*s.%s", (int)len, name, dst);
	}
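
For anyone who wants to try the idea outside of the Git tree, a
stand-alone variant could look roughly like the following. It is only
a sketch: replace_suffix() is a hypothetical name, it uses plain libc
instead of strip_suffix()/xstrfmt(), it returns NULL where the version
above would BUG(), and it assumes both suffixes are passed with their
leading dot (".pack", ".idx").

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-alone helper: replace the trailing "src" of
 * "name" with "dst". Both suffixes include their leading dot.
 *
 * Returns a malloc()'d string that the caller must free(), or NULL
 * if "name" does not end in "src" or if allocation fails.
 */
static char *replace_suffix(const char *name, const char *src,
			    const char *dst)
{
	size_t name_len = strlen(name);
	size_t src_len = strlen(src);
	size_t dst_len = strlen(dst);
	size_t stem_len;
	char *out;

	if (name_len < src_len ||
	    memcmp(name + name_len - src_len, src, src_len) != 0)
		return NULL;

	stem_len = name_len - src_len;
	out = malloc(stem_len + dst_len + 1);
	if (!out)
		return NULL;
	memcpy(out, name, stem_len);
	memcpy(out + stem_len, dst, dst_len + 1); /* copies the NUL too */
	return out;
}
```

With this shape, replace_suffix("pack-1234.pack", ".pack", ".idx")
yields "pack-1234.idx", and a name that does not end in ".pack"
yields NULL.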




> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  pack-bitmap.c   |  6 +-----
>  pack-revindex.c |  5 +----
>  packfile.c      | 14 ++++++++++----
>  packfile.h      |  9 +++++++++
>  4 files changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 97909d48da3..0c3770d038d 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -302,11 +302,7 @@ char *midx_bitmap_filename(struct multi_pack_index *midx)
>  
>  char *pack_bitmap_filename(struct packed_git *p)
>  {
> -	size_t len;
> -
> -	if (!strip_suffix(p->pack_name, ".pack", &len))
> -		BUG("pack_name does not end in .pack");
> -	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
> +	return pack_name_to_ext(p->pack_name, "bitmap");
>  }
>  
>  static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
> diff --git a/pack-revindex.c b/pack-revindex.c
> index 08dc1601679..69dc5688796 100644
> --- a/pack-revindex.c
> +++ b/pack-revindex.c
> @@ -179,10 +179,7 @@ static int create_pack_revindex_in_memory(struct packed_git *p)
>  
>  static char *pack_revindex_filename(struct packed_git *p)
>  {
> -	size_t len;
> -	if (!strip_suffix(p->pack_name, ".pack", &len))
> -		BUG("pack_name does not end in .pack");
> -	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
> +	return pack_name_to_ext(p->pack_name, "rev");
>  }
>  
>  #define RIDX_HEADER_SIZE (12)
> diff --git a/packfile.c b/packfile.c
> index 835b2d27164..bd6ad441bf5 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -191,15 +191,12 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
>  int open_pack_index(struct packed_git *p)
>  {
>  	char *idx_name;
> -	size_t len;
>  	int ret;
>  
>  	if (p->index_data)
>  		return 0;
>  
> -	if (!strip_suffix(p->pack_name, ".pack", &len))
> -		BUG("pack_name does not end in .pack");
> -	idx_name = xstrfmt("%.*s.idx", (int)len, p->pack_name);
> +	idx_name = pack_name_to_ext(p->pack_name, "idx");
>  	ret = check_packed_git_idx(idx_name, p);
>  	free(idx_name);
>  	return ret;
> @@ -2266,3 +2263,12 @@ int is_promisor_object(const struct object_id *oid)
>  	}
>  	return oidset_contains(&promisor_objects, oid);
>  }
> +
> +char *pack_name_to_ext(const char *pack_name, const char *ext)
> +{
> +	size_t len;
> +
> +	if (!strip_suffix(pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	return xstrfmt("%.*s.%s", (int)len, pack_name, ext);
> +}
> diff --git a/packfile.h b/packfile.h
> index a3f6723857b..6890c57ebdb 100644
> --- a/packfile.h
> +++ b/packfile.h
> @@ -195,4 +195,13 @@ int is_promisor_object(const struct object_id *oid);
>  int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
>  	     size_t idx_size, struct packed_git *p);
>  
> +/**
> + * Given a string like "foo.pack" and "ext" returns an xstrdup()'d
> + * "foo.ext" string. Used for creating e.g. PACK.{bitmap,rev,...}
> + * filenames from PACK.pack.
> + *
> + * Will BUG() if the expected string can't be created from the
> + * "pack_name" argument.
> + */
> +char *pack_name_to_ext(const char *pack_name, const char *ext);
>  #endif

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function
  2022-05-19 11:42     ` [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function Ævar Arnfjörð Bjarmason
@ 2022-05-19 15:50       ` Junio C Hamano
  2022-05-19 19:07         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-05-19 15:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Taylor Blau, derrickstolee, jrnieder, larsxschneider, tytso

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Add and use a hash_short_id_by_algo() function. As noted in the
> comment being modified here (added in [1]) the intention wasn't to
> have these end up in on-disk formats, but since [2], [3] and [3]

double [3]?

> that's been the case, and there's an outstanding patch to add another
> format that uses these[5].
>
> So let's expose this functionality as a documented utility function,
> instead of copy/pasting this code in various places.
>
> Replacing the die() in the existing functions with a BUG() might be
> overzealous, it's correct for the case of
> e.g. write_commit_graph_file() and write_midx_header(), but we also
> use this for parsing on-disk files, e.g. in parse_commit_graph().

If we know the offending data can come from outside, not from
literals in the code, then there is no "might be", such a use of
BUG() is simply wrong.

> We could add a "gently" version of this, but for now I think that
> worrying about the distinction would be worrying too much. If we ever
> end up parsing such files that'll almost certainly be a bug in our own
> writing code, so the distinction would be rather academic, even though

It would be a file written by an ancient buggy version of our code,
or a buggy third-party reimplementation of Git.  It could be that a
new version of Git is using a yet-to-be-invented algorithm this
version of Git does not know about.

The distinction matters in that "The version of Git I downloaded and
built last week out of the latest release tag said BUG" should mean
only one thing: that version that reports BUG() is the culprit, not
some random other thing we do not even know where it came from that
left corrupt data on disk.
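
One way to picture the split being argued for here is the sketch
below. It is plain C rather than Git code: the enum and function names
are hypothetical, and fprintf()+abort() stands in for BUG() while a
plain error return stands in for error()/die().

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the in-core hash algorithm constants. */
enum hash_kind { HASH_UNKNOWN = 0, HASH_SHA1 = 1, HASH_SHA256 = 2 };

/*
 * Writer side: "kind" comes from our own code, so an unexpected
 * value can only be a programming error in this binary. Aborting
 * loudly (what BUG() does in Git) is appropriate.
 */
static uint8_t hash_version_for_write(enum hash_kind kind)
{
	switch (kind) {
	case HASH_SHA1:
		return 1;
	case HASH_SHA256:
		return 2;
	default:
		fprintf(stderr, "BUG: invalid hash kind %d\n", (int)kind);
		abort();
	}
}

/*
 * Reader side: "on_disk" comes from a file that any program may
 * have written, so a mismatch is reported as an ordinary error
 * and the caller decides what to do next.
 */
static int check_hash_version_on_disk(uint8_t on_disk,
				      enum hash_kind expected)
{
	uint8_t want = hash_version_for_write(expected);

	if (on_disk != want) {
		fprintf(stderr,
			"error: hash version %u does not match version %u\n",
			(unsigned)on_disk, (unsigned)want);
		return -1;
	}
	return 0;
}
```

Only the first function may abort, and only on values that originate
in this program; the byte read from disk never reaches that path.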

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function
  2022-05-19 15:50       ` Junio C Hamano
@ 2022-05-19 19:07         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-19 19:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Taylor Blau, derrickstolee, jrnieder, larsxschneider, tytso


On Thu, May 19 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> Add and use a hash_short_id_by_algo() function. As noted in the
>> comment being modified here (added in [1]) the intention wasn't to
>> have these end up in on-disk formats, but since [2], [3] and [3]
>
> double [3]?

*nod*, sorry.

>> that's been the case, and there's an outstanding patch to add another
>> format that uses these[5].
>>
>> So let's expose this functionality as a documented utility function,
>> instead of copy/pasting this code in various places.
>>
>> Replacing the die() in the existing functions with a BUG() might be
>> overzealous, it's correct for the case of
>> e.g. write_commit_graph_file() and write_midx_header(), but we also
>> use this for parsing on-disk files, e.g. in parse_commit_graph().
>
> If we know the offending data can come from outside, not from
> literals in the code, then there is no "might be", such a use of
> BUG() is simply wrong.

Fair enough, we could change it to have 1/2 of this (the writing) use
BUG(), and die() for the other part, or just leave it as die() for both.

>> We could add a "gently" version of this, but for now I think that
>> worrying about the distinction would be worrying too much. If we ever
>> end up parsing such files that'll almost certainly be a bug in our own
>> writing code, so the distinction would be rather academic, even though
>
> It would be a file written by an ancient buggy version of our code,
> or a buggy third-party reimplementation of Git.  It could be that a
> new version of Git is using a yet-to-be-invented algorithm this
> version of Git does not know about.
>
> The distinction matters in that "The version of Git I downloaded and
> built last week out of the latest release tag said BUG" should mean
> only one thing: that version that reports BUG() is the culprit, not
> some random other thing we do not even know where it came from that
> left a corrupt data on disk.

Makes sense.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-19 15:21       ` Junio C Hamano
@ 2022-05-20  7:32         ` Ævar Arnfjörð Bjarmason
  2022-05-20 22:37           ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-20  7:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, git, derrickstolee, jrnieder, larsxschneider, tytso


On Thu, May 19 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Wed, May 18 2022, Taylor Blau wrote:
>>
>> Nit:
>>
>>> +  - A 4-byte magic number '0x4d544d45' ('MTME').
>>> +
>>> +  - A 4-byte version identifier (= 1).
>>> +
>>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>>
>> Here we let it suffice that later we'll say "All 4-byte numbers are in
>> network order".
>>
>>> +  - A table of 4-byte unsigned integers in network order. The ith
>>
>> But here we call out "network order" explicitly, shouldn't this just be
>> s/ in network order//?
>>
>>> +    value is the modification time (mtime) of the ith object in the
>>> +    corresponding pack by lexicographic (index) order. The mtimes
>>> +    count standard epoch seconds.
>>> +
>>> +  - A trailer, containing a checksum of the corresponding packfile,
>>> +    and a checksum of all of the above (each having length according
>>> +    to the specified hash function).
>>> +
>>> +All 4-byte numbers are in network order.
>>
>> I.e. this is sufficient.
>
> Very good eyes.  One explicit mention among several others can
> indeed be misleading the readers.
>
> When asked for "network order", all your search engines show are
> entries about "network byte order", so let's use that longer form of
> spelling.

*Nod*, note that "network order" is on "master" already though,
i.e. this section re-used a template introduced in 2f4ba2a867f
(packfile: prepare for the existence of '*.rev' files, 2021-01-25) just
above this hunk.

Before that change the rest of the file used "network byte order"
consistently.

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 02/17] pack-mtimes: support reading .mtimes files
  2022-05-20  7:32         ` Ævar Arnfjörð Bjarmason
@ 2022-05-20 22:37           ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 22:37 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, git, derrickstolee, jrnieder, larsxschneider,
	tytso

On Fri, May 20, 2022 at 09:32:50AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Thu, May 19 2022, Junio C Hamano wrote:
>
> > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> >
> >> On Wed, May 18 2022, Taylor Blau wrote:
> >>
> >> Nit:
> >>
> >>> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> >>> +
> >>> +  - A 4-byte version identifier (= 1).
> >>> +
> >>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> >>
> >> Here we let it suffice that later we'll say "All 4-byte numbers are in
> >> network order".
> >>
> >>> +  - A table of 4-byte unsigned integers in network order. The ith
> >>
> >> But here we call out "network order" explicitly, shouldn't this just be
> >> s/ in network order//?
> >>
> >>> +    value is the modification time (mtime) of the ith object in the
> >>> +    corresponding pack by lexicographic (index) order. The mtimes
> >>> +    count standard epoch seconds.
> >>> +
> >>> +  - A trailer, containing a checksum of the corresponding packfile,
> >>> +    and a checksum of all of the above (each having length according
> >>> +    to the specified hash function).
> >>> +
> >>> +All 4-byte numbers are in network order.
> >>
> >> I.e. this is sufficient.
> >
> > Very good eyes.  One explicit mention among several others can
> > indeed be misleading the readers.
> >
> > When asked for "network order", all your search engines show are
> > entries about "network byte order", so let's use that longer form of
> > spelling.
>
> *Nod*, note that "network order" is on "master" already though,
> i.e. this section re-used a template introduced in 2f4ba2a867f
> (packfile: prepare for the existence of '*.rev' files, 2021-01-25) just
> above this hunk.
>
> Before that change the rest of the file used "network byte order"
> consistently.

Hmm. e0d1bcf825 (multi-pack-index: add format details, 2018-07-12)
(which predates 2f4ba2a867f by a few years) introduced the first use of
"network order" as opposed to "network byte order".

I think it's worth cleaning this up, but let's do it in two parts.
I'll send a rerolled version of tb/cruft-packs that moves the "All
4-byte numbers are in network order" to the top of that section,
switching "network order" for "network byte order", and dropping other
mentions of "network [byte] order" from that section.

Then, we can come back later and perhaps do something like the following
(but I don't want to do it now and tie up this series with semi-related
cleanups):

--- 8< ---

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index b520aa9c45..2591a410fd 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -276,6 +276,8 @@ Pack file entry: <+

 == pack-*.rev files have the format:

+All 4-byte numbers are in network byte order.
+
   - A 4-byte magic number '0x52494458' ('RIDX').

   - A 4-byte version identifier (= 1).
@@ -283,8 +285,8 @@ Pack file entry: <+
   - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).

   - A table of index positions (one per packed object, num_objects in
-    total, each a 4-byte unsigned integer in network order), sorted by
-    their corresponding offsets in the packfile.
+    total, each a 4-byte unsigned integer), sorted by their
+    corresponding offsets in the packfile.

   - A trailer, containing a:

@@ -292,8 +294,6 @@ Pack file entry: <+

     a checksum of all of the above.

-All 4-byte numbers are in network order.
-
 == pack-*.mtimes files have the format:

 All 4-byte numbers are in network byte order.
@@ -322,7 +322,7 @@ the body into "chunks" and provide a lookup table at the beginning of the
 body. The header includes certain length values, such as the number of packs,
 the number of base MIDX files, hash lengths and types.

-All 4-byte numbers are in network order.
+All 4-byte numbers are in network byte order.

 HEADER:

@@ -397,8 +397,8 @@ CHUNK DATA:

 	[Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})
 	    A list of MIDX positions (one per object in the MIDX, num_objects in
-	    total, each a 4-byte unsigned integer in network byte order), sorted
-	    according to their relative bitmap/pseudo-pack positions.
+	    total, each a 4-byte unsigned integer), sorted according to their
+	    relative bitmap/pseudo-pack positions.

 TRAILER:


--- >8 ---

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack
  2022-05-19 11:29     ` Ævar Arnfjörð Bjarmason
@ 2022-05-20 22:39       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 22:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso

On Thu, May 19, 2022 at 01:29:26PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, May 18 2022, Taylor Blau wrote:
>
> > +		tip="$(git rev-parse cruft)" &&
>
> Here we don't hide the exit status of "git", as it'll be reflected in what's &&-chained.

Oops! Nice catch.

> > +		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
>
> But here we do, as we'll get the exit status of test_oid_to_path. But as
> we just rev parsed it shouldn't this be $tip in any case?

Indeed, this can just be:

    path="$objdir/$(test_oid_to_path "$tip")" &&

Will include in a reroll shortly.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-05-19 11:32     ` Ævar Arnfjörð Bjarmason
@ 2022-05-20 22:42       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 22:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso

On Thu, May 19, 2022 at 01:32:26PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > @@ -161,8 +162,11 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
> >  		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
> >  		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
> >  			string_list_append_nodup(fname_kept_list, fname);
> > -		else
> > -			string_list_append_nodup(fname_nonkept_list, fname);
> > +		else {
> > +			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
>
> Nit: very long line, and we end up with {} just on the else, not the if.

Thanks for spotting, I split this line up and added braces to the other
half of this conditional.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-19 15:16       ` Junio C Hamano
@ 2022-05-20 22:52         ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 22:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, avarab, derrickstolee, jrnieder, larsxschneider, tytso

On Thu, May 19, 2022 at 08:16:49AM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> >> @@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
> >>  	return 0;
> >>  }
> >>
> >> +static int option_parse_cruft_expiration(const struct option *opt,
> >> +					 const char *arg, int unset)
> >> +{
> >> +	if (unset) {
> >> +		cruft = 0;
> >> +		cruft_expiration = 0;
> >> +	} else {
> >> +		cruft = 1;
> >> +		if (arg)
> >> +			cruft_expiration = approxidate(arg);
> >> +	}
> >> +	return 0;
> >> +}
> >
> > It is somewhat sad that we have to invent this function, instead of
> > using parse_opt_expiry_date_cb().
>
> I failed to mention that this one does more than the bog-standard
> callback so the latter cannot be reused as-is, and that is what I
> meant by "somewhat sad".  If we can find a way to reuse the
> parse_opt_expiry_date_cb() for the purpose of the user of this
> function that would be ideal, but only if we can do so without
> making the caller too unnatural.  Having two separate values, "did
> we get --cruft-expiration option?" and "what's the value of it?",
> does benefit the current caller and we do not want to twist it just
> for not adding a similar callback---that's a tail wagging a dog.

I agree, though I'm not sure such a cleanup is possible: if the caller
specified `--cruft-expiration=never`, how would we distinguish that from
"the caller did not want to generate cruft packs"?

In that case, approxidate() would set our `cruft_expiration` variable to
`0` there, which makes "generate a cruft pack without expiring any
objects" indistinguishable from "do not generate a cruft pack" without
the additional bit of information stored in the "cruft" variable.

For now we're just recycling the pattern from the callback immediately
above this one: option_parse_unpack_unreachable().
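
To make the ambiguity concrete, here is a minimal standalone sketch (not
the code in the patch). The `sketch_` names are hypothetical, and
sketch_parse_date() is only a stand-in for approxidate() that
understands "never" and plain decimal seconds:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_cruft_opts {
	int cruft;                      /* was a cruft pack requested at all? */
	unsigned long cruft_expiration; /* 0 means "never expire" */
};

/* Hypothetical stand-in for approxidate(): "never" maps to 0. */
static unsigned long sketch_parse_date(const char *arg)
{
	unsigned long v = 0;

	if (!strcmp(arg, "never"))
		return 0;
	for (; *arg >= '0' && *arg <= '9'; arg++)
		v = v * 10 + (unsigned long)(*arg - '0');
	return v;
}

/* Same shape as the callback quoted above, minus the parse-options glue. */
static void sketch_parse_cruft_expiration(struct sketch_cruft_opts *opts,
					  const char *arg, int unset)
{
	assert(opts != NULL);

	if (unset) {
		opts->cruft = 0;
		opts->cruft_expiration = 0;
	} else {
		opts->cruft = 1;
		if (arg)
			opts->cruft_expiration = sketch_parse_date(arg);
	}
}
```

With only the expiration value, "option not given" and
`--cruft-expiration=never` both end up as 0; the separate `cruft` bit is
what tells them apart.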

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* [PATCH v5 00/17] cruft packs
  2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
                   ` (20 preceding siblings ...)
  2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
@ 2022-05-20 23:17 ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
                     ` (17 more replies)
  21 siblings, 18 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Here is another reroll of my series to implement "cruft packs". This is really
more like v4.1, since it has only cosmetic changes incorporated from the review
on v4 of this topic.

Since last time:

  - The new section in pack-format.txt (describing the ".mtimes" format) now
    says at the top "all 4-byte numbers are in network byte order", and avoids
    repeating "network [byte] order" throughout that section to reduce
    confusion.

  - A sub-shell in t5329 which incorrectly masked over the exit code of a "git"
    process was removed.

  - An overly-long line in builtin/repack.c::collect_pack_filenames() was
    split up, and matching braces were added.

...and that's pretty much it. In any case, a range-diff is included below.
Thanks again for all of the thoughtful feedback on this series.

Taylor Blau (17):
  Documentation/technical: add cruft-packs.txt
  pack-mtimes: support reading .mtimes files
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  chunk-format.h: extract oid_version()
  pack-mtimes: support writing pack .mtimes files
  t/helper: add 'pack-mtimes' test-tool
  builtin/pack-objects.c: return from create_object_entry()
  builtin/pack-objects.c: --cruft without expiration
  reachable: add options to add_unseen_recent_objects_to_traversal
  reachable: report precise timestamps from objects in cruft packs
  builtin/pack-objects.c: --cruft with expiration
  builtin/repack.c: support generating a cruft pack
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/gc.c: conditionally avoid pruning objects via loose
  sha1-file.c: don't freshen cruft packs

 Documentation/Makefile                  |   1 +
 Documentation/config/gc.txt             |  21 +-
 Documentation/config/repack.txt         |   9 +
 Documentation/git-gc.txt                |   5 +
 Documentation/git-pack-objects.txt      |  30 +
 Documentation/git-repack.txt            |  11 +
 Documentation/technical/cruft-packs.txt | 123 ++++
 Documentation/technical/pack-format.txt |  19 +
 Makefile                                |   2 +
 builtin/gc.c                            |  10 +-
 builtin/pack-objects.c                  | 304 +++++++++-
 builtin/repack.c                        | 185 +++++-
 bulk-checkin.c                          |   2 +-
 chunk-format.c                          |  12 +
 chunk-format.h                          |   3 +
 commit-graph.c                          |  18 +-
 midx.c                                  |  18 +-
 object-file.c                           |   4 +-
 object-store.h                          |   7 +-
 pack-mtimes.c                           | 126 ++++
 pack-mtimes.h                           |  15 +
 pack-objects.c                          |   6 +
 pack-objects.h                          |  25 +
 pack-write.c                            |  93 ++-
 pack.h                                  |   4 +
 packfile.c                              |  19 +-
 reachable.c                             |  58 +-
 reachable.h                             |   9 +-
 t/helper/test-pack-mtimes.c             |  56 ++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
 32 files changed, 1834 insertions(+), 102 deletions(-)
 create mode 100644 Documentation/technical/cruft-packs.txt
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
 create mode 100644 t/helper/test-pack-mtimes.c
 create mode 100755 t/t5329-pack-objects-cruft.sh

Range-diff against v4:
 -:  ---------- >  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
 1:  8f9fd21be9 !  2:  91a9d21b0b pack-mtimes: support reading .mtimes files
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
      
     +== pack-*.mtimes files have the format:
     +
    ++All 4-byte numbers are in network byte order.
    ++
     +  - A 4-byte magic number '0x4d544d45' ('MTME').
     +
     +  - A 4-byte version identifier (= 1).
     +
     +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of 4-byte unsigned integers in network order. The ith
    -+    value is the modification time (mtime) of the ith object in the
    -+    corresponding pack by lexicographic (index) order. The mtimes
    -+    count standard epoch seconds.
    ++  - A table of 4-byte unsigned integers. The ith value is the
    ++    modification time (mtime) of the ith object in the corresponding
    ++    pack by lexicographic (index) order. The mtimes count standard
    ++    epoch seconds.
     +
     +  - A trailer, containing a checksum of the corresponding packfile,
     +    and a checksum of all of the above (each having length according
     +    to the specified hash function).
    -+
    -+All 4-byte numbers are in network order.
     +
      == multi-pack-index (MIDX) files have the following format:
      
 2:  cdb21236e1 =  3:  67c4e7209d pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
 3:  1d775f9850 =  4:  fc86506881 chunk-format.h: extract oid_version()
 4:  6172861bd9 =  5:  788d1f96f2 pack-mtimes: support writing pack .mtimes files
 5:  5f9a9a5b7b =  6:  2a6cfb00bf t/helper: add 'pack-mtimes' test-tool
 6:  b8a38fe2e4 =  7:  edb6fcd5ec builtin/pack-objects.c: return from create_object_entry()
 7:  94fe03cc65 =  8:  e3185741f2 builtin/pack-objects.c: --cruft without expiration
 8:  da7273f41f =  9:  1cf00d462c reachable: add options to add_unseen_recent_objects_to_traversal
 9:  58fecd1747 = 10:  d66be44d9a reachable: report precise timestamps from objects in cruft packs
10:  1740b8ef01 = 11:  1434e37623 builtin/pack-objects.c: --cruft with expiration
11:  5992a72cbf ! 12:  0d3555d595 builtin/repack.c: support generating a cruft pack
    @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
     +		git repack &&
     +
     +		tip="$(git rev-parse cruft)" &&
    -+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
    ++		path="$objdir/$(test_oid_to_path "$tip")" &&
     +		test-tool chmtime --get +1000 "$path" >expect &&
     +
     +		git checkout main &&
12:  1b241f8f91 = 13:  4b721d3ee9 builtin/repack.c: allow configuring cruft pack generation
13:  ffae78852c = 14:  f9e3ab56b1 builtin/repack.c: use named flags for existing_packs
14:  0743e373ba ! 15:  e9f46e7b5e builtin/repack.c: add cruft packs to MIDX during geometric repack
    @@ builtin/repack.c
      static int pack_everything;
      static int delta_base_offset = 1;
     @@ builtin/repack.c: static void collect_pack_filenames(struct string_list *fname_nonkept_list,
    + 		fname = xmemdupz(e->d_name, len);
    + 
      		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
    - 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
    +-		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
    ++		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
      			string_list_append_nodup(fname_kept_list, fname);
     -		else
     -			string_list_append_nodup(fname_nonkept_list, fname);
    -+		else {
    -+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
    ++		} else {
    ++			struct string_list_item *item;
    ++			item = string_list_append_nodup(fname_nonkept_list,
    ++							fname);
     +			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
     +				item->util = (void*)(uintptr_t)CRUFT_PACK;
     +		}
15:  9f7e0acac6 = 16:  43c14eec07 builtin/gc.c: conditionally avoid pruning objects via loose
16:  07fa9d4b47 = 17:  1e313b89e8 sha1-file.c: don't freshen cruft packs
-- 
2.36.1.94.gb0d54bedca

^ permalink raw reply	[flat|nested] 201+ messages in thread

* [PATCH v5 01/17] Documentation/technical: add cruft-packs.txt
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |   1 +
 Documentation/technical/cruft-packs.txt | 123 ++++++++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt
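
As a rough illustration of why the document insists on per-object mtimes
(a sketch under my own assumptions, not code from this series; the
`sketch_` helpers are hypothetical), compare expiring by individual
mtimes with expiring by a single pack-wide mtime:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An object is still protected if it was modified within the grace period. */
static int sketch_within_grace(uint32_t mtime, uint32_t now, uint32_t grace)
{
	return mtime >= now || now - mtime < grace;
}

/* Per-object mtimes (what a .mtimes file records): count prunable objects. */
static size_t sketch_count_expired(const uint32_t *mtimes, size_t nr,
				   uint32_t now, uint32_t grace)
{
	size_t i, expired = 0;

	assert(mtimes != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (!sketch_within_grace(mtimes[i], now, grace))
			expired++;
	return expired;
}

/* One mtime for the whole pack: every object expires together, or none do. */
static size_t sketch_count_expired_by_pack_mtime(uint32_t pack_mtime, size_t nr,
						 uint32_t now, uint32_t grace)
{
	return sketch_within_grace(pack_mtime, now, grace) ? 0 : nr;
}
```

If the pack's mtime is freshened whenever any object in it is rewritten,
the second function keeps returning 0, which is the "never leave the
expiration window" situation the Background section describes.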

diff --git a/Documentation/Makefile b/Documentation/Makefile
index adb2f1b50a..2faffb52ab 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -94,6 +94,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..c0f583cd48
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,123 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all unreachable objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack, using
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which is (a) not contained in a
+     kept-pack, and (b) whose mtime is within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Caution for mixed-version environments
+
+Repositories that have cruft packs in them will continue to work with any older
+version of Git. Note, however, that previous versions of Git which do not
+understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
+all of the objects in it. In other words, do not expect older (pre-cruft pack)
+versions of Git to interpret or even read the contents of the `.mtimes` file.
+
+Note that having mixed versions of Git GC-ing the same repository can lead to
+unreachable objects never being completely pruned. This can happen under the
+following circumstances:
+
+  - An older version of Git running GC explodes the contents of an existing
+    cruft pack loose, using the cruft pack's mtime.
+  - A newer version running GC collects those loose objects into a cruft pack,
+    where the `.mtimes` file reflects the loose objects' actual mtimes, but the
+    cruft pack mtime is "now".
+
+Repeating this process will lead to unreachable objects not getting pruned as a
+result of repeatedly resetting the objects' mtimes to the present time.
+
+If you are GC-ing repositories in a mixed version environment, consider omitting
+the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
+leaving the `gc.cruftPacks` configuration unset until all writers understand
+cruft packs.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
-- 
2.36.1.94.gb0d54bedca


^ permalink raw reply related	[flat|nested] 201+ messages in thread

* [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-24 19:32     ` Jonathan Nieder
                       ` (2 more replies)
  2022-05-20 23:17   ` [PATCH v5 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
                     ` (15 subsequent siblings)
  17 siblings, 3 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

To store the individual mtimes of objects in a cruft pack, introduce a
new `.mtimes` format that can optionally accompany a single pack in the
repository.

The format is defined in Documentation/technical/pack-format.txt, and
stores a 4-byte network order timestamp for each object in name (index)
order.

This patch prepares for cruft packs by defining the `.mtimes` format,
and introducing a basic API that callers can use to read out individual
mtimes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  19 ++++
 Makefile                                |   1 +
 builtin/repack.c                        |   1 +
 object-store.h                          |   5 +-
 pack-mtimes.c                           | 126 ++++++++++++++++++++++++
 pack-mtimes.h                           |  15 +++
 packfile.c                              |  19 +++-
 7 files changed, 183 insertions(+), 3 deletions(-)
 create mode 100644 pack-mtimes.c
 create mode 100644 pack-mtimes.h
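
For readers who want to poke at the format outside of Git, here is a
self-contained sketch of the layout described in pack-format.txt. It is
not the implementation in this patch: the `sketch_` names are
hypothetical, it works on an in-memory buffer instead of an mmap, it
does not guard the size multiplication against overflow the way
st_mult() does, and it does not verify the trailer checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u /* "MTME" */
#define SKETCH_HEADER_WORDS 3 /* magic, version, hash function id */

/* Read one 4-byte integer stored in network byte order (big-endian). */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Check the header and overall size of a .mtimes buffer. hash_len is
 * the raw hash size (20 for SHA-1, 32 for SHA-256). Returns 0 if the
 * buffer looks well-formed, -1 otherwise.
 */
static int sketch_check_mtimes(const unsigned char *buf, size_t len,
			       uint32_t num_objects, size_t hash_len)
{
	size_t min_size = 4 * SKETCH_HEADER_WORDS + 2 * hash_len;
	uint32_t hash_id;

	if (len < min_size)
		return -1; /* too small for header plus two checksums */
	if (len - min_size != (size_t)num_objects * 4)
		return -1; /* table must hold one entry per object */
	if (sketch_get_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (sketch_get_be32(buf + 4) != 1)
		return -1; /* only version 1 is defined */
	hash_id = sketch_get_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;
	return 0;
}

/* Return the mtime of the object at index position pos. */
static uint32_t sketch_nth_mtime(const unsigned char *buf,
				 uint32_t num_objects, uint32_t pos)
{
	assert(pos < num_objects);
	return sketch_get_be32(buf + 4 * (SKETCH_HEADER_WORDS + (size_t)pos));
}
```

The table lookup skips the three header words, which corresponds to the
`+ 3` in nth_packed_mtime() below.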

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 6d3efb7d16..b520aa9c45 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -294,6 +294,25 @@ Pack file entry: <+
 
 All 4-byte numbers are in network order.
 
+== pack-*.mtimes files have the format:
+
+All 4-byte numbers are in network byte order.
+
+  - A 4-byte magic number '0x4d544d45' ('MTME').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of 4-byte unsigned integers. The ith value is the
+    modification time (mtime) of the ith object in the corresponding
+    pack by lexicographic (index) order. The mtimes count standard
+    epoch seconds.
+
+  - A trailer, containing a checksum of the corresponding packfile,
+    and a checksum of all of the above (each having length according
+    to the specified hash function).
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Makefile b/Makefile
index 61aadf3ce8..a299580b7c 100644
--- a/Makefile
+++ b/Makefile
@@ -993,6 +993,7 @@ LIB_OBJS += oidtree.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-mtimes.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
diff --git a/builtin/repack.c b/builtin/repack.c
index d1a563d5b6..e7a3920c6d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -217,6 +217,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".rev", 1},
+	{".mtimes", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 	{".idx"},
diff --git a/object-store.h b/object-store.h
index 53996018c1..2c4671ed7a 100644
--- a/object-store.h
+++ b/object-store.h
@@ -115,12 +115,15 @@ struct packed_git {
 		 freshened:1,
 		 do_not_close:1,
 		 pack_promisor:1,
-		 multi_pack_index:1;
+		 multi_pack_index:1,
+		 is_cruft:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	const uint32_t *mtimes_map;
+	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-mtimes.c b/pack-mtimes.c
new file mode 100644
index 0000000000..46ad584af1
--- /dev/null
+++ b/pack-mtimes.c
@@ -0,0 +1,126 @@
+#include "pack-mtimes.h"
+#include "object-store.h"
+#include "packfile.h"
+
+static char *pack_mtimes_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
+	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
+}
+
+#define MTIMES_HEADER_SIZE (12)
+#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct mtimes_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_pack_mtimes_file(char *mtimes_file,
+				 uint32_t num_objects,
+				 const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t mtimes_size;
+	struct mtimes_header header;
+	uint32_t *hdr;
+
+	fd = git_open(mtimes_file);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), mtimes_file);
+		goto cleanup;
+	}
+
+	mtimes_size = xsize_t(st.st_size);
+
+	if (mtimes_size < MTIMES_MIN_SIZE) {
+		ret = error(_("mtimes file %s is too small"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
+	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+	header.signature = ntohl(hdr[0]);
+	header.version = ntohl(hdr[1]);
+	header.hash_id = ntohl(hdr[2]);
+
+	if (header.signature != MTIMES_SIGNATURE) {
+		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
+		goto cleanup;
+	}
+
+	if (header.version != 1) {
+		ret = error(_("mtimes file %s has unsupported version %"PRIu32),
+			    mtimes_file, header.version);
+		goto cleanup;
+	}
+
+	if (!(header.hash_id == 1 || header.hash_id == 2)) {
+		ret = error(_("mtimes file %s has unsupported hash id %"PRIu32),
+			    mtimes_file, header.hash_id);
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, mtimes_size);
+	} else {
+		*len_p = mtimes_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+int load_pack_mtimes(struct packed_git *p)
+{
+	char *mtimes_name = NULL;
+	int ret = 0;
+
+	if (!p->is_cruft)
+		return ret; /* not a cruft pack */
+	if (p->mtimes_map)
+		return ret; /* already loaded */
+
+	ret = open_pack_index(p);
+	if (ret < 0)
+		goto cleanup;
+
+	mtimes_name = pack_mtimes_filename(p);
+	ret = load_pack_mtimes_file(mtimes_name,
+				    p->num_objects,
+				    &p->mtimes_map,
+				    &p->mtimes_size);
+cleanup:
+	free(mtimes_name);
+	return ret;
+}
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
+{
+	if (!p->mtimes_map)
+		BUG("pack .mtimes file not loaded for %s", p->pack_name);
+	if (p->num_objects <= pos)
+		BUG("pack .mtimes out-of-bounds (%"PRIu32" vs %"PRIu32")",
+		    pos, p->num_objects);
+
+	return get_be32(p->mtimes_map + pos + 3);
+}
diff --git a/pack-mtimes.h b/pack-mtimes.h
new file mode 100644
index 0000000000..38ddb9f893
--- /dev/null
+++ b/pack-mtimes.h
@@ -0,0 +1,15 @@
+#ifndef PACK_MTIMES_H
+#define PACK_MTIMES_H
+
+#include "git-compat-util.h"
+
+#define MTIMES_SIGNATURE 0x4d544d45 /* "MTME" */
+#define MTIMES_VERSION 1
+
+struct packed_git;
+
+int load_pack_mtimes(struct packed_git *p);
+
+uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
+
+#endif
diff --git a/packfile.c b/packfile.c
index 835b2d2716..fc0245fbab 100644
--- a/packfile.c
+++ b/packfile.c
@@ -334,12 +334,22 @@ static void close_pack_revindex(struct packed_git *p)
 	p->revindex_data = NULL;
 }
 
+static void close_pack_mtimes(struct packed_git *p)
+{
+	if (!p->mtimes_map)
+		return;
+
+	munmap((void *)p->mtimes_map, p->mtimes_size);
+	p->mtimes_map = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
 	close_pack_revindex(p);
+	close_pack_mtimes(p);
 	oidset_clear(&p->bad_objects);
 }
 
@@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_promisor = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
+	if (!access(p->pack_name, F_OK))
+		p->is_cruft = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
-	    ends_with(file_name, ".promisor"))
+	    ends_with(file_name, ".promisor") ||
+	    ends_with(file_name, ".mtimes"))
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
-- 
2.36.1.94.gb0d54bedca
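
A note for readers of the patch above: the checks in
load_pack_mtimes_file() imply a simple on-disk layout, namely a 12-byte
header (signature, version, hash id), one big-endian 32-bit mtime per
object in index order, and a trailer of two hashes. The standalone
sketch below restates those checks outside of Git. It is only an
illustration, not code from the series: the sketch_* names are
hypothetical, and the hash_rawsz parameter stands in for
the_hash_algo->rawsz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIGNATURE 0x4d544d45u /* "MTME", as in pack-mtimes.h */
#define SKETCH_HEADER_SIZE 12

/* Read a big-endian 32-bit value without relying on host byte order. */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return 0 if "buf" looks like a version 1 .mtimes file for a pack
 * with "num_objects" objects, or -1 otherwise.
 */
static int sketch_mtimes_check(const unsigned char *buf, size_t len,
			       uint32_t num_objects, size_t hash_rawsz)
{
	size_t min_size = SKETCH_HEADER_SIZE + 2 * hash_rawsz;
	size_t body;
	uint32_t hash_id;

	if (len < min_size)
		return -1; /* too small to hold header and trailer */

	body = len - min_size;
	if (body % 4 || body / 4 != num_objects)
		return -1; /* not exactly one 4-byte mtime per object */

	if (sketch_get_be32(buf) != SKETCH_SIGNATURE)
		return -1; /* unknown signature */
	if (sketch_get_be32(buf + 4) != 1)
		return -1; /* unsupported version */
	hash_id = sketch_get_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1; /* neither SHA-1 (1) nor SHA-256 (2) */
	return 0;
}

/* Look up the mtime of the object at index position "pos". */
static uint32_t sketch_nth_mtime(const unsigned char *buf,
				 uint32_t num_objects, uint32_t pos)
{
	assert(pos < num_objects);
	return sketch_get_be32(buf + SKETCH_HEADER_SIZE + 4 * (size_t)pos);
}
```

With a 20-byte hash and two objects, for example, a valid file would be
12 + 2 * 4 + 2 * 20 = 60 bytes long.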



* [PATCH v5 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 04/17] chunk-format.h: extract oid_version() Taylor Blau
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

This structure will be used to communicate the per-object mtimes when
writing a cruft pack. Here, we need the full packing_data structure
because the mtime information is stored in an array there, not on the
individual object_entry's themselves (to avoid paying the overhead in
structure width for operations which do not generate a cruft pack).

We haven't passed this information down before because one of the two
callers (in bulk-checkin.c) does not have a packing_data structure at
all. In that case (where no cruft pack will be generated), NULL is
passed instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 3 ++-
 bulk-checkin.c         | 2 +-
 pack-write.c           | 1 +
 pack.h                 | 3 +++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 014dcd4bc9..6ac927047c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1262,7 +1262,8 @@ static void write_pack_file(void)
 
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
-					    &pack_idx_opts, hash, &idx_tmp_name);
+					    &to_pack, &pack_idx_opts, hash,
+					    &idx_tmp_name);
 
 			if (write_bitmap_index) {
 				size_t tmpname_len = tmpname.len;
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c..e988a388b6 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -33,7 +33,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
 	char *idx_tmp_name = NULL;
 
 	stage_tmp_packfiles(basename, pack_tmp_name, written_list, nr_written,
-			    pack_idx_opts, hash, &idx_tmp_name);
+			    NULL, pack_idx_opts, hash, &idx_tmp_name);
 	rename_tmp_packfile_idx(basename, &idx_tmp_name);
 
 	free(idx_tmp_name);
diff --git a/pack-write.c b/pack-write.c
index 51812cb129..a2adc565f4 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -484,6 +484,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
diff --git a/pack.h b/pack.h
index b22bfc4a18..fd27cfdfd7 100644
--- a/pack.h
+++ b/pack.h
@@ -109,11 +109,14 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 int read_pack_header(int fd, struct pack_header *);
 
+struct packing_data;
+
 struct hashfile *create_tmp_packfile(char **pack_tmp_name);
 void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 const char *pack_tmp_name,
 			 struct pack_idx_entry **written_list,
 			 uint32_t nr_written,
+			 struct packing_data *to_pack,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[],
 			 char **idx_tmp_name);
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 04/17] chunk-format.h: extract oid_version()
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (2 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 chunk-format.c | 12 ++++++++++++
 chunk-format.h |  3 +++
 commit-graph.c | 18 +++---------------
 midx.c         | 18 +++---------------
 pack-write.c   | 15 ++-------------
 5 files changed, 23 insertions(+), 43 deletions(-)
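
To illustrate the widening remark in the commit message: a single
helper returning an unsigned byte can feed both the one-byte header
fields and the four-byte .rev field, because the value is promoted when
it is passed to a 32-bit parameter. The sketch below is a standalone
approximation under stated assumptions. The enum and the sketch_* names
are hypothetical, and returning 0 for an unknown algorithm is a
simplification (the function in this patch calls die() instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's hash algorithm identifiers. */
enum sketch_hash_algo {
	SKETCH_HASH_UNKNOWN,
	SKETCH_HASH_SHA1,
	SKETCH_HASH_SHA256
};

/* Map a hash algorithm to the version number stored in file headers. */
static uint8_t sketch_oid_version(enum sketch_hash_algo algo)
{
	switch (algo) {
	case SKETCH_HASH_SHA1:
		return 1;
	case SKETCH_HASH_SHA256:
		return 2;
	default:
		return 0; /* simplification: the real helper dies here */
	}
}

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_put_be32(unsigned char *out, uint32_t v)
{
	assert(out != NULL);
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}
```

Calling sketch_put_be32(out, sketch_oid_version(SKETCH_HASH_SHA256))
needs no cast: the uint8_t result is converted to uint32_t at the call
site.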

diff --git a/chunk-format.c b/chunk-format.c
index 1c3dca62e2..0275b74a89 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -181,3 +181,15 @@ int read_chunk(struct chunkfile *cf,
 
 	return CHUNK_NOT_FOUND;
 }
+
+uint8_t oid_version(const struct git_hash_algo *algop)
+{
+	switch (hash_algo_by_ptr(algop)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		die(_("invalid hash version"));
+	}
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9ccbe00377..7885aa0848 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -2,6 +2,7 @@
 #define CHUNK_FORMAT_H
 
 #include "git-compat-util.h"
+#include "hash.h"
 
 struct hashfile;
 struct chunkfile;
@@ -65,4 +66,6 @@ int read_chunk(struct chunkfile *cf,
 	       chunk_read_fn fn,
 	       void *data);
 
+uint8_t oid_version(const struct git_hash_algo *algop);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 06107beedc..066d82ed6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@ char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		return NULL;
 	}
 
@@ -1924,7 +1912,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/midx.c b/midx.c
index 3db0e47735..c617c51cd0 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != oid_version(the_hash_algo)) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, oid_version(the_hash_algo));
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@ static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, oid_version(the_hash_algo));
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index a2adc565f4..27b171e440 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -2,6 +2,7 @@
 #include "pack.h"
 #include "csum-file.h"
 #include "remote.h"
+#include "chunk-format.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -181,21 +182,9 @@ static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
-
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);
-	hashwrite_be32(f, oid_version);
+	hashwrite_be32(f, oid_version(the_hash_algo));
 }
 
 static void write_rev_index_positions(struct hashfile *f,
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 05/17] pack-mtimes: support writing pack .mtimes files
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (3 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 04/17] chunk-format.h: extract oid_version() Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing a packing_data structure that
contains the timestamps corresponding to each object in the pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-objects.c |  6 ++++
 pack-objects.h | 25 ++++++++++++++++
 pack-write.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack.h         |  1 +
 4 files changed, 109 insertions(+)
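
The per-object mtimes in this patch live in a side array that runs
parallel to the object array: it is allocated lazily on the first
store, indexed by the entry's offset from the start of the object
array, and grown in step with that array. The following standalone
sketch shows that pattern in isolation. All sketch_* names and the
simplified structures are hypothetical stand-ins, and error handling is
reduced to abort().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_entry {
	uint32_t hash; /* placeholder payload */
};

struct sketch_packing_data {
	struct sketch_entry *objects;
	uint32_t nr_objects, nr_alloc;
	uint32_t *cruft_mtime; /* NULL until the first mtime is stored */
};

/* Append an entry, growing the side array only if it already exists. */
static struct sketch_entry *sketch_alloc_entry(struct sketch_packing_data *pd)
{
	struct sketch_entry *e;

	if (pd->nr_objects >= pd->nr_alloc) {
		uint32_t nr_alloc = pd->nr_alloc ? pd->nr_alloc * 2 : 4;

		pd->objects = realloc(pd->objects,
				      nr_alloc * sizeof(*pd->objects));
		if (!pd->objects)
			abort();
		if (pd->cruft_mtime) {
			pd->cruft_mtime = realloc(pd->cruft_mtime,
						  nr_alloc * sizeof(*pd->cruft_mtime));
			if (!pd->cruft_mtime)
				abort();
		}
		pd->nr_alloc = nr_alloc;
	}
	if (pd->cruft_mtime)
		pd->cruft_mtime[pd->nr_objects] = 0;
	e = &pd->objects[pd->nr_objects++];
	e->hash = 0;
	return e;
}

/* Objects without a recorded mtime read back as 0. */
static uint32_t sketch_cruft_mtime(const struct sketch_packing_data *pd,
				   const struct sketch_entry *e)
{
	assert(e >= pd->objects && e < pd->objects + pd->nr_objects);
	if (!pd->cruft_mtime)
		return 0;
	return pd->cruft_mtime[e - pd->objects];
}

/* The first store allocates a zeroed array sized like the object array. */
static void sketch_set_cruft_mtime(struct sketch_packing_data *pd,
				   const struct sketch_entry *e, uint32_t mtime)
{
	assert(e >= pd->objects && e < pd->objects + pd->nr_objects);
	if (!pd->cruft_mtime) {
		pd->cruft_mtime = calloc(pd->nr_alloc, sizeof(*pd->cruft_mtime));
		if (!pd->cruft_mtime)
			abort();
	}
	pd->cruft_mtime[e - pd->objects] = mtime;
}
```

The design keeps ordinary (non-cruft) packing runs from paying for the
extra field: as long as nothing stores an mtime, the pointer stays NULL
and every lookup returns 0. Note that entry pointers are invalidated
when the object array grows, so callers in this sketch should re-derive
them from the array after appending.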

diff --git a/pack-objects.c b/pack-objects.c
index fe2a4eace9..272e8d4517 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -170,6 +170,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 		if (pdata->layer)
 			REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
+
+		if (pdata->cruft_mtime)
+			REALLOC_ARRAY(pdata->cruft_mtime, pdata->nr_alloc);
 	}
 
 	new_entry = pdata->objects + pdata->nr_objects++;
@@ -198,6 +201,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 	if (pdata->layer)
 		pdata->layer[pdata->nr_objects - 1] = 0;
 
+	if (pdata->cruft_mtime)
+		pdata->cruft_mtime[pdata->nr_objects - 1] = 0;
+
 	return new_entry;
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index dca2351ef9..393b9db546 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -168,6 +168,14 @@ struct packing_data {
 	/* delta islands */
 	unsigned int *tree_depth;
 	unsigned char *layer;
+
+	/*
+	 * Used when writing cruft packs.
+	 *
+	 * Object mtimes are stored in pack order when writing, but
+	 * written out in lexicographic (index) order.
+	 */
+	uint32_t *cruft_mtime;
 };
 
 void prepare_packing_data(struct repository *r, struct packing_data *pdata);
@@ -289,4 +297,21 @@ static inline void oe_set_layer(struct packing_data *pack,
 	pack->layer[e - pack->objects] = layer;
 }
 
+static inline uint32_t oe_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e)
+{
+	if (!pack->cruft_mtime)
+		return 0;
+	return pack->cruft_mtime[e - pack->objects];
+}
+
+static inline void oe_set_cruft_mtime(struct packing_data *pack,
+				      struct object_entry *e,
+				      uint32_t mtime)
+{
+	if (!pack->cruft_mtime)
+		CALLOC_ARRAY(pack->cruft_mtime, pack->nr_alloc);
+	pack->cruft_mtime[e - pack->objects] = mtime;
+}
+
 #endif
diff --git a/pack-write.c b/pack-write.c
index 27b171e440..23c0342018 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -3,6 +3,10 @@
 #include "csum-file.h"
 #include "remote.h"
 #include "chunk-format.h"
+#include "pack-mtimes.h"
+#include "oidmap.h"
+#include "chunk-format.h"
+#include "pack-objects.h"
 
 void reset_pack_idx_option(struct pack_idx_option *opts)
 {
@@ -277,6 +281,70 @@ const char *write_rev_file_order(const char *rev_name,
 	return rev_name;
 }
 
+static void write_mtimes_header(struct hashfile *f)
+{
+	hashwrite_be32(f, MTIMES_SIGNATURE);
+	hashwrite_be32(f, MTIMES_VERSION);
+	hashwrite_be32(f, oid_version(the_hash_algo));
+}
+
+/*
+ * Writes the object mtimes of "objects" for use in a .mtimes file.
+ * Note that objects must be in lexicographic (index) order, which is
+ * the expected ordering of these values in the .mtimes file.
+ */
+static void write_mtimes_objects(struct hashfile *f,
+				 struct packing_data *to_pack,
+				 struct pack_idx_entry **objects,
+				 uint32_t nr_objects)
+{
+	uint32_t i;
+	for (i = 0; i < nr_objects; i++) {
+		struct object_entry *e = (struct object_entry*)objects[i];
+		hashwrite_be32(f, oe_cruft_mtime(to_pack, e));
+	}
+}
+
+static void write_mtimes_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+static const char *write_mtimes_file(const char *mtimes_name,
+				     struct packing_data *to_pack,
+				     struct pack_idx_entry **objects,
+				     uint32_t nr_objects,
+				     const unsigned char *hash)
+{
+	struct hashfile *f;
+	int fd;
+
+	if (!to_pack)
+		BUG("cannot call write_mtimes_file with NULL packing_data");
+
+	if (!mtimes_name) {
+		struct strbuf tmp_file = STRBUF_INIT;
+		fd = odb_mkstemp(&tmp_file, "pack/tmp_mtimes_XXXXXX");
+		mtimes_name = strbuf_detach(&tmp_file, NULL);
+	} else {
+		unlink(mtimes_name);
+		fd = xopen(mtimes_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+	}
+	f = hashfd(fd, mtimes_name);
+
+	write_mtimes_header(f);
+	write_mtimes_objects(f, to_pack, objects, nr_objects);
+	write_mtimes_trailer(f, hash);
+
+	if (adjust_shared_perm(mtimes_name) < 0)
+		die(_("failed to make %s readable"), mtimes_name);
+
+	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
+			  CSUM_HASH_IN_STREAM | CSUM_CLOSE | CSUM_FSYNC);
+
+	return mtimes_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -479,6 +547,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 char **idx_tmp_name)
 {
 	const char *rev_tmp_name = NULL;
+	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
 		die_errno("unable to make temporary pack file readable");
@@ -491,9 +560,17 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
 				      pack_idx_opts->flags);
 
+	if (pack_idx_opts->flags & WRITE_MTIMES) {
+		mtimes_tmp_name = write_mtimes_file(NULL, to_pack, written_list,
+						    nr_written,
+						    hash);
+	}
+
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
 	if (rev_tmp_name)
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+	if (mtimes_tmp_name)
+		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
diff --git a/pack.h b/pack.h
index fd27cfdfd7..01d385903a 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,7 @@ struct pack_idx_option {
 #define WRITE_IDX_STRICT 02
 #define WRITE_REV 04
 #define WRITE_REV_VERIFY 010
+#define WRITE_MTIMES 020
 
 	uint32_t version;
 	uint32_t off32_limit;
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 06/17] t/helper: add 'pack-mtimes' test-tool
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (4 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In the next patch, we will implement and test support for writing a
cruft pack via a special mode of `git pack-objects`. To make sure that
objects are written with the correct timestamps, add a new test-tool
that can dump the object names and corresponding timestamps from a given
`.mtimes` file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Makefile                    |  1 +
 t/helper/test-pack-mtimes.c | 56 +++++++++++++++++++++++++++++++++++++
 t/helper/test-tool.c        |  1 +
 t/helper/test-tool.h        |  1 +
 4 files changed, 59 insertions(+)
 create mode 100644 t/helper/test-pack-mtimes.c

diff --git a/Makefile b/Makefile
index a299580b7c..0b6eab0453 100644
--- a/Makefile
+++ b/Makefile
@@ -738,6 +738,7 @@ TEST_BUILTINS_OBJS += test-oid-array.o
 TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-oidtree.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
+TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
diff --git a/t/helper/test-pack-mtimes.c b/t/helper/test-pack-mtimes.c
new file mode 100644
index 0000000000..f7b79daf4c
--- /dev/null
+++ b/t/helper/test-pack-mtimes.c
@@ -0,0 +1,56 @@
+#include "git-compat-util.h"
+#include "test-tool.h"
+#include "strbuf.h"
+#include "object-store.h"
+#include "packfile.h"
+#include "pack-mtimes.h"
+
+static void dump_mtimes(struct packed_git *p)
+{
+	uint32_t i;
+	if (load_pack_mtimes(p) < 0)
+		die("could not load pack .mtimes");
+
+	for (i = 0; i < p->num_objects; i++) {
+		struct object_id oid;
+		if (nth_packed_object_id(&oid, p, i) < 0)
+			die("could not load object id at position %"PRIu32, i);
+
+		printf("%s %"PRIu32"\n",
+		       oid_to_hex(&oid), nth_packed_mtime(p, i));
+	}
+}
+
+static const char *pack_mtimes_usage = "\n"
+"  test-tool pack-mtimes <pack-name.mtimes>";
+
+int cmd__pack_mtimes(int argc, const char **argv)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct packed_git *p;
+
+	setup_git_directory();
+
+	if (argc != 2)
+		usage(pack_mtimes_usage);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		strbuf_addstr(&buf, basename(p->pack_name));
+		strbuf_strip_suffix(&buf, ".pack");
+		strbuf_addstr(&buf, ".mtimes");
+
+		if (!strcmp(buf.buf, argv[1]))
+			break;
+
+		strbuf_reset(&buf);
+	}
+
+	strbuf_release(&buf);
+
+	if (!p)
+		die("could not find pack '%s'", argv[1]);
+
+	dump_mtimes(p);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 0424f7adf5..d2eacd302d 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -48,6 +48,7 @@ static struct test_cmd cmds[] = {
 	{ "oidmap", cmd__oidmap },
 	{ "oidtree", cmd__oidtree },
 	{ "online-cpus", cmd__online_cpus },
+	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "partial-clone", cmd__partial_clone },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c876e8246f..960cc27ef7 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -38,6 +38,7 @@ int cmd__mktemp(int argc, const char **argv);
 int cmd__oidmap(int argc, const char **argv);
 int cmd__oidtree(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
+int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__partial_clone(int argc, const char **argv);
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 07/17] builtin/pack-objects.c: return from create_object_entry()
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (5 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

A new caller in the next commit will want to immediately modify the
object_entry structure created by create_object_entry(). Instead of
forcing that caller to wastefully look up the entry we just created,
return it from create_object_entry() instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6ac927047c..c6d16872ee 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1516,13 +1516,13 @@ static int want_object_in_pack(const struct object_id *oid,
 	return 1;
 }
 
-static void create_object_entry(const struct object_id *oid,
-				enum object_type type,
-				uint32_t hash,
-				int exclude,
-				int no_try_delta,
-				struct packed_git *found_pack,
-				off_t found_offset)
+static struct object_entry *create_object_entry(const struct object_id *oid,
+						enum object_type type,
+						uint32_t hash,
+						int exclude,
+						int no_try_delta,
+						struct packed_git *found_pack,
+						off_t found_offset)
 {
 	struct object_entry *entry;
 
@@ -1539,6 +1539,8 @@ static void create_object_entry(const struct object_id *oid,
 	}
 
 	entry->no_try_delta = no_try_delta;
+
+	return entry;
 }
 
 static const char no_closure_warning[] = N_(
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 08/17] builtin/pack-objects.c: --cruft without expiration
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (6 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are not going to be removed (we'll call these the
    surviving ones) are marked as kept in-core.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt |  30 ++++
 builtin/pack-objects.c             | 201 +++++++++++++++++++++++++-
 object-file.c                      |   2 +-
 object-store.h                     |   2 +
 t/t5329-pack-objects-cruft.sh      | 218 +++++++++++++++++++++++++++++
 5 files changed, 448 insertions(+), 5 deletions(-)
 create mode 100755 t/t5329-pack-objects-cruft.sh
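
The selection rule described in the commit message can be sketched
outside of Git as follows. In this model the caller has already decided
which packs are marked as kept in-core; every copy of an object found
in the repository is a "sighting" with a source pack (or loose) and an
mtime. An object is collected only if no copy of it lives in a kept
pack, and when an object is seen more than once the newest mtime wins,
mirroring the comparison in add_cruft_object_entry(). The structures,
the integer object ids, and the sketch_* names are all hypothetical
simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LOOSE (-1)

struct sketch_sighting {
	int oid;        /* stand-in for an object id */
	int pack;       /* index into kept[], or SKETCH_LOOSE */
	uint32_t mtime; /* mtime associated with this copy */
};

struct sketch_cruft {
	int oid;
	uint32_t mtime;
};

/* Does any copy of "oid" live in a pack marked as kept in-core? */
static int sketch_in_kept_pack(const struct sketch_sighting *s, size_t nr,
			       const int *kept, int oid)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (s[i].oid == oid && s[i].pack != SKETCH_LOOSE &&
		    kept[s[i].pack])
			return 1;
	return 0;
}

/*
 * Fill "out" (capacity of at least "nr" entries) with the objects that
 * would go into the cruft pack and return how many were written.
 */
static size_t sketch_collect_cruft(const struct sketch_sighting *s, size_t nr,
				   const int *kept, struct sketch_cruft *out)
{
	size_t i, j, nr_out = 0;

	assert(s && kept && out);
	for (i = 0; i < nr; i++) {
		if (sketch_in_kept_pack(s, nr, kept, s[i].oid))
			continue; /* present in a kept pack: not cruft */

		for (j = 0; j < nr_out; j++)
			if (out[j].oid == s[i].oid)
				break;
		if (j == nr_out) {
			/* first sighting: add with a zero mtime */
			out[nr_out].oid = s[i].oid;
			out[nr_out].mtime = 0;
			nr_out++;
		}
		if (s[i].mtime > out[j].mtime)
			out[j].mtime = s[i].mtime; /* keep the newest mtime */
	}
	return nr_out;
}
```

The quadratic scans keep the sketch short; the patch itself relies on
Git's existing pack and packing-list lookups rather than anything like
this.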

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f8344e1e5b..a9995a932c 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+	[--cruft] [--cruft-expiration=<time>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -95,6 +96,35 @@ base-name::
 Incompatible with `--revs`, or options that imply `--revs` (such as
 `--all`), with the exception of `--unpacked`, which is compatible.
 
+--cruft::
+	Packs unreachable objects into a separate "cruft" pack, denoted
+	by the existence of a `.mtimes` file. Typically used by `git
+	repack --cruft`. Callers provide a list of pack names and
+	indicate which packs will remain in the repository, along with
+	which packs will be deleted (indicated by the `-` prefix). The
+	contents of the cruft pack are all objects not contained in the
+	surviving packs which have not exceeded the grace period (see
+	`--cruft-expiration` below), or which have exceeded the grace
+	period, but are reachable from another object which hasn't.
++
+When the input lists a pack containing all reachable objects (and lists
+all other packs as pending deletion), the corresponding cruft pack will
+contain all unreachable objects (with mtime newer than the
+`--cruft-expiration`) along with any unreachable objects whose mtime is
+older than the `--cruft-expiration`, but are reachable from an
+unreachable object whose mtime is newer than the `--cruft-expiration`.
++
+Incompatible with `--unpack-unreachable`, `--keep-unreachable`,
+`--pack-loose-unreachable`, `--stdin-packs`, as well as any other
+options which imply `--revs`. Also incompatible with `--max-pack-size`;
+when this option is set, the maximum pack size is not inferred from
+`pack.packSizeLimit`.
+
+--cruft-expiration=<approxidate>::
+	If specified, objects are eliminated from the cruft pack if they
+	have an mtime older than `<approxidate>`. If unspecified (and
+	given `--cruft`), then no objects are eliminated.
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c6d16872ee..9cf89be673 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -36,6 +36,7 @@
 #include "trace2.h"
 #include "shallow.h"
 #include "promisor-remote.h"
+#include "pack-mtimes.h"
 
 /*
  * Objects we are going to pack are collected in the `to_pack` structure.
@@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1;
 static int keep_unreachable, unpack_unreachable, include_tag;
 static timestamp_t unpack_unreachable_expiration;
 static int pack_loose_unreachable;
+static int cruft;
+static timestamp_t cruft_expiration;
 static int local;
 static int have_non_local_packs;
 static int incremental;
@@ -1260,6 +1263,9 @@ static void write_pack_file(void)
 					&to_pack, written_list, nr_written);
 			}
 
+			if (cruft)
+				pack_idx_opts.flags |= WRITE_MTIMES;
+
 			stage_tmp_packfiles(&tmpname, pack_tmp_name,
 					    written_list, nr_written,
 					    &to_pack, &pack_idx_opts, hash,
@@ -3397,6 +3403,135 @@ static void read_packs_list_from_stdin(void)
 	string_list_clear(&exclude_packs, 0);
 }
 
+static void add_cruft_object_entry(const struct object_id *oid, enum object_type type,
+				   struct packed_git *pack, off_t offset,
+				   const char *name, uint32_t mtime)
+{
+	struct object_entry *entry;
+
+	display_progress(progress_state, ++nr_seen);
+
+	entry = packlist_find(&to_pack, oid);
+	if (entry) {
+		if (name) {
+			entry->hash = pack_name_hash(name);
+			entry->no_try_delta = no_try_delta(name);
+		}
+	} else {
+		if (!want_object_in_pack(oid, 0, &pack, &offset))
+			return;
+		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
+			/*
+			 * If a traversed tree has a missing blob then we want
+			 * to avoid adding that missing object to our pack.
+			 *
+			 * This only applies to missing blobs, not trees,
+			 * because the traversal needs to parse sub-trees but
+			 * not blobs.
+			 *
+			 * Note we only perform this check when we couldn't
+			 * already find the object in a pack, so we're really
+			 * limited to "ensure non-tip blobs which don't exist in
+			 * packs do exist via loose objects". Confused?
+			 */
+			return;
+		}
+
+		entry = create_object_entry(oid, type, pack_name_hash(name),
+					    0, name && no_try_delta(name),
+					    pack, offset);
+	}
+
+	if (mtime > oe_cruft_mtime(&to_pack, entry))
+		oe_set_cruft_mtime(&to_pack, entry, mtime);
+	return;
+}
+
+static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
+{
+	struct string_list_item *item = NULL;
+	for_each_string_list_item(item, packs) {
+		struct packed_git *p = item->util;
+		if (!p)
+			die(_("could not find pack '%s'"), item->string);
+		p->pack_keep_in_core = keep;
+	}
+}
+
+static void add_unreachable_loose_objects(void);
+static void add_objects_in_unpacked_packs(void);
+
+static void enumerate_cruft_objects(void)
+{
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+
+	add_objects_in_unpacked_packs();
+	add_unreachable_loose_objects();
+
+	stop_progress(&progress_state);
+}
+
+static void read_cruft_objects(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list discard_packs = STRING_LIST_INIT_DUP;
+	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
+	struct packed_git *p;
+
+	ignore_packed_keep_in_core = 1;
+
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		if (!buf.len)
+			continue;
+
+		if (*buf.buf == '-')
+			string_list_append(&discard_packs, buf.buf + 1);
+		else
+			string_list_append(&fresh_packs, buf.buf);
+		strbuf_reset(&buf);
+	}
+
+	string_list_sort(&discard_packs);
+	string_list_sort(&fresh_packs);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		const char *pack_name = pack_basename(p);
+		struct string_list_item *item;
+
+		item = string_list_lookup(&fresh_packs, pack_name);
+		if (!item)
+			item = string_list_lookup(&discard_packs, pack_name);
+
+		if (item) {
+			item->util = p;
+		} else {
+			/*
+			 * This pack wasn't mentioned in either the "fresh" or
+			 * "discard" list, so the caller didn't know about it.
+			 *
+			 * Mark it as kept so that its objects are ignored by
+			 * add_unseen_recent_objects_to_traversal(). We'll
+			 * unmark it before starting the traversal so it doesn't
+			 * halt the traversal early.
+			 */
+			p->pack_keep_in_core = 1;
+		}
+	}
+
+	mark_pack_kept_in_core(&fresh_packs, 1);
+	mark_pack_kept_in_core(&discard_packs, 0);
+
+	if (cruft_expiration)
+		die("--cruft-expiration not yet implemented");
+	else
+		enumerate_cruft_objects();
+
+	strbuf_release(&buf);
+	string_list_clear(&discard_packs, 0);
+	string_list_clear(&fresh_packs, 0);
+}
+
 static void read_object_list_from_stdin(void)
 {
 	char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2];
@@ -3529,7 +3664,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid,
 				       uint32_t pos,
 				       void *_data)
 {
-	add_object_entry(oid, OBJ_NONE, "", 0);
+	if (cruft) {
+		off_t offset;
+		time_t mtime;
+
+		if (pack->is_cruft) {
+			if (load_pack_mtimes(pack) < 0)
+				die(_("could not load cruft pack .mtimes"));
+			mtime = nth_packed_mtime(pack, pos);
+		} else {
+			mtime = pack->mtime;
+		}
+		offset = nth_packed_object_offset(pack, pos);
+
+		add_cruft_object_entry(oid, OBJ_NONE, pack, offset,
+				       NULL, mtime);
+	} else {
+		add_object_entry(oid, OBJ_NONE, "", 0);
+	}
 	return 0;
 }
 
@@ -3553,7 +3705,19 @@ static int add_loose_object(const struct object_id *oid, const char *path,
 		return 0;
 	}
 
-	add_object_entry(oid, type, "", 0);
+	if (cruft) {
+		struct stat st;
+		if (stat(path, &st) < 0) {
+			if (errno == ENOENT)
+				return 0;
+			return error_errno("unable to stat %s", oid_to_hex(oid));
+		}
+
+		add_cruft_object_entry(oid, type, NULL, 0, NULL,
+				       st.st_mtime);
+	} else {
+		add_object_entry(oid, type, "", 0);
+	}
 	return 0;
 }
 
@@ -3870,6 +4034,20 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_cruft_expiration(const struct option *opt,
+					 const char *arg, int unset)
+{
+	if (unset) {
+		cruft = 0;
+		cruft_expiration = 0;
+	} else {
+		cruft = 1;
+		if (arg)
+			cruft_expiration = approxidate(arg);
+	}
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -3959,6 +4137,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable),
+		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
+		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
+		  N_("expire cruft objects older than <time>"),
+		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4085,7 +4267,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (!HAVE_THREADS && delta_search_threads != 1)
 		warning(_("no threads support, ignoring --threads"));
-	if (!pack_to_stdout && !pack_size_limit)
+	if (!pack_to_stdout && !pack_size_limit && !cruft)
 		pack_size_limit = pack_size_limit_cfg;
 	if (pack_to_stdout && pack_size_limit)
 		die(_("--max-pack-size cannot be used to build a pack for transfer"));
@@ -4112,6 +4294,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
 
+	if (cruft) {
+		if (use_internal_rev_list)
+			die(_("cannot use internal rev list with --cruft"));
+		if (stdin_packs)
+			die(_("cannot use --stdin-packs with --cruft"));
+		if (pack_size_limit)
+			die(_("cannot use --max-pack-size with --cruft"));
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -4168,7 +4359,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    the_repository);
 	prepare_packing_data(the_repository, &to_pack);
 
-	if (progress)
+	if (progress && !cruft)
 		progress_state = start_progress(_("Enumerating objects"), 0);
 	if (stdin_packs) {
 		/* avoids adding objects in excluded packs */
@@ -4176,6 +4367,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		read_packs_list_from_stdin();
 		if (rev_list_unpacked)
 			add_unreachable_loose_objects();
+	} else if (cruft) {
+		read_cruft_objects();
 	} else if (!use_internal_rev_list) {
 		read_object_list_from_stdin();
 	} else if (pfd.have_revs) {
diff --git a/object-file.c b/object-file.c
index 5ffbf3d4fd..ff0cffe68e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -997,7 +997,7 @@ int has_loose_object_nonlocal(const struct object_id *oid)
 	return check_and_freshen_nonlocal(oid, 0);
 }
 
-static int has_loose_object(const struct object_id *oid)
+int has_loose_object(const struct object_id *oid)
 {
 	return check_and_freshen(oid, 0);
 }
diff --git a/object-store.h b/object-store.h
index 2c4671ed7a..c41609e8db 100644
--- a/object-store.h
+++ b/object-store.h
@@ -330,6 +330,8 @@ int repo_has_object_file_with_flags(struct repository *r,
  */
 int has_loose_object_nonlocal(const struct object_id *);
 
+int has_loose_object(const struct object_id *);
+
 /**
  * format_object_header() is a thin wrapper around s xsnprintf() that
  * writes the initial "<type> <obj-len>" part of the loose object
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
new file mode 100755
index 0000000000..003ca7344e
--- /dev/null
+++ b/t/t5329-pack-objects-cruft.sh
@@ -0,0 +1,218 @@
+#!/bin/sh
+
+test_description='cruft pack related pack-objects tests'
+. ./test-lib.sh
+
+objdir=.git/objects
+packdir=$objdir/pack
+
+basic_cruft_pack_tests () {
+	expire="$1"
+
+	test_expect_success "unreachable loose objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit base &&
+			git repack -Ad &&
+			test_commit loose &&
+
+			test-tool chmtime +2000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose:loose.t))" &&
+			test-tool chmtime +1000 "$objdir/$(test_oid_to_path \
+				$(git rev-parse loose^{tree}))" &&
+
+			(
+				git rev-list --objects --no-object-names base..loose |
+				while read oid
+				do
+					path="$objdir/$(test_oid_to_path "$oid")" &&
+					printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")"
+				done |
+				sort -k1
+			) >expect &&
+
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			cruft="$(echo $keep | git pack-objects --cruft \
+				--cruft-expiration="$expire" $packdir/pack)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable packed objects are packed (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+			other="$(git pack-objects --delta-base-offset \
+				$packdir/pack <objects)" &&
+			git prune-packed &&
+
+			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
+
+			cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$other.pack
+			EOF
+			)" &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			cut -d" " -f2 <actual.raw | sort -u >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "unreachable cruft objects are repacked (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit packed &&
+			git repack -Ad &&
+			test_commit other &&
+
+			git rev-list --objects --no-object-names packed.. >objects &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" &&
+			git prune-packed &&
+			cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF
+			$keep
+			-pack-$cruft_a.pack
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw &&
+
+			sort <expect.raw >expect &&
+			sort <actual.raw >actual &&
+
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "multiple cruft packs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			git repack -Ad &&
+			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+
+			test_commit cruft &&
+			loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" &&
+
+			# generate three copies of the cruft object in different
+			# cruft packs, each with a unique mtime:
+			#   - one expired (1000 seconds ago)
+			#   - two non-expired (one 1000 seconds in the future,
+			#     one 1500 seconds in the future)
+			test-tool chmtime =-1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-A <<-EOF &&
+			$keep
+			EOF
+			test-tool chmtime =+1000 "$loose" &&
+			git pack-objects --cruft $packdir/pack-B <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			EOF
+			test-tool chmtime =+1500 "$loose" &&
+			git pack-objects --cruft $packdir/pack-C <<-EOF &&
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			EOF
+
+			# ensure the resulting cruft pack takes the most recent
+			# mtime among all copies
+			cruft="$(git pack-objects --cruft \
+				--cruft-expiration="$expire" \
+				$packdir/pack <<-EOF
+			$keep
+			-$(basename $(ls $packdir/pack-A-*.pack))
+			-$(basename $(ls $packdir/pack-B-*.pack))
+			-$(basename $(ls $packdir/pack-C-*.pack))
+			EOF
+			)" &&
+
+			test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw &&
+			test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+			sort expect.raw >expect &&
+			sort actual.raw >actual &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing trees (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			tree="$(git rev-parse cruft^{tree})" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable tree, but leave the commit
+			# which has it as its root tree intact
+			rm -fr "$objdir/$(test_oid_to_path "$tree")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+
+	test_expect_success "cruft packs tolerate missing blobs (expire $expire)" '
+		git init repo &&
+		test_when_finished "rm -fr repo" &&
+		(
+			cd repo &&
+
+			test_commit reachable &&
+			test_commit cruft &&
+
+			blob="$(git rev-parse cruft:cruft.t)" &&
+
+			git reset --hard reachable &&
+			git tag -d cruft &&
+			git reflog expire --all --expire=all &&
+
+			# remove the unreachable blob, but leave the commit (and
+			# the root tree of that commit) intact
+			rm -fr "$objdir/$(test_oid_to_path "$blob")" &&
+
+			git repack -Ad &&
+			basename $(ls $packdir/pack-*.pack) >in &&
+			git pack-objects --cruft --cruft-expiration="$expire" \
+				$packdir/pack <in
+		)
+	'
+}
+
+basic_cruft_pack_tests never
+
+test_done
-- 
2.36.1.94.gb0d54bedca


* [PATCH v5 09/17] reachable: add options to add_unseen_recent_objects_to_traversal
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (7 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:17   ` [PATCH v5 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

This function behaves very similarly to what we will need in
pack-objects in order to implement cruft packs with expiration. But it
is lacking a couple of things. Namely, it needs:

  - a mechanism to communicate the timestamps of individual recent
    objects to some external caller

  - and, in the case of packed objects, our future caller will also want
    to know the originating pack, as well as the offset within that pack
    at which the object can be found

  - finally, it needs a way to skip over packs which are marked as kept
    in-core.

To address the first two, add a callback interface in this patch which
reports the time of each recent object, as well as a (packed_git,
off_t) pair for packed objects.

Likewise, add a new option to the packed object iterators to skip over
packs which are marked as kept in core. This option will become
implicitly tested in a future patch.
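
As a rough illustration of the shape of that interface (not the code in
this patch), here is a minimal, self-contained sketch. Every `toy_`
name below is hypothetical and only stands in for the real Git types;
in particular, the real code checks whether an object lives in *any*
in-core kept pack, while this model only looks at the pack the object
was found in.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for Git's structures; not the real definitions. */
struct toy_pack {
	const char *name;
	int kept_in_core;	/* models the pack_keep_in_core bit */
};

struct toy_object {
	const char *id;
	struct toy_pack *pack;	/* NULL for a loose object */
	long offset;		/* offset within the pack, 0 if loose */
	time_t mtime;
};

/* Callback type: reports the object, its (pack, offset) pair and its mtime. */
typedef void toy_report_fn(const struct toy_object *obj,
			   struct toy_pack *pack, long offset, time_t mtime);

/* Example callback: just counts how many objects were reported. */
static size_t toy_reported;

static void toy_count(const struct toy_object *obj, struct toy_pack *pack,
		      long offset, time_t mtime)
{
	(void)obj; (void)pack; (void)offset; (void)mtime;
	toy_reported++;
}

/*
 * Visit every object strictly newer than `cutoff`. When `ignore_kept`
 * is set, objects found in a kept pack are skipped entirely. Returns
 * the number of objects visited; `cb` may be NULL.
 */
static size_t toy_for_each_recent(const struct toy_object *objs, size_t nr,
				  time_t cutoff, toy_report_fn *cb,
				  int ignore_kept)
{
	size_t i, visited = 0;

	for (i = 0; i < nr; i++) {
		const struct toy_object *o = &objs[i];

		if (ignore_kept && o->pack && o->pack->kept_in_core)
			continue;
		if (o->mtime <= cutoff)
			continue;
		visited++;
		if (cb)
			cb(o, o->pack, o->offset, o->mtime);
	}
	return visited;
}
```

Passing a NULL callback and a zero `ignore_kept` flag keeps the old
behavior, which is what the existing callers touched by this patch do.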

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  2 +-
 reachable.c            | 51 +++++++++++++++++++++++++++++++++++-------
 reachable.h            |  9 +++++++-
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 9cf89be673..3b8bf6a3dd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3957,7 +3957,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	if (unpack_unreachable_expiration) {
 		revs->ignore_missing_links = 1;
 		if (add_unseen_recent_objects_to_traversal(revs,
-				unpack_unreachable_expiration))
+				unpack_unreachable_expiration, NULL, 0))
 			die(_("unable to add recent objects"));
 		if (prepare_revision_walk(revs))
 			die(_("revision walk setup failed"));
diff --git a/reachable.c b/reachable.c
index b9f4ad886e..d4507c4270 100644
--- a/reachable.c
+++ b/reachable.c
@@ -60,9 +60,13 @@ static void mark_commit(struct commit *c, void *data)
 struct recent_data {
 	struct rev_info *revs;
 	timestamp_t timestamp;
+	report_recent_object_fn *cb;
+	int ignore_in_core_kept_packs;
 };
 
 static void add_recent_object(const struct object_id *oid,
+			      struct packed_git *pack,
+			      off_t offset,
 			      timestamp_t mtime,
 			      struct recent_data *data)
 {
@@ -103,13 +107,29 @@ static void add_recent_object(const struct object_id *oid,
 		die("unable to lookup %s", oid_to_hex(oid));
 
 	add_pending_object(data->revs, obj, "");
+	if (data->cb)
+		data->cb(obj, pack, offset, mtime);
+}
+
+static int want_recent_object(struct recent_data *data,
+			      const struct object_id *oid)
+{
+	if (data->ignore_in_core_kept_packs &&
+	    has_object_kept_pack(oid, IN_CORE_KEEP_PACKS))
+		return 0;
+	return 1;
 }
 
 static int add_recent_loose(const struct object_id *oid,
 			    const char *path, void *data)
 {
 	struct stat st;
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
@@ -126,7 +146,7 @@ static int add_recent_loose(const struct object_id *oid,
 		return error_errno("unable to stat %s", oid_to_hex(oid));
 	}
 
-	add_recent_object(oid, st.st_mtime, data);
+	add_recent_object(oid, NULL, 0, st.st_mtime, data);
 	return 0;
 }
 
@@ -134,29 +154,43 @@ static int add_recent_packed(const struct object_id *oid,
 			     struct packed_git *p, uint32_t pos,
 			     void *data)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj;
+
+	if (!want_recent_object(data, oid))
+		return 0;
+
+	obj = lookup_object(the_repository, oid);
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p->mtime, data);
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
 	return 0;
 }
 
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp)
+					   timestamp_t timestamp,
+					   report_recent_object_fn *cb,
+					   int ignore_in_core_kept_packs)
 {
 	struct recent_data data;
+	enum for_each_object_flags flags;
 	int r;
 
 	data.revs = revs;
 	data.timestamp = timestamp;
+	data.cb = cb;
+	data.ignore_in_core_kept_packs = ignore_in_core_kept_packs;
 
 	r = for_each_loose_object(add_recent_loose, &data,
 				  FOR_EACH_OBJECT_LOCAL_ONLY);
 	if (r)
 		return r;
-	return for_each_packed_object(add_recent_packed, &data,
-				      FOR_EACH_OBJECT_LOCAL_ONLY);
+
+	flags = FOR_EACH_OBJECT_LOCAL_ONLY | FOR_EACH_OBJECT_PACK_ORDER;
+	if (ignore_in_core_kept_packs)
+		flags |= FOR_EACH_OBJECT_SKIP_IN_CORE_KEPT_PACKS;
+
+	return for_each_packed_object(add_recent_packed, &data, flags);
 }
 
 static int mark_object_seen(const struct object_id *oid,
@@ -217,7 +251,8 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	if (mark_recent) {
 		revs->ignore_missing_links = 1;
-		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent,
+							   NULL, 0))
 			die("unable to mark recent objects");
 		if (prepare_revision_walk(revs))
 			die("revision walk setup failed");
diff --git a/reachable.h b/reachable.h
index 5df932ad8f..b776761baa 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,11 +1,18 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
+#include "object.h"
+
 struct progress;
 struct rev_info;
 
+typedef void report_recent_object_fn(const struct object *, struct packed_git *,
+				     off_t, time_t);
+
 int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-					   timestamp_t timestamp);
+					   timestamp_t timestamp,
+					   report_recent_object_fn cb,
+					   int ignore_in_core_kept_packs);
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    timestamp_t mark_recent, struct progress *);
 
-- 
2.36.1.94.gb0d54bedca


* [PATCH v5 10/17] reachable: report precise timestamps from objects in cruft packs
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (8 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
@ 2022-05-20 23:17   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:17 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

When generating a cruft pack, the caller within pack-objects will want
to know the precise timestamps of cruft objects (i.e., their
corresponding values in the .mtimes table) rather than the mtime of the
cruft pack itself.

Teach add_recent_packed() to look up each object's precise mtime from the
.mtimes file if one exists (indicated by the is_cruft bit on the
packed_git structure).

A couple of small things worth noting here:

  - load_pack_mtimes() needs to be called before asking for
    nth_packed_mtime(), and that call is done lazily here. That function
    exits early if the .mtimes file has already been opened and parsed,
    so only the first call is slow.

  - Checking the is_cruft bit can be done without any extra work on the
    caller's behalf, since it is set up for us automatically as a
    side-effect of calling add_packed_git() (just like the 'pack_keep'
    and 'pack_promisor' bits).
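
The lazy-loading pattern described in the first point can be sketched
roughly as follows. This is a toy model, not the pack-mtimes code: the
`toy_` names, the `on_disk` field (standing in for the parsed .mtimes
contents) and the `loads` counter are all assumptions made for the
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pack; not Git's struct packed_git. */
struct toy_cruft_pack {
	int is_cruft;		/* set when the pack has a .mtimes file */
	uint32_t pack_mtime;	/* mtime of the pack file itself */
	const uint32_t *mtimes;	/* per-object table, NULL until loaded */
	const uint32_t *on_disk;/* stands in for the .mtimes file */
	int loads;		/* how many times we really "opened" it */
};

/* Load the table on first use; later calls return immediately. */
static int toy_load_mtimes(struct toy_cruft_pack *p)
{
	if (p->mtimes)
		return 0;	/* already loaded: cheap early exit */
	if (!p->on_disk)
		return -1;	/* nothing to load */
	p->mtimes = p->on_disk;
	p->loads++;
	return 0;
}

/*
 * Per-object mtime for cruft packs, pack-wide mtime otherwise.
 * Returns 0 on success and stores the result in `out`.
 */
static int toy_object_mtime(struct toy_cruft_pack *p, size_t pos,
			    uint32_t *out)
{
	if (!p->is_cruft) {
		*out = p->pack_mtime;
		return 0;
	}
	if (toy_load_mtimes(p) < 0)
		return -1;
	*out = p->mtimes[pos];
	return 0;
}
```

Calling `toy_object_mtime()` once per object therefore pays the loading
cost only on the first call for each cruft pack.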

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 reachable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index d4507c4270..aba63ebeb3 100644
--- a/reachable.c
+++ b/reachable.c
@@ -13,6 +13,7 @@
 #include "worktree.h"
 #include "object-store.h"
 #include "pack-bitmap.h"
+#include "pack-mtimes.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -155,6 +156,7 @@ static int add_recent_packed(const struct object_id *oid,
 			     void *data)
 {
 	struct object *obj;
+	timestamp_t mtime = p->mtime;
 
 	if (!want_recent_object(data, oid))
 		return 0;
@@ -163,7 +165,12 @@ static int add_recent_packed(const struct object_id *oid,
 
 	if (obj && obj->flags & SEEN)
 		return 0;
-	add_recent_object(oid, p, nth_packed_object_offset(p, pos), p->mtime, data);
+	if (p->is_cruft) {
+		if (load_pack_mtimes(p) < 0)
+			die(_("could not load cruft pack .mtimes"));
+		mtime = nth_packed_mtime(p, pos);
+	}
+	add_recent_object(oid, p, nth_packed_object_offset(p, pos), mtime, data);
 	return 0;
 }
 
-- 
2.36.1.94.gb0d54bedca


* [PATCH v5 11/17] builtin/pack-objects.c: --cruft with expiration
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (9 preceding siblings ...)
  2022-05-20 23:17   ` [PATCH v5 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

In a previous patch, pack-objects learned how to generate a cruft pack
so long as no objects are dropped.

This patch teaches pack-objects to handle the case where a non-never
`--cruft-expiration` value is passed. This case is slightly more
complicated than before, because we want pack-objects to save
unreachable objects which would otherwise have been pruned whenever
some other recent (i.e., non-prunable) unreachable object reaches them.
We'll call these objects "unreachable but reachable-from-recent".

Here is how pack-objects handles `--cruft-expiration`:

  - Instead of adding all objects outside of the kept pack(s) into the
    packing list, only handle the ones whose mtime is within the grace
    period.

  - Construct a reachability traversal whose tips are the
    unreachable-but-recent objects.

  - Then, walk along that traversal, stopping if we reach an object in
    the kept pack. At each step along the traversal, we add the object
    we are visiting to the packing list.

In the majority of these cases, any object we visit in this traversal
will already be in our packing list. But we will sometimes encounter
reachable-from-recent cruft objects, which we want to retain even if
they aged out of the grace period.

The most subtle point of this process is that we actually don't need to
bother to update the rescued object's mtime. Even though we will write
an .mtimes file with a value that is older than the expiration window,
it will continue to survive cruft repacks so long as any objects which
reach it haven't aged out.

That is, a future repack will also exclude that object from the initial
packing list, only to discover it later on when doing the reachability
traversal.

Finally, stopping early once an object is found in a kept pack is safe
to do because the kept packs ordinarily represent which packs will
survive after repacking. Assuming that it _isn't_ safe to halt a
traversal early would mean that there is some ancestor object which is
missing, which implies repository corruption (i.e., the complete set of
reachable objects isn't present).
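
To make the steps above concrete, here is a small sketch over a toy
object graph. It is only a model of the idea, under several
assumptions: objects are array slots, references are array indices,
and all `toy_` names are hypothetical rather than anything in
pack-objects. It also leaves every mtime untouched and does not model
what ends up in the .mtimes file.

```c
#include <stddef.h>
#include <time.h>

#define TOY_MAX_EDGES 4

/* Hypothetical object graph node; not a Git structure. */
struct toy_node {
	time_t mtime;
	int kept;			/* lives in a pack that will survive */
	int nr_edges;
	int edges[TOY_MAX_EDGES];	/* indices of referenced objects */
	int visited;
	int in_cruft;			/* output: goes into the cruft pack */
};

/* Walk from node `i`, halting at kept packs and at nodes already seen. */
static void toy_walk(struct toy_node *g, int i)
{
	int e;

	if (g[i].kept || g[i].visited)
		return;
	g[i].visited = 1;
	g[i].in_cruft = 1;	/* recent tip, or an old object being rescued */
	for (e = 0; e < g[i].nr_edges; e++)
		toy_walk(g, g[i].edges[e]);
}

/*
 * Use every recent object outside the kept packs as a traversal tip,
 * then collect everything those tips reach. Returns the number of
 * objects selected for the cruft pack.
 */
static int toy_collect_cruft(struct toy_node *g, int nr, time_t cutoff)
{
	int i, count = 0;

	for (i = 0; i < nr; i++)
		if (!g[i].kept && g[i].mtime > cutoff)
			toy_walk(g, i);
	for (i = 0; i < nr; i++)
		count += g[i].in_cruft;
	return count;
}
```

An old object that no recent object reaches is never visited, so it is
left out of the cruft pack; an old object reached from a recent tip is
selected again on every run for as long as that tip stays recent.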

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c        |  84 +++++++++++++++++++-
 reachable.h                   |   4 +-
 t/t5329-pack-objects-cruft.sh | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3b8bf6a3dd..8decc9dc0c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3447,6 +3447,44 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
 	return;
 }
 
+static void show_cruft_object(struct object *obj, const char *name, void *data)
+{
+	/*
+	 * if we did not record it earlier, it's at least as old as our
+	 * expiration value. Rather than find it exactly, just use that
+	 * value.  This may bump it forward from its real mtime, but it
+	 * will still be "too old" next time we run with the same
+	 * expiration.
+	 *
+	 * if obj does appear in the packing list, this call is a noop (or may
+	 * set the namehash).
+	 */
+	add_cruft_object_entry(&obj->oid, obj->type, NULL, 0, name, cruft_expiration);
+}
+
+static void show_cruft_commit(struct commit *commit, void *data)
+{
+	show_cruft_object((struct object*)commit, NULL, data);
+}
+
+static int cruft_include_check_obj(struct object *obj, void *data)
+{
+	return !has_object_kept_pack(&obj->oid, IN_CORE_KEEP_PACKS);
+}
+
+static int cruft_include_check(struct commit *commit, void *data)
+{
+	return cruft_include_check_obj((struct object*)commit, data);
+}
+
+static void set_cruft_mtime(const struct object *object,
+			    struct packed_git *pack,
+			    off_t offset, time_t mtime)
+{
+	add_cruft_object_entry(&object->oid, object->type, pack, offset, NULL,
+			       mtime);
+}
+
 static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
 {
 	struct string_list_item *item = NULL;
@@ -3472,6 +3510,50 @@ static void enumerate_cruft_objects(void)
 	stop_progress(&progress_state);
 }
 
+static void enumerate_and_traverse_cruft_objects(struct string_list *fresh_packs)
+{
+	struct packed_git *p;
+	struct rev_info revs;
+	int ret;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+
+	revs.tag_objects = 1;
+	revs.tree_objects = 1;
+	revs.blob_objects = 1;
+
+	revs.include_check = cruft_include_check;
+	revs.include_check_obj = cruft_include_check_obj;
+
+	revs.ignore_missing_links = 1;
+
+	if (progress)
+		progress_state = start_progress(_("Enumerating cruft objects"), 0);
+	ret = add_unseen_recent_objects_to_traversal(&revs, cruft_expiration,
+						     set_cruft_mtime, 1);
+	stop_progress(&progress_state);
+
+	if (ret)
+		die(_("unable to add cruft objects"));
+
+	/*
+	 * Re-mark only the fresh packs as kept so that objects in
+	 * unknown packs do not halt the reachability traversal early.
+	 */
+	for (p = get_all_packs(the_repository); p; p = p->next)
+		p->pack_keep_in_core = 0;
+	mark_pack_kept_in_core(fresh_packs, 1);
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+	if (progress)
+		progress_state = start_progress(_("Traversing cruft objects"), 0);
+	nr_seen = 0;
+	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
+
+	stop_progress(&progress_state);
+}
+
 static void read_cruft_objects(void)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -3523,7 +3605,7 @@ static void read_cruft_objects(void)
 	mark_pack_kept_in_core(&discard_packs, 0);
 
 	if (cruft_expiration)
-		die("--cruft-expiration not yet implemented");
+		enumerate_and_traverse_cruft_objects(&fresh_packs);
 	else
 		enumerate_cruft_objects();
 
diff --git a/reachable.h b/reachable.h
index b776761baa..020a887b99 100644
--- a/reachable.h
+++ b/reachable.h
@@ -1,10 +1,10 @@
 #ifndef REACHEABLE_H
 #define REACHEABLE_H
 
-#include "object.h"
-
 struct progress;
 struct rev_info;
+struct object;
+struct packed_git;
 
 typedef void report_recent_object_fn(const struct object *, struct packed_git *,
 				     off_t, time_t);
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 003ca7344e..939cdc297a 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -214,5 +214,148 @@ basic_cruft_pack_tests () {
 }
 
 basic_cruft_pack_tests never
+basic_cruft_pack_tests 2.weeks.ago
+
+test_expect_success 'cruft tags rescue tagged objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit tagged &&
+		git tag -a annotated -m tag &&
+
+		git rev-list --objects --no-object-names packed.. >objects &&
+		while read oid
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $oid)"
+		done <objects &&
+
+		test-tool chmtime -500 \
+			"$objdir/$(test_oid_to_path $(git rev-parse annotated))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		(
+			cat objects &&
+			git rev-parse annotated
+		) >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual &&
+		cat actual
+	)
+'
+
+test_expect_success 'cruft commits rescue parents, trees' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit old &&
+		test_commit new &&
+
+		git rev-list --objects --no-object-names packed..new >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+		test-tool chmtime +500 "$objdir/$(test_oid_to_path \
+			$(git rev-parse HEAD))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+
+		cut -d" " -f1 <actual.raw | sort >actual &&
+		sort <objects >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft trees rescue sub-trees, blobs' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		mkdir -p dir/sub &&
+		echo foo >foo &&
+		echo bar >dir/bar &&
+		echo baz >dir/sub/baz &&
+
+		test_tick &&
+		git add . &&
+		git commit -m "pruned" &&
+
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD^{tree}))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:foo))" &&
+		test-tool chmtime  -500 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/bar))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub))" &&
+		test-tool chmtime -1000 "$objdir/$(test_oid_to_path $(git rev-parse HEAD:dir/sub/baz))" &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw &&
+		cut -f1 -d" " <actual.raw | sort >actual &&
+
+		git rev-parse HEAD:dir HEAD:dir/bar HEAD:dir/sub HEAD:dir/sub/baz >expect.raw &&
+		sort <expect.raw >expect &&
+
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'expired objects are pruned' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit packed &&
+		git repack -Ad &&
+
+		test_commit pruned &&
+
+		git rev-list --objects --no-object-names packed..pruned >objects &&
+		while read object
+		do
+			test-tool chmtime -1000 \
+				"$objdir/$(test_oid_to_path $object)"
+		done <objects &&
+
+		keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
+		cruft="$(echo $keep | git pack-objects --cruft \
+			--cruft-expiration=750.seconds.ago \
+			$packdir/pack)" &&
+
+		test-tool pack-mtimes "pack-$cruft.mtimes" >actual &&
+		test_must_be_empty actual
+	)
+'
 
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 12/17] builtin/repack.c: support generating a cruft pack
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (10 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.
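
The option interplay this patch introduces can be sketched in isolation.
The following is only an illustration, not code from the patch: the
helper `resolve_cruft_flags()` and the `enum cruft_conflict` type are
hypothetical names, while the three bit values mirror the #defines that
the diff below moves to the top of builtin/repack.c.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values mirror the #defines in builtin/repack.c. */
#define ALL_INTO_ONE 1
#define LOOSEN_UNREACHABLE 2
#define PACK_CRUFT 4

/* Hypothetical result type, for illustration only. */
enum cruft_conflict {
	CRUFT_OK,
	CRUFT_CONFLICTS_WITH_A,
	CRUFT_CONFLICTS_WITH_K
};

/*
 * Hypothetical helper: when PACK_CRUFT is set, imply ALL_INTO_ONE and
 * report which option (if any) cannot be combined with --cruft.
 * Without PACK_CRUFT the flags are left untouched.
 */
static enum cruft_conflict resolve_cruft_flags(int *flags,
					       int unpack_unreachable,
					       int keep_unreachable)
{
	assert(flags != NULL);

	if (!(*flags & PACK_CRUFT))
		return CRUFT_OK;

	*flags |= ALL_INTO_ONE;

	if (unpack_unreachable || (*flags & LOOSEN_UNREACHABLE))
		return CRUFT_CONFLICTS_WITH_A;
	if (keep_unreachable)
		return CRUFT_CONFLICTS_WITH_K;
	return CRUFT_OK;
}
```

In the patch itself the two conflict cases call die() instead of
returning a value; the sketch returns an enum only so that the behavior
can be checked without a Git build.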

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 105 +++++++++++-
 t/t5329-pack-objects-cruft.sh           | 207 ++++++++++++++++++++++++
 4 files changed, 319 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178..0bf13893d8 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@ to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index c0f583cd48..d81f3a8982 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -17,7 +17,7 @@ pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index e7a3920c6d..593c18d4e8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,12 +18,18 @@
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static int run_update_server_info = 1;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -305,9 +311,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -344,6 +347,8 @@ static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -605,6 +610,67 @@ static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -621,7 +687,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress;
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -636,6 +701,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with --cruft, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -688,6 +758,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("options '%s' and '%s' cannot be used together"), "--keep-unreachable", "-A");
 
+	if (pack_everything & PACK_CRUFT) {
+		pack_everything |= ALL_INTO_ONE;
+
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-A");
+		if (keep_unreachable)
+			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -771,7 +850,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -833,6 +913,21 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 939cdc297a..067c50af38 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -358,4 +358,211 @@ test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$tip")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		git rm -rf . &&
+		test_commit --no-tag cruft &&
+		cruft="$(git rev-parse HEAD)" &&
+
+		git checkout main &&
+		git branch -D other &&
+		git reflog expire --all --expire=all &&
+
+		# Pack the objects created in the previous step into a cruft
+		# pack. Intentionally leave loose copies of those objects
+		# around so we can pick them up in a subsequent --geometric
+		# repack.
+		git repack --cruft &&
+
+		# Now make those objects reachable, and ensure that they are
+		# packed into the new pack created via a --geometric repack.
+		git update-ref refs/heads/other $cruft &&
+
+		# Without this object, the set of unpacked objects is exactly
+		# the set of objects already in the cruft pack. Tweak that set
+		# to ensure we do not overwrite the cruft pack entirely.
+		test_commit reachable2 &&
+
+		find $packdir -name "pack-*.idx" | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -name "pack-*.idx" | sort >after &&
+
+		{
+			git rev-list --objects --no-object-names $cruft &&
+			git rev-list --objects --no-object-names reachable..reachable2
+		} >want.raw &&
+		sort want.raw >want &&
+
+		pack=$(comm -13 before after) &&
+		git show-index <$pack >objects.raw &&
+
+		cut -d" " -f2 objects.raw | sort >got &&
+
+		test_cmp want got
+	)
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		git reflog expire --all --expire=all &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 13/17] builtin/repack.c: allow configuring cruft pack generation
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (11 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

On servers which set the pack.window configuration to a large value, we
can wind up spending quite a lot of time finding new bases when breaking
delta chains between reachable and unreachable objects while generating
a cruft pack.

Introduce a handful of `repack.cruft*` configuration variables to
control the parameters used by pack-objects when generating a cruft
pack.
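
The precedence this patch sets up (a `repack.cruft*` value wins, and
otherwise the cruft pack inherits whatever the main pack uses) can be
sketched on its own. This is a reduced illustration rather than the
patch's code: `struct delta_args` is a hypothetical, trimmed-down
stand-in for `struct pack_objects_args`, and `fill_cruft_defaults()` is
a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct pack_objects_args. */
struct delta_args {
	const char *window;
	const char *window_memory;
	const char *depth;
	const char *threads;
};

/*
 * Hypothetical helper: any field the cruft-specific configuration left
 * unset (NULL) falls back to the value used for the main pack.
 */
static void fill_cruft_defaults(struct delta_args *cruft,
				const struct delta_args *general)
{
	assert(cruft != NULL && general != NULL);

	if (!cruft->window)
		cruft->window = general->window;
	if (!cruft->window_memory)
		cruft->window_memory = general->window_memory;
	if (!cruft->depth)
		cruft->depth = general->depth;
	if (!cruft->threads)
		cruft->threads = general->threads;
}
```

With `cruft.window` set to "2" and `general.window` set to "3", the
cruft pack keeps "2", which is the behavior the 'cruft repack respects
repack.cruftWindow' test below checks for.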

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/repack.txt |  9 ++++
 builtin/repack.c                | 49 +++++++++++++------
 t/t5329-pack-objects-cruft.sh   | 83 +++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/repack.txt b/Documentation/config/repack.txt
index 41ac6953c8..c79af6d7b8 100644
--- a/Documentation/config/repack.txt
+++ b/Documentation/config/repack.txt
@@ -30,3 +30,12 @@ repack.updateServerInfo::
 	If set to false, linkgit:git-repack[1] will not run
 	linkgit:git-update-server-info[1]. Defaults to true. Can be overridden
 	when true by the `-n` option of linkgit:git-repack[1].
+
+repack.cruftWindow::
+repack.cruftWindowMemory::
+repack.cruftDepth::
+repack.cruftThreads::
+	Parameters used by linkgit:git-pack-objects[1] when generating
+	a cruft pack and the respective parameters are not given over
+	the command line. See similarly named `pack.*` configuration
+	variables for defaults and meaning.
diff --git a/builtin/repack.c b/builtin/repack.c
index 593c18d4e8..b85483a148 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -41,9 +41,21 @@ static const char incremental_bitmap_conflict_error[] = N_(
 "--no-write-bitmap-index or disable the pack.writebitmaps configuration."
 );
 
+struct pack_objects_args {
+	const char *window;
+	const char *window_memory;
+	const char *depth;
+	const char *threads;
+	const char *max_pack_size;
+	int no_reuse_delta;
+	int no_reuse_object;
+	int quiet;
+	int local;
+};
 
 static int repack_config(const char *var, const char *value, void *cb)
 {
+	struct pack_objects_args *cruft_po_args = cb;
 	if (!strcmp(var, "repack.usedeltabaseoffset")) {
 		delta_base_offset = git_config_bool(var, value);
 		return 0;
@@ -65,6 +77,14 @@ static int repack_config(const char *var, const char *value, void *cb)
 		run_update_server_info = git_config_bool(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "repack.cruftwindow"))
+		return git_config_string(&cruft_po_args->window, var, value);
+	if (!strcmp(var, "repack.cruftwindowmemory"))
+		return git_config_string(&cruft_po_args->window_memory, var, value);
+	if (!strcmp(var, "repack.cruftdepth"))
+		return git_config_string(&cruft_po_args->depth, var, value);
+	if (!strcmp(var, "repack.cruftthreads"))
+		return git_config_string(&cruft_po_args->threads, var, value);
 	return git_default_config(var, value, cb);
 }
 
@@ -157,18 +177,6 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	strbuf_release(&buf);
 }
 
-struct pack_objects_args {
-	const char *window;
-	const char *window_memory;
-	const char *depth;
-	const char *threads;
-	const char *max_pack_size;
-	int no_reuse_delta;
-	int no_reuse_object;
-	int quiet;
-	int local;
-};
-
 static void prepare_pack_objects(struct child_process *cmd,
 				 const struct pack_objects_args *args)
 {
@@ -692,6 +700,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int keep_unreachable = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	struct pack_objects_args po_args = {NULL};
+	struct pack_objects_args cruft_po_args = {NULL};
 	int geometric_factor = 0;
 	int write_midx = 0;
 
@@ -746,7 +755,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
-	git_config(repack_config, NULL);
+	git_config(repack_config, &cruft_po_args);
 
 	argc = parse_options(argc, argv, prefix, builtin_repack_options,
 				git_repack_usage, 0);
@@ -921,7 +930,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (*pack_prefix == '/')
 			pack_prefix++;
 
-		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+		if (!cruft_po_args.window)
+			cruft_po_args.window = po_args.window;
+		if (!cruft_po_args.window_memory)
+			cruft_po_args.window_memory = po_args.window_memory;
+		if (!cruft_po_args.depth)
+			cruft_po_args.depth = po_args.depth;
+		if (!cruft_po_args.threads)
+			cruft_po_args.threads = po_args.threads;
+
+		cruft_po_args.local = po_args.local;
+		cruft_po_args.quiet = po_args.quiet;
+
+		ret = write_cruft_pack(&cruft_po_args, pack_prefix, &names,
 				       &existing_nonkept_packs,
 				       &existing_kept_packs);
 		if (ret)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 067c50af38..c82f973b41 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -565,4 +565,87 @@ test_expect_success 'cruft repack ignores pack.packSizeLimit' '
 	)
 '
 
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=1 -c repack.cruftWindow=2 repack \
+		       --cruft --window=3 &&
+
+		grep "pack-objects.*--window=2.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+
+		GIT_TRACE2_EVENT=$(pwd)/event.trace \
+		git -c pack.window=2 repack --cruft --window=3 &&
+
+		grep "pack-objects.*--window=3.*--cruft" event.trace
+	)
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+		test_must_be_empty err
+	)
+'
+
+test_expect_success 'cruft --local drops unreachable objects' '
+	git init alternate &&
+	git init repo &&
+	test_when_finished "rm -fr alternate repo" &&
+
+	test_commit -C alternate base &&
+	# Pack all objects in alternate so that the cruft repack in "repo" sees
+	# the object it dropped due to `--local` as packed. Otherwise this
+	# object would not appear packed anywhere (since it is not packed in
+	# alternate and likewise not part of the cruft pack in the other repo
+	# because of `--local`).
+	git -C alternate repack -ad &&
+
+	(
+		cd repo &&
+
+		object="$(git -C ../alternate rev-parse HEAD:base.t)" &&
+		git -C ../alternate cat-file -p $object >contents &&
+
+		# Write some reachable objects and two unreachable ones: one
+		# that the alternate has and another that is unique.
+		test_commit other &&
+		git hash-object -w -t blob contents &&
+		cruft="$(echo cruft | git hash-object -w -t blob --stdin)" &&
+
+		( cd ../alternate/.git/objects && pwd ) \
+		       >.git/objects/info/alternates &&
+
+		test_path_is_file $objdir/$(test_oid_to_path $cruft) &&
+		test_path_is_file $objdir/$(test_oid_to_path $object) &&
+
+		git repack -d --cruft --local &&
+
+		test-tool pack-mtimes "$(basename $(ls $packdir/pack-*.mtimes))" \
+		       >objects &&
+		! grep $object objects &&
+		grep $cruft objects
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 14/17] builtin/repack.c: use named flags for existing_packs
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (12 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

We use the `util` pointer for items in the `existing_packs` string list
to indicate which packs are going to be deleted. Since that has so far
been the only use of that `util` pointer, we just set it to 0 or 1.

But we're going to add an additional state to this field in the next
patch, so prepare for that by adding a #define for the first bit so we
can more expressively inspect the flags state.
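
As a rough sketch of the idea, storing flag bits in a `void *` slot
looks like the following. Only `DELETE_PACK` comes from this patch;
`struct pack_item` is a hypothetical stand-in for
`struct string_list_item`, `OTHER_FLAG` is a placeholder for whatever
second bit gets added later, and both helper names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
/* Placeholder for a second bit; hypothetical, not from this patch. */
#define OTHER_FLAG 2

/* Hypothetical stand-in for struct string_list_item. */
struct pack_item {
	const char *string;
	void *util;
};

/* Set DELETE_PACK without disturbing any other bits stored in util. */
static void mark_for_deletion(struct pack_item *item)
{
	assert(item != NULL);
	item->util = (void *)((uintptr_t)item->util | DELETE_PACK);
}

/* Test only the DELETE_PACK bit rather than the whole pointer value. */
static int is_marked_for_deletion(const struct pack_item *item)
{
	assert(item != NULL);
	return ((uintptr_t)item->util & DELETE_PACK) != 0;
}
```

Testing the single bit is what keeps a plain truthiness check such as
`if (item->util)` from misfiring once a second flag shares the field.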

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index b85483a148..36d1f03671 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -22,6 +22,8 @@
 #define LOOSEN_UNREACHABLE 2
 #define PACK_CRUFT 4
 
+#define DELETE_PACK 1
+
 static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -564,7 +566,7 @@ static void midx_included_packs(struct string_list *include,
 		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
-			if (item->util)
+			if ((uintptr_t)item->util & DELETE_PACK)
 				continue;
 			string_list_insert(include, xstrfmt("%s.idx", item->string));
 		}
@@ -1002,7 +1004,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			 * was given) and that we will actually delete this pack
 			 * (if `-d` was given).
 			 */
-			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+			if (!string_list_has_string(&names, sha1))
+				item->util = (void*)((uintptr_t)item->util | DELETE_PACK);
 		}
 	}
 
@@ -1026,7 +1029,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (delete_redundant) {
 		int opts = 0;
 		for_each_string_list_item(item, &existing_nonkept_packs) {
-			if (!item->util)
+			if (!((uintptr_t)item->util & DELETE_PACK))
 				continue;
 			remove_redundant_pack(packdir, item->string);
 		}
-- 
2.36.1.94.gb0d54bedca



* [PATCH v5 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (13 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-20 23:18   ` [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

When using cruft packs, the following race can occur when a geometric
repack that writes a MIDX bitmap takes place afterwards:

  - First, create an unreachable object and do an all-into-one cruft
    repack which stores that object in the repository's cruft pack.
  - Then make that object reachable.
  - Finally, do a geometric repack and write a MIDX bitmap.

Assuming that we are sufficiently unlucky as to select a commit from the
MIDX which reaches that object for bitmapping, then the `git
multi-pack-index` process will complain that that object is missing.

The reason is that we don't include cruft packs in the MIDX when
doing a geometric repack. Since the "make that object reachable" step
doesn't necessarily mean that we'll create a new copy of that object in
one of the packs that will get rolled up as part of a geometric repack,
it's possible that the MIDX won't see any copies of that now-reachable
object.

Of course, it's desirable to avoid including cruft packs in the MIDX
because it causes the MIDX to store a bunch of objects which are likely
to get thrown away. But excluding that pack does open us up to the above
race.

This patch demonstrates the bug, and resolves it by including cruft
packs in the MIDX even when doing a geometric repack.
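
The selection rule for pre-existing packs can be sketched as below. It
covers only the loops over `existing_nonkept_packs`; the newly written
packs and the packs retained by the geometric split, which the real
midx_included_packs() also adds, are left out. `struct pack_item` and
`select_existing_for_midx()` are hypothetical names, and the caller is
assumed to provide an `out` array with room for `nr` entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DELETE_PACK 1
#define CRUFT_PACK 2

/* Hypothetical stand-in for struct string_list_item. */
struct pack_item {
	const char *name;
	void *util;
};

/*
 * Hypothetical helper: choose which pre-existing, non-kept packs the
 * MIDX should cover, and return how many were written to out[].
 *
 * - geometric repack: only cruft packs are picked up here
 * - otherwise: every pack that is not scheduled for deletion
 */
static size_t select_existing_for_midx(const struct pack_item *packs,
					size_t nr, int geometric,
					const char **out)
{
	size_t i, n = 0;

	assert(packs != NULL && out != NULL);

	for (i = 0; i < nr; i++) {
		uintptr_t flags = (uintptr_t)packs[i].util;

		if (geometric) {
			if (!(flags & CRUFT_PACK))
				continue;
		} else if (flags & DELETE_PACK) {
			continue;
		}
		out[n++] = packs[i].name;
	}
	return n;
}
```

The geometric branch does not look at DELETE_PACK, matching the comment
in the diff below that the bit is only set for ALL_INTO_ONE repacks.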

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c              | 23 ++++++++++++++++++++---
 t/t5329-pack-objects-cruft.sh | 26 ++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 36d1f03671..15071fadbe 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -23,6 +23,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define CRUFT_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -159,10 +160,15 @@ static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 		fname = xmemdupz(e->d_name, len);
 
 		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
-		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
+		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
 			string_list_append_nodup(fname_kept_list, fname);
-		else
-			string_list_append_nodup(fname_nonkept_list, fname);
+		} else {
+			struct string_list_item *item;
+			item = string_list_append_nodup(fname_nonkept_list,
+							fname);
+			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
+				item->util = (void*)(uintptr_t)CRUFT_PACK;
+		}
 	}
 	closedir(dir);
 }
@@ -564,6 +570,17 @@ static void midx_included_packs(struct string_list *include,
 
 			string_list_insert(include, strbuf_detach(&buf, NULL));
 		}
+
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (!((uintptr_t)item->util & CRUFT_PACK)) {
+				/*
+				 * no need to check DELETE_PACK, since we're not
+				 * doing an ALL_INTO_ONE repack
+				 */
+				continue;
+			}
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
 	} else {
 		for_each_string_list_item(item, existing_nonkept_packs) {
 			if ((uintptr_t)item->util & DELETE_PACK)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index c82f973b41..8de87afce2 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -648,4 +648,30 @@ test_expect_success 'cruft --local drops unreachable objects' '
 	)
 '
 
+test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		test_commit cruft &&
+		unreachable="$(git rev-parse cruft)" &&
+
+		git reset --hard $unreachable^ &&
+		git tag -d cruft &&
+		git reflog expire --all --expire=all &&
+
+		git repack --cruft -d &&
+
+		# resurrect the unreachable object via a new commit. the
+		# new commit will get selected for a bitmap, but be
+		# missing one of its parents from the selected packs.
+		git reset --hard $unreachable &&
+		test_commit resurrect &&
+
+		git repack --write-midx --write-bitmap-index --geometric=2 -d
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca


* [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (14 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-06-19  5:38     ` René Scharfe
  2022-05-20 23:18   ` [PATCH v5 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
  2022-05-21 11:17   ` [PATCH v5 00/17] " Ævar Arnfjörð Bjarmason
  17 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

Expose the new `git repack --cruft` mode from `git gc` via a new opt-in
flag. When invoked like `git gc --cruft`, `git gc` will avoid exploding
unreachable objects as loose ones, and instead create a cruft pack and
`.mtimes` file.
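
To make the resulting behavior easier to see at a glance, here is a
simplified sketch of which `git repack` arguments get chosen. The helper
`gc_repack_args()` is hypothetical (the real code pushes onto a strvec and
also handles kept packs, which is ignored here); the branch order follows
the builtin/gc.c hunk below.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of git: format the repack arguments that
 * would be selected for a given prune expiry and cruft setting.
 * Returns the snprintf() result.
 */
static int gc_repack_args(char *buf, size_t len,
			  const char *prune_expire, int cruft_packs)
{
	/* Pruning everything right away: nothing unreachable is kept. */
	if (prune_expire && !strcmp(prune_expire, "now"))
		return snprintf(buf, len, "-a");

	/* Cruft mode: unreachable objects go into a cruft pack. */
	if (cruft_packs) {
		if (prune_expire)
			return snprintf(buf, len,
					"--cruft --cruft-expiration=%s",
					prune_expire);
		return snprintf(buf, len, "--cruft");
	}

	/* Default mode: unreachable objects are exploded as loose. */
	if (prune_expire)
		return snprintf(buf, len, "-A --unpack-unreachable=%s",
				prune_expire);
	return snprintf(buf, len, "-A");
}
```

Note that in this sketch, as in the hunk, an expiry of "now" takes
precedence over cruft mode.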

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/gc.txt   | 21 +++++++++++++-------
 Documentation/git-gc.txt      |  5 +++++
 builtin/gc.c                  | 10 +++++++++-
 t/t5329-pack-objects-cruft.sh | 37 +++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index c834e07991..38fea076a2 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -81,14 +81,21 @@ gc.packRefs::
 	to enable it within all non-bare repos or it can be set to a
 	boolean value.  The default is `true`.
 
+gc.cruftPacks::
+	Store unreachable objects in a cruft pack (see
+	linkgit:git-repack[1]) instead of as loose objects. The default
+	is `false`.
+
 gc.pruneExpire::
-	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'.
-	Override the grace period with this config variable.  The value
-	"now" may be used to disable this grace period and always prune
-	unreachable objects immediately, or "never" may be used to
-	suppress pruning.  This feature helps prevent corruption when
-	'git gc' runs concurrently with another process writing to the
-	repository; see the "NOTES" section of linkgit:git-gc[1].
+	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
+	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using
+	cruft packs via `gc.cruftPacks` or `--cruft`).  Override the
+	grace period with this config variable.  The value "now" may be
+	used to disable this grace period and always prune unreachable
+	objects immediately, or "never" may be used to suppress pruning.
+	This feature helps prevent corruption when 'git gc' runs
+	concurrently with another process writing to the repository; see
+	the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 853967dea0..ba4e67700e 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
 be performed as well.
 
 
+--cruft::
+	When expiring unreachable objects, pack them separately
+	into a cruft pack instead of storing them as loose
+	objects.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
diff --git a/builtin/gc.c b/builtin/gc.c
index b335cffa33..4d995e85e9 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -42,6 +42,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
+static int cruft_packs = 0;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -152,6 +153,7 @@ static void gc_config(void)
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.cruftpacks", &cruft_packs);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -331,7 +333,11 @@ static void add_repack_all_option(struct string_list *keep_pack)
 {
 	if (prune_expire && !strcmp(prune_expire, "now"))
 		strvec_push(&repack, "-a");
-	else {
+	else if (cruft_packs) {
+		strvec_push(&repack, "--cruft");
+		if (prune_expire)
+			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
 			strvec_pushf(&repack, "--unpack-unreachable=%s", prune_expire);
@@ -551,6 +557,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 0, "prune", &prune_expire, N_("date"),
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
+		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -670,6 +677,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			die(FAILED_RUN, repack.v[0]);
 
 		if (prune_expire) {
+			/* run `git prune` even if using cruft packs */
 			strvec_push(&prune, prune_expire);
 			if (quiet)
 				strvec_push(&prune, "--no-progress");
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 8de87afce2..70a6a9553c 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -429,6 +429,43 @@ test_expect_success 'loose objects mtimes upsert others' '
 	)
 '
 
+test_expect_success 'expiring cruft objects with git gc' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		git reflog expire --all --expire=all &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		mtimes=$(ls .git/objects/pack/pack-*.mtimes) &&
+		test_path_is_file $mtimes &&
+
+		git gc --cruft --prune=now &&
+
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+
+		comm -23 unreachable objects >removed &&
+		test_cmp unreachable removed &&
+		test_path_is_missing $mtimes
+	)
+'
+
 test_expect_success 'cruft packs are not included in geometric repack' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.36.1.94.gb0d54bedca


* [PATCH v5 17/17] sha1-file.c: don't freshen cruft packs
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (15 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-05-20 23:18   ` Taylor Blau
  2022-05-21 11:17   ` [PATCH v5 00/17] " Ævar Arnfjörð Bjarmason
  17 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:18 UTC (permalink / raw)
  To: git; +Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

We don't bother to freshen objects stored in a cruft pack individually
by updating the `.mtimes` file. This is because we can't portably `mmap`
and write into the middle of a file (i.e., to update the mtime of just
one object). Instead, we would have to rewrite the entire `.mtimes` file,
which may incur some wasted effort, especially if there are a lot of cruft
objects and they are freshened infrequently.

Instead, make the freshening code skip its write-avoiding optimization
for objects in a cruft pack, so that the object is written out loose and
picks up a current mtime.

This works because we prefer the mtime of the loose copy of an object
when both a loose and packed one exist (whether or not the packed copy
comes from a cruft pack).
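
The decision can be sketched as follows. This is an illustration under
stated assumptions, not the code in object-file.c: `struct pack_sketch` is
a trimmed-down, hypothetical stand-in for `struct packed_git`, a NULL
pointer stands for "object not found in any pack", and the `touch`
callback stands in for freshen_file().

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct packed_git. */
struct pack_sketch {
	unsigned is_cruft:1,
		 freshened:1;
};

/* Hypothetical callback used for illustration: pretend the touch worked. */
static int touch_always_ok(void)
{
	return 1;
}

/*
 * Returns 1 when the packed copy counts as fresh, so the caller may skip
 * writing the object; returns 0 when the caller has to write it loose.
 */
static int freshen_packed_sketch(struct pack_sketch *p, int (*touch)(void))
{
	if (!p)
		return 0;	/* not in any pack */
	if (p->is_cruft)
		return 0;	/* never freshen a cruft pack in place */
	if (p->freshened)
		return 1;	/* this pack was already touched */
	if (!touch())
		return 0;	/* could not update the pack's mtime */
	p->freshened = 1;
	return 1;
}
```

Returning 0 for a cruft pack is what pushes the caller down the "write a
loose copy" path, which then carries the current mtime.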

This could certainly do with a test and/or be included earlier in this
series/PR, but I want to wait until after I have a chance to clean up
the overly-repetitive nature of the cruft pack tests in general.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c                 |  2 ++
 t/t5329-pack-objects-cruft.sh | 25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/object-file.c b/object-file.c
index ff0cffe68e..495a359200 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2035,6 +2035,8 @@ static int freshen_packed_object(const struct object_id *oid)
 	struct pack_entry e;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
+	if (e.p->is_cruft)
+		return 0;
 	if (e.p->freshened)
 		return 1;
 	if (!freshen_file(e.p->pack_name))
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 70a6a9553c..b481224b93 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -711,4 +711,29 @@ test_expect_success 'MIDX bitmaps tolerate reachable cruft objects' '
 	)
 '
 
+test_expect_success 'cruft objects are freshened via loose' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		echo "cruft" >contents &&
+		blob="$(git hash-object -w -t blob contents)" &&
+		loose="$objdir/$(test_oid_to_path $blob)" &&
+
+		test_commit base &&
+
+		git repack --cruft -d &&
+
+		test_path_is_missing "$loose" &&
+		test-tool pack-mtimes "$(basename "$(ls $packdir/pack-*.mtimes)")" >cruft &&
+		grep "$blob" cruft &&
+
+		# write the same object again
+		git hash-object -w -t blob contents &&
+
+		test_path_is_file "$loose"
+	)
+'
+
 test_done
-- 
2.36.1.94.gb0d54bedca

* Re: [PATCH v4 00/17] cruft packs
  2022-05-18 23:48   ` [PATCH v4 00/17] " Derrick Stolee
@ 2022-05-20 23:19     ` Junio C Hamano
  2022-05-20 23:30       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Junio C Hamano @ 2022-05-20 23:19 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, avarab, jrnieder, larsxschneider, tytso

Derrick Stolee <derrickstolee@github.com> writes:

>>   - updating the `finalize_hashfile()` calls for writing `.mtimes` files
>>     to indicate that they are `FSYNC_COMPONENT_PACK_METADATA`, since the
>>     original version of this series predates the fine-grained fsync
>>     configuration in 2.36.
>
> Good to have this update and not require it to be handled at merge
> time by the maintainer.

Heh, my rerere database is good enough to make it a non-issue ;-)

>> As always, a range-diff is below. Thanks in advance for taking another
>> look!
>
> Looking at the range-diff, I'm happy with this version.

Thanks.  I am tempted to mark the topic as "expecting (hopefully the
final) reroll", to be merged down to 'next' soonish.



* Re: [PATCH v4 00/17] cruft packs
  2022-05-20 23:19     ` Junio C Hamano
@ 2022-05-20 23:30       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-20 23:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Taylor Blau, git, avarab, jrnieder,
	larsxschneider, tytso

On Fri, May 20, 2022 at 04:19:15PM -0700, Junio C Hamano wrote:
> >> As always, a range-diff is below. Thanks in advance for taking another
> >> look!
> >
> > Looking at the range-diff, I'm happy with this version.
>
> Thanks.  I am tempted to mark the topic as "expecting (hopefully the
> final) reroll", to be merged down to 'next' soonish.

Here it is:

    https://lore.kernel.org/git/cover.1653088640.git.me@ttaylorr.com/.

Thanks,
Taylor

* Re: [PATCH v5 00/17] cruft packs
  2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
                     ` (16 preceding siblings ...)
  2022-05-20 23:18   ` [PATCH v5 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
@ 2022-05-21 11:17   ` Ævar Arnfjörð Bjarmason
  2022-05-24 19:39     ` Jonathan Nieder
  17 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-21 11:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, derrickstolee, gitster, jrnieder, larsxschneider, tytso


On Fri, May 20 2022, Taylor Blau wrote:

>   - The new section in pack-format.txt (describing the ".mtimes" format) now
>     says at the top "all 4-byte numbers are in network byte order", and avoids
>     repeating "network [byte] order" throughout that section to reduce
>     confusion.

Suggestion (outside this series): since that's fixed, perhaps a small
stand-alone patch to fix the existing "network order" occurrences in the
same file?

> ...and that's pretty much it. In any case, a range-diff is included below.
> Thanks again for all of the thoughtful feedback on this series.

It seems this didn't make it on-list, but for the last round (well, it
seems I replied to v1 by accident) I sent this (at
https://lore.kernel.org/git/220519.86ilq14u1a.gmgdl@evledraar.gmail.com/
if it eventually shows up):
	
	Return-Path: <avarab@gmail.com>
	Received: from gmgdl (dhcp-077-248-183-071.chello.nl. [77.248.183.71])
	        by smtp.gmail.com with ESMTPSA id en21-20020a17090728d500b006fa9820b4a2sm1979775ejc.165.2022.05.19.04.54.26
	        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
	        Thu, 19 May 2022 04:54:26 -0700 (PDT)
	Received: from avar by gmgdl with local (Exim 4.95)
		(envelope-from <avarab@gmail.com>)
		id 1nrejZ-0026V8-AE;
		Thu, 19 May 2022 13:54:25 +0200
	From: =?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com>
	To: Taylor Blau <me@ttaylorr.com>
	Cc: git@vger.kernel.org, gitster@pobox.com, larsxschneider@gmail.com,
	 peff@peff.net, tytso@mit.edu, brian m. carlson <bk2204@github.com>
	Subject: Re: [PATCH 05/17] pack-mtimes: support writing pack .mtimes files
	Date: Thu, 19 May 2022 13:48:42 +0200
	References: <cover.1638224692.git.me@ttaylorr.com>
	 <deece9eb70e9750bb8350946679b521e59139fe2.1638224692.git.me@ttaylorr.com>
	User-agent: Debian GNU/Linux bookworm/sid; Emacs 27.1; mu4e 1.7.12
	In-reply-to: <deece9eb70e9750bb8350946679b521e59139fe2.1638224692.git.me@ttaylorr.com>
	Message-ID: <220519.86ilq14u1a.gmgdl@evledraar.gmail.com>
	MIME-Version: 1.0
	Content-Type: text/plain
	X-TUID: ta0yc5HgmrrD
	
	
	On Mon, Nov 29 2021, Taylor Blau wrote:
	
	> +static void write_mtimes_header(struct hashfile *f)
	> +{
	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
	> +	hashwrite_be32(f, MTIMES_VERSION);
	> +	hashwrite_be32(f, oid_version(the_hash_algo));
	> +}
	
	Given the history noted in
	https://lore.kernel.org/git/RFC-patch-2.2-051f0612ab9-20220519T113538Z-avarab@gmail.com/
	maybe we can say this ship has just sailed at this point.
	
	But since this is a new format I think it's worth considering not using
	the 1 or 2 you get from oid_version(), but the "format_id",
	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
	
	You'll use the same space in the format for it, but we'll end up with
	something more obvious (as the integer encodes the sha1 or sha256 name).
	
	AFAICT this code is just copied from your earlier work on *.rev, which
	in turn seems copied from earlier work on midx & commit-graph, which
	seems to have used this way of referring to the hash version more as an
	accident than anything explicitly intended...
	
	Then again we could just say that both are equally valid at this point,
	especially given the use in adjacent formats.

I.e. do we think continuing to use 1 vs. 2 in new formats instead of
0x73686131 and 0x73323536 is the right choice?

Other than that the only question I have (I think) on this series is if
Jonathan Nieder is happy with it. I looked back in my logs and there was
an extensive on-IRC discussion about it at the end of March, which ended
in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/

But it seems Jonathan didn't chime in since then, and he had some major
issues with the approach here. I think those should have been addressed
by that discussion, but it would be nice to get a confirmation.
	
> Taylor Blau (17):
>   Documentation/technical: add cruft-packs.txt
>   pack-mtimes: support reading .mtimes files
>   pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
>   chunk-format.h: extract oid_version()
>   pack-mtimes: support writing pack .mtimes files
>   t/helper: add 'pack-mtimes' test-tool
>   builtin/pack-objects.c: return from create_object_entry()
>   builtin/pack-objects.c: --cruft without expiration
>   reachable: add options to add_unseen_recent_objects_to_traversal
>   reachable: report precise timestamps from objects in cruft packs
>   builtin/pack-objects.c: --cruft with expiration
>   builtin/repack.c: support generating a cruft pack
>   builtin/repack.c: allow configuring cruft pack generation
>   builtin/repack.c: use named flags for existing_packs
>   builtin/repack.c: add cruft packs to MIDX during geometric repack
>   builtin/gc.c: conditionally avoid pruning objects via loose
>   sha1-file.c: don't freshen cruft packs
>
>  Documentation/Makefile                  |   1 +
>  Documentation/config/gc.txt             |  21 +-
>  Documentation/config/repack.txt         |   9 +
>  Documentation/git-gc.txt                |   5 +
>  Documentation/git-pack-objects.txt      |  30 +
>  Documentation/git-repack.txt            |  11 +
>  Documentation/technical/cruft-packs.txt | 123 ++++
>  Documentation/technical/pack-format.txt |  19 +
>  Makefile                                |   2 +
>  builtin/gc.c                            |  10 +-
>  builtin/pack-objects.c                  | 304 +++++++++-
>  builtin/repack.c                        | 185 +++++-
>  bulk-checkin.c                          |   2 +-
>  chunk-format.c                          |  12 +
>  chunk-format.h                          |   3 +
>  commit-graph.c                          |  18 +-
>  midx.c                                  |  18 +-
>  object-file.c                           |   4 +-
>  object-store.h                          |   7 +-
>  pack-mtimes.c                           | 126 ++++
>  pack-mtimes.h                           |  15 +
>  pack-objects.c                          |   6 +
>  pack-objects.h                          |  25 +
>  pack-write.c                            |  93 ++-
>  pack.h                                  |   4 +
>  packfile.c                              |  19 +-
>  reachable.c                             |  58 +-
>  reachable.h                             |   9 +-
>  t/helper/test-pack-mtimes.c             |  56 ++
>  t/helper/test-tool.c                    |   1 +
>  t/helper/test-tool.h                    |   1 +
>  t/t5329-pack-objects-cruft.sh           | 739 ++++++++++++++++++++++++
>  32 files changed, 1834 insertions(+), 102 deletions(-)
>  create mode 100644 Documentation/technical/cruft-packs.txt
>  create mode 100644 pack-mtimes.c
>  create mode 100644 pack-mtimes.h
>  create mode 100644 t/helper/test-pack-mtimes.c
>  create mode 100755 t/t5329-pack-objects-cruft.sh
>
> Range-diff against v4:
>  -:  ---------- >  1:  f494ef7377 Documentation/technical: add cruft-packs.txt
>  1:  8f9fd21be9 !  2:  91a9d21b0b pack-mtimes: support reading .mtimes files
>     @@ Documentation/technical/pack-format.txt: Pack file entry: <+
>       
>      +== pack-*.mtimes files have the format:
>      +
>     ++All 4-byte numbers are in network byte order.
>     ++
>      +  - A 4-byte magic number '0x4d544d45' ('MTME').
>      +
>      +  - A 4-byte version identifier (= 1).
>      +
>      +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>      +
>     -+  - A table of 4-byte unsigned integers in network order. The ith
>     -+    value is the modification time (mtime) of the ith object in the
>     -+    corresponding pack by lexicographic (index) order. The mtimes
>     -+    count standard epoch seconds.
>     ++  - A table of 4-byte unsigned integers. The ith value is the
>     ++    modification time (mtime) of the ith object in the corresponding
>     ++    pack by lexicographic (index) order. The mtimes count standard
>     ++    epoch seconds.
>      +
>      +  - A trailer, containing a checksum of the corresponding packfile,
>      +    and a checksum of all of the above (each having length according
>      +    to the specified hash function).
>     -+
>     -+All 4-byte numbers are in network order.
>      +
>       == multi-pack-index (MIDX) files have the following format:
>       
>  2:  cdb21236e1 =  3:  67c4e7209d pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
>  3:  1d775f9850 =  4:  fc86506881 chunk-format.h: extract oid_version()
>  4:  6172861bd9 =  5:  788d1f96f2 pack-mtimes: support writing pack .mtimes files
>  5:  5f9a9a5b7b =  6:  2a6cfb00bf t/helper: add 'pack-mtimes' test-tool
>  6:  b8a38fe2e4 =  7:  edb6fcd5ec builtin/pack-objects.c: return from create_object_entry()
>  7:  94fe03cc65 =  8:  e3185741f2 builtin/pack-objects.c: --cruft without expiration
>  8:  da7273f41f =  9:  1cf00d462c reachable: add options to add_unseen_recent_objects_to_traversal
>  9:  58fecd1747 = 10:  d66be44d9a reachable: report precise timestamps from objects in cruft packs
> 10:  1740b8ef01 = 11:  1434e37623 builtin/pack-objects.c: --cruft with expiration
> 11:  5992a72cbf ! 12:  0d3555d595 builtin/repack.c: support generating a cruft pack
>     @@ t/t5329-pack-objects-cruft.sh: test_expect_success 'expired objects are pruned'
>      +		git repack &&
>      +
>      +		tip="$(git rev-parse cruft)" &&
>     -+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
>     ++		path="$objdir/$(test_oid_to_path "$tip")" &&
>      +		test-tool chmtime --get +1000 "$path" >expect &&
>      +
>      +		git checkout main &&
> 12:  1b241f8f91 = 13:  4b721d3ee9 builtin/repack.c: allow configuring cruft pack generation
> 13:  ffae78852c = 14:  f9e3ab56b1 builtin/repack.c: use named flags for existing_packs
> 14:  0743e373ba ! 15:  e9f46e7b5e builtin/repack.c: add cruft packs to MIDX during geometric repack
>     @@ builtin/repack.c
>       static int pack_everything;
>       static int delta_base_offset = 1;
>      @@ builtin/repack.c: static void collect_pack_filenames(struct string_list *fname_nonkept_list,
>     + 		fname = xmemdupz(e->d_name, len);
>     + 
>       		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
>     - 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
>     +-		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
>     ++		    (file_exists(mkpath("%s/%s.keep", packdir, fname)))) {
>       			string_list_append_nodup(fname_kept_list, fname);
>      -		else
>      -			string_list_append_nodup(fname_nonkept_list, fname);
>     -+		else {
>     -+			struct string_list_item *item = string_list_append_nodup(fname_nonkept_list, fname);
>     ++		} else {
>     ++			struct string_list_item *item;
>     ++			item = string_list_append_nodup(fname_nonkept_list,
>     ++							fname);
>      +			if (file_exists(mkpath("%s/%s.mtimes", packdir, fname)))
>      +				item->util = (void*)(uintptr_t)CRUFT_PACK;
>      +		}
> 15:  9f7e0acac6 = 16:  43c14eec07 builtin/gc.c: conditionally avoid pruning objects via loose
> 16:  07fa9d4b47 = 17:  1e313b89e8 sha1-file.c: don't freshen cruft packs


* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-24 19:32     ` Jonathan Nieder
  2022-05-24 19:44       ` rsbecker
  2022-05-24 22:21       ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
  2022-05-25 23:02     ` Taylor Blau
  2023-06-01 13:01     ` Andreas Schwab
  2 siblings, 2 replies; 201+ messages in thread
From: Jonathan Nieder @ 2022-05-24 19:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, gitster, larsxschneider, tytso

Hi,

Taylor Blau wrote:

> This patch prepares for cruft packs by defining the `.mtimes` format,
> and introducing a basic API that callers can use to read out individual
> mtimes.

Makes sense.  Does this intend to produce any functional change?  I'm
guessing not (and the lack of tests agrees), but the commit message
doesn't say so.

By the way, is this something we could cover in tests, e.g. using a
test helper that exercises the new code?

[...]
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -294,6 +294,25 @@ Pack file entry: <+
>  
>  All 4-byte numbers are in network order.
>  
> +== pack-*.mtimes files have the format:
> +
> +All 4-byte numbers are in network byte order.
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').
> +
> +  - A 4-byte version identifier (= 1).
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> +
> +  - A table of 4-byte unsigned integers. The ith value is the
> +    modification time (mtime) of the ith object in the corresponding
> +    pack by lexicographic (index) order. The mtimes count standard
> +    epoch seconds.
> +
> +  - A trailer, containing a checksum of the corresponding packfile,
> +    and a checksum of all of the above (each having length according
> +    to the specified hash function).
> +

This describes the "syntax" but not the "semantics" of the file.
Should I look to a separate piece of documentation for the semantics?
If so, can this one include a mention of that piece of documentation
to make it easier to find?
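
For readers following along, the layout quoted above can be sketched as a
small bounds-checked reader. This is only an illustration, not the code in
pack-mtimes.c: `mtimes_lookup()` and `read_be32()` are hypothetical names,
and deriving the trailer size from the hash identifier in the header (20
bytes for SHA-1, 32 for SHA-256) is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MTIMES_SIGNATURE 0x4d544d45u	/* "MTME" */
#define SKETCH_MTIMES_VERSION 1
#define SKETCH_MTIMES_HEADER_SIZE 12

/* Read a 4-byte integer in network (big-endian) byte order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: validate a whole .mtimes file held in memory and
 * fetch the mtime of the object at index position `pos`.
 * Returns 0 on success and -1 on any mismatch.
 */
static int mtimes_lookup(const unsigned char *buf, size_t len,
			 uint32_t nr_objects, uint32_t pos, uint32_t *mtime)
{
	size_t hash_len, table_len;
	uint32_t hash_id;

	if (len < SKETCH_MTIMES_HEADER_SIZE)
		return -1;
	if (read_be32(buf) != SKETCH_MTIMES_SIGNATURE)
		return -1;
	if (read_be32(buf + 4) != SKETCH_MTIMES_VERSION)
		return -1;

	hash_id = read_be32(buf + 8);
	if (hash_id == 1)
		hash_len = 20;	/* SHA-1 */
	else if (hash_id == 2)
		hash_len = 32;	/* SHA-256 */
	else
		return -1;

	/* The trailer holds two checksums: the pack's and this file's. */
	if (len < SKETCH_MTIMES_HEADER_SIZE + 2 * hash_len)
		return -1;
	table_len = len - SKETCH_MTIMES_HEADER_SIZE - 2 * hash_len;
	if (table_len % 4 || table_len / 4 != nr_objects)
		return -1;
	if (pos >= nr_objects)
		return -1;

	*mtime = read_be32(buf + SKETCH_MTIMES_HEADER_SIZE + 4 * (size_t)pos);
	return 0;
}
```

The sketch does not verify either checksum in the trailer; it only checks
that the file size is consistent with the object count.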

[...]
> --- a/object-store.h
> +++ b/object-store.h
> @@ -115,12 +115,15 @@ struct packed_git {
>  		 freshened:1,
>  		 do_not_close:1,
>  		 pack_promisor:1,
> -		 multi_pack_index:1;
> +		 multi_pack_index:1,
> +		 is_cruft:1;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	struct revindex_entry *revindex;
>  	const uint32_t *revindex_data;
>  	const uint32_t *revindex_map;
>  	size_t revindex_size;
> +	const uint32_t *mtimes_map;
> +	size_t mtimes_size;

What does mtimes_map contain?  A comment would help.


> --- /dev/null
> +++ b/pack-mtimes.c
> @@ -0,0 +1,126 @@
> +#include "pack-mtimes.h"
> +#include "object-store.h"
> +#include "packfile.h"

Missing #include of git-compat-util.h.

> +
> +static char *pack_mtimes_filename(struct packed_git *p)
> +{
> +	size_t len;
> +	if (!strip_suffix(p->pack_name, ".pack", &len))
> +		BUG("pack_name does not end in .pack");
> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> +}

This seems simple enough that it's not obvious we need more code
sharing.  Do you agree?  If so, I'd suggest just removing the
NEEDSWORK comment.

> +
> +#define MTIMES_HEADER_SIZE (12)
> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))

Hm, the all-caps name makes this feel like a compile-time constant but
it contains a reference to the_hash_algo.  Could it be an inline
function instead?
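
Something along these lines, as a rough sketch. `struct hash_algo_sketch`
is a hypothetical stand-in carrying only a raw hash size, so the example
stays self-contained; in git the size would come from the_hash_algo.

```c
#include <stddef.h>

#define SKETCH_MTIMES_HEADER_SIZE 12

/* Hypothetical stand-in for the hash algorithm description. */
struct hash_algo_sketch {
	size_t rawsz;	/* raw hash length in bytes */
};

/*
 * Smallest valid .mtimes file: the fixed header plus the two trailing
 * checksums, with an empty mtime table in between.
 */
static inline size_t mtimes_min_size(const struct hash_algo_sketch *algo)
{
	return SKETCH_MTIMES_HEADER_SIZE + 2 * algo->rawsz;
}
```

With a 20-byte hash this gives 52 bytes, and with a 32-byte hash 76,
and the dependency on the hash algorithm is visible at each call site.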

> +
> +struct mtimes_header {
> +	uint32_t signature;
> +	uint32_t version;
> +	uint32_t hash_id;
> +};
> +
> +static int load_pack_mtimes_file(char *mtimes_file,
> +				 uint32_t num_objects,
> +				 const uint32_t **data_p, size_t *len_p)

What does this function do?  A comment would help.

> +{
> +	int fd, ret = 0;
> +	struct stat st;
> +	void *data = NULL;
> +	size_t mtimes_size;
> +	struct mtimes_header header;
> +	uint32_t *hdr;
> +
> +	fd = git_open(mtimes_file);
> +
> +	if (fd < 0) {

nit: this would be more readable without the blank line between
setting and checking fd (likewise for the other examples below).
> +		ret = -1;
> +		goto cleanup;
> +	}

[...]
> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {

This presupposes that the hash_id matches the_hash_algo.  Maybe worth
a NEEDSWORK comment.

[...]
> +cleanup:
> +	if (ret) {
> +		if (data)
> +			munmap(data, mtimes_size);
> +	} else {
> +		*len_p = mtimes_size;
> +		*data_p = (const uint32_t *)data;

Do we know that 'data' is uint32_t aligned?  Casting earlier in the
function could make that more obvious.

[...]
> +int load_pack_mtimes(struct packed_git *p)

This could use a doc comment in the header file.  For example, what
requirements do we have on what the caller passes as 'p'?

[...]
> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)

Likewise.

[...]
> --- a/packfile.c
> +++ b/packfile.c
[...]
> @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
>  
>  void unlink_pack_path(const char *pack_name, int force_delete)
>  {
> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};

Are these in any particular order?  Should they be?

[...]
> @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>  	if (!access(p->pack_name, F_OK))
>  		p->pack_promisor = 1;
>  
> +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
> +	if (!access(p->pack_name, F_OK))
> +		p->is_cruft = 1;
> +
>  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
>  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
>  		free(p);
> @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>  	    ends_with(file_name, ".pack") ||
>  	    ends_with(file_name, ".bitmap") ||
>  	    ends_with(file_name, ".keep") ||
> -	    ends_with(file_name, ".promisor"))
> +	    ends_with(file_name, ".promisor") ||
> +	    ends_with(file_name, ".mtimes"))

likewise

Thanks,
Jonathan

* Re: [PATCH v5 00/17] cruft packs
  2022-05-21 11:17   ` [PATCH v5 00/17] " Ævar Arnfjörð Bjarmason
@ 2022-05-24 19:39     ` Jonathan Nieder
  2022-05-24 21:50       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-05-24 19:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, derrickstolee, gitster, larsxschneider, tytso

Hi,

Ævar Arnfjörð Bjarmason wrote:
> On Fri, May 20 2022, Taylor Blau wrote:
> 	On Mon, Nov 29 2021, Taylor Blau wrote:

> 	> +static void write_mtimes_header(struct hashfile *f)
> 	> +{
> 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> 	> +	hashwrite_be32(f, MTIMES_VERSION);
> 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> 	> +}
[...]
> 	But since this is a new format I think it's worth considering not using
> 	the 1 or 2 you get from oid_version(), but the "format_id",
> 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
>
> 	You'll use the same space in the format for it, but we'll end up with
> 	something more obvious (as the integer encodes the sha1 or sha256 name).

Agreed.

[...]
> Other than that the only question I have (I think) on this series is if
> Jonathan Nieder is happy with it. I looked back in my logs and there was
> an extensive on-IRC discussion about it at the end of March, which ended
> in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/
>
> But it seems Jonathan didn't chime in since then, and he had some major
> issues with the approach here. I think those should have been addressed
> by that discussion, but it would be nice to get a confirmation.

I would still prefer if this used a repository format extension, but
that preference is not strong enough that I'd say "this must not go in
without one".  What I think would help would be some information in
the user-facing documentation for commands that create and work with
cruft packs.  In other words, if our take on people sharing
repositories between implementations that understand and don't
understand cruft packs and get objects moving back and forth between
packed and loose objects is "you should have known you were doing
something strange", the least we can do is to warn them.

I don't see a config to enable PACK_CRUFT by default yet in this
series.  I'd like one, so that people can turn it on and get the good
new behavior. :)

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 201+ messages in thread

* RE: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 19:32     ` Jonathan Nieder
@ 2022-05-24 19:44       ` rsbecker
  2022-05-24 22:25         ` Taylor Blau
  2022-05-24 22:21       ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
  1 sibling, 1 reply; 201+ messages in thread
From: rsbecker @ 2022-05-24 19:44 UTC (permalink / raw)
  To: 'Jonathan Nieder', 'Taylor Blau'
  Cc: git, avarab, derrickstolee, gitster, larsxschneider, tytso

On May 24, 2022 3:32 PM, Jonathan Nieder wrote:
>Taylor Blau wrote:
>
>> This patch prepares for cruft packs by defining the `.mtimes` format,
>> and introducing a basic API that callers can use to read out
>> individual mtimes.
>
>Makes sense.  Does this intend to produce any functional change?  I'm guessing
>not (and the lack of tests agrees), but the commit message doesn't say so.
>
>By the way, is this something we could cover in tests, e.g. using a test helper
>that exercises the new code?
>
>[...]
>> --- a/Documentation/technical/pack-format.txt
>> +++ b/Documentation/technical/pack-format.txt
>> @@ -294,6 +294,25 @@ Pack file entry: <+
>>
>>  All 4-byte numbers are in network order.
>>
>> +== pack-*.mtimes files have the format:
>> +
>> +All 4-byte numbers are in network byte order.
>> +
>> +  - A 4-byte magic number '0x4d544d45' ('MTME').
>> +
>> +  - A 4-byte version identifier (= 1).
>> +
>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
>> +
>> +  - A table of 4-byte unsigned integers. The ith value is the
>> +    modification time (mtime) of the ith object in the corresponding
>> +    pack by lexicographic (index) order. The mtimes count standard
>> +    epoch seconds.
>> +
>> +  - A trailer, containing a checksum of the corresponding packfile,
>> +    and a checksum of all of the above (each having length according
>> +    to the specified hash function).
>> +
>
>This describes the "syntax" but not the "semantics" of the file.
>Should I look to a separate piece of documentation for the semantics?
>If so, can this one include a mention of that piece of documentation to make it
>easier to find?
>
>[...]
>> --- a/object-store.h
>> +++ b/object-store.h
>> @@ -115,12 +115,15 @@ struct packed_git {
>>  		 freshened:1,
>>  		 do_not_close:1,
>>  		 pack_promisor:1,
>> -		 multi_pack_index:1;
>> +		 multi_pack_index:1,
>> +		 is_cruft:1;
>>  	unsigned char hash[GIT_MAX_RAWSZ];
>>  	struct revindex_entry *revindex;
>>  	const uint32_t *revindex_data;
>>  	const uint32_t *revindex_map;
>>  	size_t revindex_size;
>> +	const uint32_t *mtimes_map;
>> +	size_t mtimes_size;
>
>What does mtimes_map contain?  A comment would help.
>
>
>> --- /dev/null
>> +++ b/pack-mtimes.c
>> @@ -0,0 +1,126 @@
>> +#include "pack-mtimes.h"
>> +#include "object-store.h"
>> +#include "packfile.h"
>
>Missing #include of git-compat-util.h.
>
>> +
>> +static char *pack_mtimes_filename(struct packed_git *p)
>> +{
>> +	size_t len;
>> +	if (!strip_suffix(p->pack_name, ".pack", &len))
>> +		BUG("pack_name does not end in .pack");
>> +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
>> +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
>> +}
>
>This seems simple enough that it's not obvious we need more code sharing.
>Do you agree?  If so, I'd suggest just removing the NEEDSWORK comment.
>
>> +
>> +#define MTIMES_HEADER_SIZE (12)
>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>
>Hm, the all-caps name makes this feel like a compile-time constant but it
>contains a reference to the_hash_algo.  Could it be an inline function instead?
>
>> +
>> +struct mtimes_header {
>> +	uint32_t signature;
>> +	uint32_t version;
>> +	uint32_t hash_id;
>> +};
>> +
>> +static int load_pack_mtimes_file(char *mtimes_file,
>> +				 uint32_t num_objects,
>> +				 const uint32_t **data_p, size_t *len_p)
>
>What does this function do?  A comment would help.
>
>> +{
>> +	int fd, ret = 0;
>> +	struct stat st;
>> +	void *data = NULL;
>> +	size_t mtimes_size;
>> +	struct mtimes_header header;
>> +	uint32_t *hdr;
>> +
>> +	fd = git_open(mtimes_file);
>> +
>> +	if (fd < 0) {
>
>nit: this would be more readable without the blank line between setting and
>checking fd (likewise for the other examples below).
>> +		ret = -1;
>> +		goto cleanup;
>> +	}
>
>[...]
>> +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
>
>This presupposes that the hash_id matches the_hash_algo.  Maybe worth a
>NEEDSWORK comment.
>
>[...]
>> +cleanup:
>> +	if (ret) {
>> +		if (data)
>> +			munmap(data, mtimes_size);
>> +	} else {
>> +		*len_p = mtimes_size;
>> +		*data_p = (const uint32_t *)data;
>
>Do we know that 'data' is uint32_t aligned?  Casting earlier in the function
>could make that more obvious.
>
>[...]
>> +int load_pack_mtimes(struct packed_git *p)
>
>This could use a doc comment in the header file.  For example, what requirements
>do we have on what the caller passes as 'p'?
>
>[...]
>> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>
>Likewise.
>
>[...]
>> --- a/packfile.c
>> +++ b/packfile.c
>[...]
>> @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
>>
>>  void unlink_pack_path(const char *pack_name, int force_delete)
>>  {
>> -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
>> +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
>
>Are these in any particular order?  Should they be?
>
>[...]
>> @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>>  	if (!access(p->pack_name, F_OK))
>>  		p->pack_promisor = 1;
>>
>> +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
>> +	if (!access(p->pack_name, F_OK))
>> +		p->is_cruft = 1;
>> +
>>  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
>>  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
>>  		free(p);
>> @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>>  	    ends_with(file_name, ".pack") ||
>>  	    ends_with(file_name, ".bitmap") ||
>>  	    ends_with(file_name, ".keep") ||
>> -	    ends_with(file_name, ".promisor"))
>> +	    ends_with(file_name, ".promisor") ||
>> +	    ends_with(file_name, ".mtimes"))
>
>likewise

I am again concerned about 32-bit time_t assumptions. time_t is 32-bit on
some platforms, signed/unsigned, and sometimes 64-bit. We are talking about
potentially long-persistent files, as I understand this series, so we should
not be limiting times to end at 2038. That's only 16 years off and I would
wager that many clones that exist today will exist then.
--Randall


^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 00/17] cruft packs
  2022-05-24 19:39     ` Jonathan Nieder
@ 2022-05-24 21:50       ` Taylor Blau
  2022-05-24 21:55         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-24 21:50 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, git, derrickstolee,
	gitster, larsxschneider, tytso

On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > On Fri, May 20 2022, Taylor Blau wrote:
> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
>
> > 	> +static void write_mtimes_header(struct hashfile *f)
> > 	> +{
> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> > 	> +}
> [...]
> > 	But since this is a new format I think it's worth considering not using
> > 	the 1 or 2 you get from oid_version(), but the "format_id",
> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
> >
> > 	You'll use the same space in the format for it, but we'll end up with
> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
>
> Agreed.

I know we recommend using the format_id for on-disk formats, but I think
there are enough existing uses of "1" or "2" that either is acceptable
in practice.

E.g., grepping around for "hashwrite.*oid_version", there are three
existing formats that use "1" or "2" instead of the format_id. They are:

  - the commit-graph format
  - the midx format
  - the .rev format

Moreover, I can't seem to find any formats that _don't_ use that
convention. So I have a vague preference towards using the values "1"
and "2" as we currently do in these patches. (TBH, I don't find "sha1"
significantly more interpretable than just "1", so I would be just as
happy leaving it as-is).

> [...]
> > Other than that the only question I have (I think) on this series is if
> > Jonathan Nieder is happy with it. I looked back in my logs and there was
> > an extensive on-IRC discussion about it at the end of March, which ended
> > in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/
> >
> > But it seems Jonathan didn't chime in since then, and he had some major
> > issues with the approach here. I think those should have been addressed
> > by that discussion, but it would be nice to get a confirmation.
>
> I would still prefer if this used a repository format extension, but
> that preference is not strong enough that I'd say "this must not go in
> without one".  What I think would help would be some information in
> the user-facing documentation for commands that create and work with
> cruft packs.  In other words, if our take on people sharing
> repositories between implementations that understand and don't
> understand cruft packs and get objects moving back and forth between
> packed and loose objects is "you should have known you were doing
> something strange", the least we can do is to warn them.

I think that's a good suggestion. We already have some documentation in
Documentation/technical/cruft-packs.txt, but I think it could be helpful
to add user-facing documentation, too.

Would you be opposed to doing that outside of this series? ISTM that the
technical discussion has mostly settled, so I'd rather wordsmith the
user-facing documentation separately.

> I don't see a config to enable PACK_CRUFT by default yet in this
> series.  I'd like one, so that people can turn it on and get the good
> new behavior. :)

`git gc` has support for this (c.f., "gc.cruftPacks"). `git repack`
requires you to pass `--cruft`; IIRC I originally had a similar
configuration in `git repack` which would change the behavior of `-A` /
`-a` when set, but I found it too confusing and scrapped it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 00/17] cruft packs
  2022-05-24 21:50       ` Taylor Blau
@ 2022-05-24 21:55         ` Ævar Arnfjörð Bjarmason
  2022-05-24 22:12           ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-24 21:55 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Jonathan Nieder, git, derrickstolee, gitster, larsxschneider,
	tytso


On Tue, May 24 2022, Taylor Blau wrote:

> On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
>> Ævar Arnfjörð Bjarmason wrote:
>> > On Fri, May 20 2022, Taylor Blau wrote:
>> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
>>
>> > 	> +static void write_mtimes_header(struct hashfile *f)
>> > 	> +{
>> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
>> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
>> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
>> > 	> +}
>> [...]
>> > 	But since this is a new format I think it's worth considering not using
>> > 	the 1 or 2 you get from oid_version(), but the "format_id",
>> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
>> >
>> > 	You'll use the same space in the format for it, but we'll end up with
>> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
>>
>> Agreed.
>
> I know we recommend using the format_id for on-disk formats, but I think
> there are enough existing uses of "1" or "2" that either is acceptable
> in practice.
>
> E.g., grepping around for "hashwrite.*oid_version", there are three
> existing formats that use "1" or "2" instead of the format_id. They are:
>
>   - the commit-graph format
>   - the midx format
>   - the .rev format
>
> Moreover, I can't seem to find any formats that _don't_ use that
> convention.

It's used in the reftable format.

> So I have a vague preference towards using the values "1"
> and "2" as we currently do in these patches.

I suspect that's less "vague" and more "c'mon, I'm using it in
production already" :)

Anyway, I'm fine with leaving it be as you have it currently. I'd first
encountered this magic with reftable I think, and only recently found
that we use 1 and 2 in these other more recent places.

> (TBH, I don't find "sha1"
> significantly more interpretable than just "1", so I would be just as
> happy leaving it as-is).

Hrm, I'd think having it sha1 or s256 in big-endian would be a bit more
self-explanatory. I.e. SHA-256 is 2, not 256, and our 3 (if that ever
arrives) is likely not to be SHA-3 (but probably some successor).
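
To make the "self-explanatory" point concrete, here is a minimal sketch of
what the two format_id values look like once written in network byte
order. The constants mirror the ones defined in git's hash.h; the helper
format_id_to_text() is a hypothetical name used only for this
illustration, not something in the tree.

```c
#include <stdint.h>

/* Values as defined in git's hash.h, copied so the sketch stands alone. */
#define GIT_SHA1_FORMAT_ID   0x73686131u
#define GIT_SHA256_FORMAT_ID 0x73323536u

/*
 * Hypothetical helper: lay out a 32-bit value in network (big-endian)
 * byte order, the way a be32 write puts it on disk, and NUL-terminate
 * it so it can be read as a string.
 */
static void format_id_to_text(uint32_t id, char out[5])
{
	out[0] = (char)((id >> 24) & 0xff);
	out[1] = (char)((id >> 16) & 0xff);
	out[2] = (char)((id >> 8) & 0xff);
	out[3] = (char)(id & 0xff);
	out[4] = '\0';
}
```

Fed GIT_SHA1_FORMAT_ID this yields the bytes "sha1", and for
GIT_SHA256_FORMAT_ID it yields "s256", so a hexdump of the header names
the hash directly, whereas oid_version() would store 1 or 2.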

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 00/17] cruft packs
  2022-05-24 21:55         ` Ævar Arnfjörð Bjarmason
@ 2022-05-24 22:12           ` Taylor Blau
  2022-05-25  7:53             ` Jonathan Nieder
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-24 22:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, git, derrickstolee, gitster, larsxschneider,
	tytso

On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Tue, May 24 2022, Taylor Blau wrote:
>
> > On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
> >> Ævar Arnfjörð Bjarmason wrote:
> >> > On Fri, May 20 2022, Taylor Blau wrote:
> >> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
> >>
> >> > 	> +static void write_mtimes_header(struct hashfile *f)
> >> > 	> +{
> >> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> >> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
> >> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> >> > 	> +}
> >> [...]
> >> > 	But since this is a new format I think it's worth considering not using
> >> > 	the 1 or 2 you get from oid_version(), but the "format_id",
> >> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
> >> >
> >> > 	You'll use the same space in the format for it, but we'll end up with
> >> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
> >>
> >> Agreed.
> >
> > I know we recommend using the format_id for on-disk formats, but I think
> > there are enough existing uses of "1" or "2" that either is acceptable
> > in practice.
> >
> > E.g., grepping around for "hashwrite.*oid_version", there are three
> > existing formats that use "1" or "2" instead of the format_id. They are:
> >
> >   - the commit-graph format
> >   - the midx format
> >   - the .rev format
> >
> > Moreover, I can't seem to find any formats that _don't_ use that
> > convention.
>
> It's used in the reftable format.

Ah, thanks for pointing it out. Still, I think there are enough uses of
"1" and "2" over format_id that I'm not convinced here.

> > So I have a vague preference towards using the values "1"
> > and "2" as we currently do in these patches.
>
> I suspect that's less "vague" and more "c'mon, I'm using it in
> production already" :)

No, this wasn't a veil over anything. Yes, GitHub is using this in
production already, but that isn't why I'm opposed here. I'm opposed for
the reasons I explained in the quoted bits (and would happily carry a
small amount of custom code in GitHub's fork to continue to recognize
the "1" or "2" values if this ever changed to use format_id).

> Anyway, I'm fine with leaving it be as you have it currently. I'd first
> encountered this magic with reftable I think, and only recently found
> that we use 1 and 2 in these other more recent places.

Sounds good. Unless others have a very strong opinion, let's leave it as
is.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 19:32     ` Jonathan Nieder
  2022-05-24 19:44       ` rsbecker
@ 2022-05-24 22:21       ` Taylor Blau
  2022-05-25  7:48         ` Jonathan Nieder
  1 sibling, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-24 22:21 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Taylor Blau, git, avarab, derrickstolee, gitster, larsxschneider,
	tytso

On Tue, May 24, 2022 at 12:32:01PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
>
> > This patch prepares for cruft packs by defining the `.mtimes` format,
> > and introducing a basic API that callers can use to read out individual
> > mtimes.
>
> Makes sense.  Does this intend to produce any functional change?  I'm
> guessing not (and the lack of tests agrees), but the commit message
> doesn't say so.
>
> By the way, is this something we could cover in tests, e.g. using a
> test helper that exercises the new code?

This does not produce a functional change, no. This commit in isolation
adds a bunch of dead code that will be used (and tested) in the
following patches.

There is a test helper that is added (and then used extensively further
on in the series) four patches later, c.f., "t/helper: add 'pack-mtimes'
test-tool".

> [...]
> > --- a/Documentation/technical/pack-format.txt
> > +++ b/Documentation/technical/pack-format.txt
> > @@ -294,6 +294,25 @@ Pack file entry: <+
> >
> >  All 4-byte numbers are in network order.
> >
> > +== pack-*.mtimes files have the format:
> > +
> > +All 4-byte numbers are in network byte order.
> > +
> > +  - A 4-byte magic number '0x4d544d45' ('MTME').
> > +
> > +  - A 4-byte version identifier (= 1).
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
> > +
> > +  - A table of 4-byte unsigned integers. The ith value is the
> > +    modification time (mtime) of the ith object in the corresponding
> > +    pack by lexicographic (index) order. The mtimes count standard
> > +    epoch seconds.
> > +
> > +  - A trailer, containing a checksum of the corresponding packfile,
> > +    and a checksum of all of the above (each having length according
> > +    to the specified hash function).
> > +
>
> This describes the "syntax" but not the "semantics" of the file.
> Should I look to a separate piece of documentation for the semantics?
> If so, can this one include a mention of that piece of documentation
> to make it easier to find?
>
> [...]
> > --- a/object-store.h
> > +++ b/object-store.h
> > @@ -115,12 +115,15 @@ struct packed_git {
> >  		 freshened:1,
> >  		 do_not_close:1,
> >  		 pack_promisor:1,
> > -		 multi_pack_index:1;
> > +		 multi_pack_index:1,
> > +		 is_cruft:1;
> >  	unsigned char hash[GIT_MAX_RAWSZ];
> >  	struct revindex_entry *revindex;
> >  	const uint32_t *revindex_data;
> >  	const uint32_t *revindex_map;
> >  	size_t revindex_size;
> > +	const uint32_t *mtimes_map;
> > +	size_t mtimes_size;
>
> What does mtimes_map contain?  A comment would help.

It contains a pointer to the beginning of the mmapped region of the
.mtimes file, similar to revindex_map above it.
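
As a rough illustration of how a pointer to the start of the mapping
relates to individual entries, here is a standalone sketch based only on
the layout in pack-format.txt (three 4-byte header words, then one
big-endian 4-byte mtime per object in index order). The names read_be32()
and mtime_at() are made up for this example and are not the functions in
the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MTIMES_HEADER_WORDS 3 /* signature, version, hash function id */

/* Decode one big-endian 32-bit word without assuming host byte order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return the mtime of the object at index position 'pos', given a
 * pointer to the first byte of the mapped .mtimes file. The caller is
 * assumed to have checked that 'pos' is below the object count.
 */
static uint32_t mtime_at(const void *map, uint32_t pos)
{
	const unsigned char *base = map;
	return read_be32(base + 4 * ((size_t)MTIMES_HEADER_WORDS + pos));
}
```

Because the map starts at the header rather than at the table, every
lookup has to skip the three header words first.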

>
> > --- /dev/null
> > +++ b/pack-mtimes.c
> > @@ -0,0 +1,126 @@
> > +#include "pack-mtimes.h"
> > +#include "object-store.h"
> > +#include "packfile.h"
>
> Missing #include of git-compat-util.h.

Ah, good eyes: thanks.

Junio: would you like a replacement patch / a whole new copy of the
series / or can you amend this locally when queuing? Whatever is lowest
effort for you works for me.

> > +
> > +static char *pack_mtimes_filename(struct packed_git *p)
> > +{
> > +	size_t len;
> > +	if (!strip_suffix(p->pack_name, ".pack", &len))
> > +		BUG("pack_name does not end in .pack");
> > +	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
> > +	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
> > +}
>
> This seems simple enough that it's not obvious we need more code
> sharing.  Do you agree?  If so, I'd suggest just removing the
> NEEDSWORK comment.

Yeah, it is conceptually simple, though it feels like the sort of thing
that could benefit from not having to be written once for each
extension (hence the comment).
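
For what it's worth, the kind of shared helper the NEEDSWORK has in mind
might look roughly like the sketch below. It is written against plain
libc so that it stands alone; pack_name_with_ext() is a hypothetical name,
and real git code would more likely use strip_suffix() and xstrfmt() as
the patch does.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: replace a trailing ".pack" with '.' + ext (for
 * example "mtimes" or "rev"). Returns a malloc'd string that the caller
 * must free, or NULL if 'pack_name' does not end in ".pack" or the
 * allocation fails.
 */
static char *pack_name_with_ext(const char *pack_name, const char *ext)
{
	static const char suffix[] = ".pack";
	size_t len = strlen(pack_name);
	size_t suffix_len = sizeof(suffix) - 1;
	size_t stem_len, out_len;
	char *out;

	if (len < suffix_len ||
	    strcmp(pack_name + len - suffix_len, suffix) != 0)
		return NULL;

	stem_len = len - suffix_len;
	out_len = stem_len + 1 + strlen(ext) + 1; /* stem + '.' + ext + NUL */
	out = malloc(out_len);
	if (!out)
		return NULL;
	snprintf(out, out_len, "%.*s.%s", (int)stem_len, pack_name, ext);
	return out;
}
```

With something like this, each per-extension function would reduce to a
one-line call that passes its own extension.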

> > +
> > +#define MTIMES_HEADER_SIZE (12)
> > +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>
> Hm, the all-caps name makes this feel like a compile-time constant but
> it contains a reference to the_hash_algo.  Could it be an inline
> function instead?

Yes, it could be an inline function, but I don't think there is
necessarily anything wrong with it being a #define'd macro. There are
some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
the_hash_algo on the right-hand side of a `#define`.
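
For comparison, the inline-function alternative being suggested would
look something like the sketch below. It uses a stand-in struct instead
of git's real hash algorithm type so that it compiles on its own; both
struct hash_algo_sketch and mtimes_min_size() are illustrative names.

```c
#include <stddef.h>

/* Stand-in for git's hash algorithm descriptor; only rawsz matters here. */
struct hash_algo_sketch {
	size_t rawsz; /* 20 for SHA-1, 32 for SHA-256 */
};

#define MTIMES_HEADER_SIZE 12

/* Header plus the two trailing checksums, with the dependency explicit. */
static inline size_t mtimes_min_size(const struct hash_algo_sketch *algo)
{
	return MTIMES_HEADER_SIZE + 2 * algo->rawsz;
}
```

The result is the same as the macro (52 bytes for SHA-1, 76 for SHA-256);
the difference is only that the hash algorithm becomes a visible
parameter rather than an implicit global.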

> > +
> > +struct mtimes_header {
> > +	uint32_t signature;
> > +	uint32_t version;
> > +	uint32_t hash_id;
> > +};
> > +
> > +static int load_pack_mtimes_file(char *mtimes_file,
> > +				 uint32_t num_objects,
> > +				 const uint32_t **data_p, size_t *len_p)
>
> What does this function do?  A comment would help.

I know that I'm biased as the author of this code, but I think the
signature is clear here. At least, I'm not sure what information a
comment would add that the function name and its arguments don't already
convey.

> > +	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
>
> This presupposes that the hash_id matches the_hash_algo.  Maybe worth
> a NEEDSWORK comment.

Good catch.
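
One way such a NEEDSWORK could eventually be resolved is to derive the
expected size from the hash identifier stored in the file's own header
instead of from the_hash_algo. The sketch below assumes the identifiers
from pack-format.txt (1 for SHA-1, 2 for SHA-256); rawsz_for_hash_id() and
mtimes_size_ok() are hypothetical names, not part of this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw digest size for the header's hash identifier; 0 if unknown. */
static size_t rawsz_for_hash_id(uint32_t hash_id)
{
	switch (hash_id) {
	case 1: return 20; /* SHA-1 */
	case 2: return 32; /* SHA-256 */
	default: return 0;
	}
}

/*
 * Return 1 if 'file_size' is exactly what a .mtimes file holding
 * 'num_objects' entries with the given header hash id should occupy,
 * and 0 otherwise.
 */
static int mtimes_size_ok(uint64_t file_size, uint32_t hash_id,
			  uint32_t num_objects)
{
	size_t rawsz = rawsz_for_hash_id(hash_id);
	uint64_t expected;

	if (!rawsz)
		return 0;
	/* 12-byte header + one 4-byte mtime per object + two checksums */
	expected = 12 + 4 * (uint64_t)num_objects + 2 * (uint64_t)rawsz;
	return file_size == expected;
}
```

The arithmetic is done in 64 bits so that a large object count cannot
overflow the multiplication, which is the job st_mult() does in the
patch.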

> [...]
> > +cleanup:
> > +	if (ret) {
> > +		if (data)
> > +			munmap(data, mtimes_size);
> > +	} else {
> > +		*len_p = mtimes_size;
> > +		*data_p = (const uint32_t *)data;
>
> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
> function could make that more obvious.

`data` is definitely uint32_t aligned, but this is a tradeoff, since if
we wrote:

    uint32_t *data = xmmap(...);

then I think we would have to change the case where ret is non-zero to be:

    if (data)
        munmap((void*)data, ...);

and likewise, data_p is const.

> > +int load_pack_mtimes(struct packed_git *p)
>
> This could use a doc comment in the header file.  For example, what
> requirements do we have on what the caller passes as 'p'?
>
> [...]
> > +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>
> Likewise.

Sure. I wonder when we should do that, though. I'm not trying to be
impatient to get this merged, but iterating on the documentation feels
like it could be done on top without having to re-send the substantive
parts of this series over and over.

> [...]
> > --- a/packfile.c
> > +++ b/packfile.c
> [...]
> > @@ -363,7 +373,7 @@ void close_object_store(struct raw_object_store *o)
> >
> >  void unlink_pack_path(const char *pack_name, int force_delete)
> >  {
> > -	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
> > +	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".mtimes"};
>
> Are these in any particular order?  Should they be?
>
> [...]

These aren't in any particular order, and for the most part they don't
need to be, though .idx should come first. I'm leaving it alone here
(semi-intentionally, since the race it opens up isn't related to this
series, and it's on my list to deal with after this code has settled).

> > @@ -718,6 +728,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
> >  	if (!access(p->pack_name, F_OK))
> >  		p->pack_promisor = 1;
> >
> > +	xsnprintf(p->pack_name + path_len, alloc - path_len, ".mtimes");
> > +	if (!access(p->pack_name, F_OK))
> > +		p->is_cruft = 1;
> > +
> >  	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
> >  	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
> >  		free(p);
> > @@ -869,7 +883,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
> >  	    ends_with(file_name, ".pack") ||
> >  	    ends_with(file_name, ".bitmap") ||
> >  	    ends_with(file_name, ".keep") ||
> > -	    ends_with(file_name, ".promisor"))
> > +	    ends_with(file_name, ".promisor") ||
> > +	    ends_with(file_name, ".mtimes"))
>
> likewise

No specific order here (since these are all OR'd together).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 19:44       ` rsbecker
@ 2022-05-24 22:25         ` Taylor Blau
  2022-05-24 23:24           ` rsbecker
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-24 22:25 UTC (permalink / raw)
  To: rsbecker
  Cc: 'Jonathan Nieder', 'Taylor Blau', git, avarab,
	derrickstolee, gitster, larsxschneider, tytso

On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit on
> some platforms, signed/unsigned, and sometimes 64-bit. We are talking about
> potentially long-persistent files, as I understand this series, so we should
> not be limiting times to end at 2038. That's only 16 years off and I would
> wager that many clones that exist today will exist then.

Note that we're using unsigned fields here, so we have until 2106 (see
my earlier response on this in
https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
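
The 2038 and 2106 figures can be checked with a little arithmetic. The
sketch below is only an approximation: it divides the largest
representable second count by the average length of a Gregorian year,
which is enough to identify the calendar year but not the exact date.
last_year_representable() is a name made up for this example.

```c
#include <stdint.h>

/* Average length of a Gregorian year in seconds (365.2425 days). */
#define SECONDS_PER_YEAR 31556952ULL

/* Approximate calendar year in which a timestamp of 'max_seconds' falls. */
static unsigned last_year_representable(uint64_t max_seconds)
{
	return 1970 + (unsigned)(max_seconds / SECONDS_PER_YEAR);
}
```

Passing INT32_MAX gives 2038 and passing UINT32_MAX gives 2106, which is
where the extra 68 years from dropping the sign bit come from.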

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* RE: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 22:25         ` Taylor Blau
@ 2022-05-24 23:24           ` rsbecker
  2022-05-25  0:07             ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: rsbecker @ 2022-05-24 23:24 UTC (permalink / raw)
  To: 'Taylor Blau'
  Cc: 'Jonathan Nieder', git, avarab, derrickstolee, gitster,
	larsxschneider, tytso

On May 24, 2022 6:25 PM, Taylor Blau wrote:
>On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
>> on some platforms, signed/unsigned, and sometimes 64-bit. We are
>> talking about potentially long-persistent files, as I understand this
>> series, so we should not be limiting times to end at 2038. That's only
>> 16 years off and I would wager that many clones that exist today will exist then.
>
>Note that we're using unsigned fields here, so we have until 2106 (see my earlier
>response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).

I appreciate that, but 32-bit time_t is still signed on many platforms, so when cast, it still might, at some point in another series, cause issues. Please be cautious. I expect that this is the particular hill on which I will die. 😉
--Randall


^ permalink raw reply	[flat|nested] 201+ messages in thread

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 23:24           ` rsbecker
@ 2022-05-25  0:07             ` Taylor Blau
  2022-05-25  0:20               ` rsbecker
  2022-05-25  9:11               ` adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files) Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-25  0:07 UTC (permalink / raw)
  To: rsbecker
  Cc: 'Taylor Blau', 'Jonathan Nieder', git, avarab,
	derrickstolee, gitster, larsxschneider, tytso

On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
> On May 24, 2022 6:25 PM, Taylor Blau wrote:
> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are
> >> talking about potentially long-persistent files, as I understand this
> >> series, so we should not be limiting times to end at 2038. That's only
> >> 16 years off and I would wager that many clones that exist today will exist then.
> >
> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier
> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>
> I appreciate that, but 32-bit time_t is still signed on many
> platforms, so when cast, it still might, at some point in another
> series, cause issues. Please be cautious. I expect that this is the
> particular hill on which I will die. 😉
> --Randall

Yes, definitely. There is only one spot that we turn the result of
nth_packed_mtime() into a time_t, and that's in
add_object_in_unpacked_pack(). The code there is something like:

    time_t mtime;
    if (pack->is_cruft)
      mtime = nth_packed_mtime(pack, object_pos);
    else
      mtime = pack->mtime;

    ...

    add_cruft_object_entry(oid, ..., mtime);

...and the reason mtime is a time_t is because that's the type of
pack->mtime.

And we quickly convert that back to a uint32_t in
add_cruft_object_entry(). If time_t is signed, then we'll truncate any
values beyond 2106, and pre-epoch values will become large positive
values. That means our error is one-sided in the favorable direction,
i.e., that we'll keep objects around for longer instead of pruning
something that we shouldn't have.
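
To illustrate what the conversion does at the edges, here is a small
standalone sketch. It uses int64_t as a stand-in for time_t so that the
behavior does not depend on the platform, and to_disk_mtime() is a
hypothetical name rather than anything in this series.

```c
#include <stdint.h>

/*
 * Mimic storing a timestamp in a 32-bit unsigned on-disk field.
 * Converting a signed value to uint32_t is well-defined in C: the
 * result is the value reduced modulo 2^32.
 */
static uint32_t to_disk_mtime(int64_t t)
{
	return (uint32_t)t;
}
```

Under this sketch a pre-epoch value such as -1 becomes 0xffffffff, so it
looks very recent, and an ordinary present-day timestamp is stored
unchanged. A value past the 2106 limit keeps only its low 32 bits, so
2^32 + 5 is stored as 5.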

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 201+ messages in thread

* RE: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-25  0:07             ` Taylor Blau
@ 2022-05-25  0:20               ` rsbecker
  2022-05-25  9:11               ` adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files) Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 201+ messages in thread
From: rsbecker @ 2022-05-25  0:20 UTC (permalink / raw)
  To: 'Taylor Blau'
  Cc: 'Jonathan Nieder', git, avarab, derrickstolee, gitster,
	larsxschneider, tytso

On May 24, 2022 8:08 PM, Taylor Blau wrote:
>On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
>> On May 24, 2022 6:25 PM, Taylor Blau wrote:
>> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> >> I am again concerned about 32-bit time_t assumptions. time_t is
>> >> 32-bit on some platforms, signed/unsigned, and sometimes 64-bit. We
>> >> are talking about potentially long-persistent files, as I
>> >> understand this series, so we should not be limiting times to end
>> >> at 2038. That's only 16 years off and I would wager that many
>> >> clones that exist today will exist then.
>> >
>> >Note that we're using unsigned fields here, so we have until 2106
>> >(see my earlier response on this in
>> >https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>>
>> I appreciate that, but 32-bit time_t is still signed on many
>> platforms, so when cast, it still might, at some point in another
>> series, cause issues. Please be cautious. I expect that this is the
>> particular hill on which I will die. 😉
>> --Randall
>
>Yes, definitely. There is only one spot that we turn the result of
>nth_packed_mtime() into a time_t, and that's in
>add_object_in_unpacked_pack(). The code there is something like:
>
>    time_t mtime;
>    if (pack->is_cruft)
>      mtime = nth_packed_mtime(pack, object_pos);
>    else
>      mtime = pack->mtime;
>
>    ...
>
>    add_cruft_object_entry(oid, ..., mtime);
>
>...and the reason mtime is a time_t is because that's the type of
>pack->mtime.
>
>And we quickly convert that back to a uint32_t in add_cruft_object_entry(). If
>time_t is signed, then we'll truncate any values beyond 2106, and pre-epoch
>values will become large positive values. That means our error is one-sided in the
>favorable direction, i.e., that we'll keep objects around for longer instead of
>pruning something that we shouldn't have.

I can only hope. I am working with the platform compiler team on time_t
issues. I am hoping we can get to 64-bit builds within two years, but
that is out of my control. That would make time_t an int64_t, which
puts failure well outside my own lifespan and that of anyone else in my
company. Provisioning for signed 64-bit time values would be prudent
even if unsupported in a specific build. We are almost 1/4 through 20xx.
--Randall



* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-24 22:21       ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
@ 2022-05-25  7:48         ` Jonathan Nieder
  2022-05-25 21:36           ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-05-25  7:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, gitster, larsxschneider, tytso

Hi,

Taylor Blau wrote:
> On Tue, May 24, 2022 at 12:32:01PM -0700, Jonathan Nieder wrote:

>> Makes sense.  Does this intend to produce any functional change?  I'm
>> guessing not (and the lack of tests agrees), but the commit message
>> doesn't say so.
[...]
> This does not produce a functional change, no. This commit in isolation
> adds a bunch of dead code that will be used (and tested) in the
> following patches.
[...]
>> What does mtimes_map contain?  A comment would help.
>
> It contains a pointer at the beginning of the mmapped region of the
> .mtimes file, similar to revindex_map above it.

To be clear, in cases like this by "comment" I mean "in-code comment".
I.e., my interest is not that _I_ find out the answer but that the
code becomes more maintainable via the answer becoming easier to find.

[...]
>> This seems simple enough that it's not obvious we need more code
>> sharing.  Do you agree?  If so, I'd suggest just removing the
>> NEEDSWORK comment.
>
> Yeah, it is conceptually simple, though it feels like the sort of thing
> that could benefit from not having to be written once for each
> extension (hence the comment).

The reason I asked is that the NEEDSWORK here actually got in the way
of comprehension for me --- it made me wonder "is there some
complexity here I'm missing?"

That's why I'd suggest one of
- removing the NEEDSWORK comment
- going ahead and implementing the code sharing you mean, or
- fleshing out the NEEDSWORK comment so the reader can wonder less

>>> +
>>> +#define MTIMES_HEADER_SIZE (12)
>>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>>
>> Hm, the all-caps name makes this feel like a compile-time constant but
>> it contains a reference to the_hash_algo.  Could it be an inline
>> function instead?
>
> Yes, it could be an inline function, but I don't think there is
> necessarily anything wrong with it being a #define'd macro. There are
> some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
> GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
> the_hash_algo on the right-hand side of a `#define`.

Those are due to an incomplete migration from use of the true constant
GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
In other words, "other examples do it wrong" doesn't feel like a great
justification for making it worse in new code.
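
For illustration, the inline-function form could look roughly like the
sketch below. The struct is a stand-in I made up so the snippet compiles
on its own (in git the value would come from the_hash_algo->rawsz), and
mtimes_min_size() is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in for git's hash algorithm descriptor; only rawsz matters here. */
struct hash_algo_sketch {
	size_t rawsz; /* raw hash size in bytes: 20 for SHA-1, 32 for SHA-256 */
};

#define MTIMES_HEADER_SIZE 12

/*
 * Hypothetical replacement for the MTIMES_MIN_SIZE macro. The value
 * depends on the hash algorithm at run time, and a lowercase function
 * name no longer suggests a compile-time constant.
 */
static inline size_t mtimes_min_size(const struct hash_algo_sketch *algo)
{
	return MTIMES_HEADER_SIZE + 2 * algo->rawsz;
}
```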

[...]
>>> +static int load_pack_mtimes_file(char *mtimes_file,
>>> +				 uint32_t num_objects,
>>> +				 const uint32_t **data_p, size_t *len_p)
>>
>> What does this function do?  A comment would help.
>
> I know that I'm biased as the author of this code, but I think the
> signature is clear here. At least, I'm not sure what information a
> comment would add that the function name and its arguments don't already
> convey.

Ah, thanks for this point of clarification.  What isn't clear from the
signature is
- when should I call this function?
- what does its return value represent?
- how does it handle errors?

I agree that the parameters are self-explanatory.

>>> +cleanup:
>>> +	if (ret) {
>>> +		if (data)
>>> +			munmap(data, mtimes_size);
>>> +	} else {
>>> +		*len_p = mtimes_size;
>>> +		*data_p = (const uint32_t *)data;
>>
>> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
>> function could make that more obvious.
>
> `data` is definitely uint32_t aligned, but this is a tradeoff, since if
> we wrote:
>
>     uint32_t *data = xmmap(...);
>
> then I think we would have to change the case where ret is non-zero to be:
>
>     if (data)
>         munmap((void*)data, ...);
>
> and likewise, data_p is const.

Doing it that way sounds great to me.  That way, the type contains the
information we need up-front and the safety of the cast is obvious in
the place where the cast is needed.

(Although my understanding is also that in C it's fine to pass a
uint32_t* to a function expecting a void*, so the second cast would
also not be needed.)
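
A small sketch of that point, with made-up names: release_region() is a
stand-in for munmap() (same parameter shape, no real unmapping) so that
the snippet is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for munmap(): takes a plain void pointer and a length. */
static int release_region(void *addr, size_t len)
{
	return (addr && len) ? 0 : -1;
}

/* Hypothetical cleanup path that holds the mapping as uint32_t *. */
static int cleanup_sketch(uint32_t *data, size_t len)
{
	/* uint32_t * converts to void * implicitly; no cast is needed. */
	return release_region(data, len);
}
```

The implicit conversion only covers the non-const pointer. Passing a
const uint32_t * here would discard the qualifier and still want a cast.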

[...]
>>> +int load_pack_mtimes(struct packed_git *p)
>>
>> This could use a doc comment in the header file.  For example, what
>> requirements do we have on what the caller passes as 'p'?
>>
>> [...]
>>> +uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos)
>>
>> Likewise.
>
> Sure. I wonder when we should do that, though. I'm not trying to be
> impatient to get this merged, but iterating on the documentation feels
> like it could be done on top without having to re-send the substantive
> parts of this series over and over.

In terms of re-sending patches, sending a "fixup!" patch with the
minor changes you want to make doesn't seem too problematic to me.  In
general a major benefit of code review is getting others' eyes on new
code from the standpoint of readability and maintainability; including
comments like this up front doesn't seem like a huge amount to ask
(versus getting those comments to be perfect, which would be
unreasonable to expect since it's not hard to update them over time).

> Thanks,
> Taylor

Thanks for looking it through.

Sincerely,
Jonathan


* Re: [PATCH v5 00/17] cruft packs
  2022-05-24 22:12           ` Taylor Blau
@ 2022-05-25  7:53             ` Jonathan Nieder
  2022-05-25 19:59               ` Derrick Stolee
  0 siblings, 1 reply; 201+ messages in thread
From: Jonathan Nieder @ 2022-05-25  7:53 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, derrickstolee,
	gitster, larsxschneider, tytso

Hi,

Taylor Blau wrote:
> On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:

>>> Moreover, I can't seem to find any formats that _don't_ use that
>>> convention.
>>
>> It's used in the reftable format.

It's also used in the formats described in
Documentation/technical/hash-function-transition.

[...]
> Sounds good. Unless others have a very strong opinion, let's leave it as
> is.

File formats are one of those things where a little time early can save
a lot of work later.  If there were a strong reason to use "1" and "2"
here then I'd be okay with living with it --- I'm a pragmatic person.
But in general, using the magic numbers instead of a sequential value is
really helpful both in making the file formats more self-explanatory and
in making it possible to experiment with multiple new hash_algos at the
same time.

The main argument I'm hearing for using "1" and "2" is "because some
other formats got that wrong".  That reason is the opposite of
compelling to me: it makes me suspect that as a project we should more
eagerly break the old bad habits and form new ones.  I guess this
qualifies as a very strong opinion.
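
To make the two conventions concrete, here is a sketch. The constants
are the ASCII spellings "sha1" and "s256" as big-endian 32-bit values,
which is what I understand git's format identifiers to be; the macro and
function names are made up for this example and are not git's.

```c
#include <stdint.h>

/* Four-byte identifiers that read as ASCII in a hex dump. */
#define SHA1_FORMAT_ID_SKETCH   0x73686131u /* 's' 'h' 'a' '1' */
#define SHA256_FORMAT_ID_SKETCH 0x73323536u /* 's' '2' '5' '6' */

/*
 * Map a four-byte identifier to the sequential value used by the
 * "1"/"2" convention. Returns 0 for an identifier this sketch does
 * not know about.
 */
static int oid_version_sketch(uint32_t format_id)
{
	switch (format_id) {
	case SHA1_FORMAT_ID_SKETCH:
		return 1;
	case SHA256_FORMAT_ID_SKETCH:
		return 2;
	default:
		return 0;
	}
}
```

Two experimental hash functions can each pick a distinct four-byte name
without coordinating, whereas both would be tempted to claim "3".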

Thanks,
Jonathan


* adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)
  2022-05-25  0:07             ` Taylor Blau
  2022-05-25  0:20               ` rsbecker
@ 2022-05-25  9:11               ` Ævar Arnfjörð Bjarmason
  2022-05-25 13:30                 ` Derrick Stolee
  1 sibling, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-25  9:11 UTC (permalink / raw)
  To: Taylor Blau
  Cc: rsbecker, 'Jonathan Nieder', git, derrickstolee, gitster,
	larsxschneider, tytso


On Tue, May 24 2022, Taylor Blau wrote:

> On Tue, May 24, 2022 at 07:24:14PM -0400, rsbecker@nexbridge.com wrote:
> >> On May 24, 2022 6:25 PM, Taylor Blau wrote:
>> >On Tue, May 24, 2022 at 03:44:00PM -0400, rsbecker@nexbridge.com wrote:
>> >> I am again concerned about 32-bit time_t assumptions. time_t is 32-bit
>> >> on some platforms, signed/unsigned, and sometimes 64-bit. We are
>> >> talking about potentially long-persistent files, as I understand this
>> >> series, so we should not be limiting times to end at 2038. That's only
>> >> 16 years off and I would wager that many clones that exist today will exist then.
>> >
>> >Note that we're using unsigned fields here, so we have until 2106 (see my earlier
>> >response on this in https://lore.kernel.org/git/YdiXecK6fAKl8++G@nand.local/).
>>
>> I appreciate that, but 32-bit time_t is still signed on many
>> platforms, so when cast, it still might, at some point in another
>> series, cause issues. Please be cautious. I expect that this is the
>> particular hill on which I will die. 😉
>> --Randall
>
> Yes, definitely. There is only one spot that we turn the result of
> nth_packed_mtime() into a time_t, and that's in
> add_object_in_unpacked_pack(). The code there is something like:
>
>     time_t mtime;
>     if (pack->is_cruft)
>       mtime = nth_packed_mtime(pack, object_pos);
>     else
>       mtime = pack->mtime;
>
>     ...
>
>     add_cruft_object_entry(oid, ..., mtime);
>
> ...and the reason mtime is a time_t is because that's the type of
> pack->mtime.
>
> And we quickly convert that back to a uint32_t in
> add_cruft_object_entry(). If time_t is signed, then we'll truncate any
> values beyond 2106, and pre-epoch values will become large positive
> values. That means our error is one-sided in the favorable direction,
> i.e., that we'll keep objects around for longer instead of pruning
> something that we shouldn't have.

I must say that I really don't like this part of the format. Is it
really necessary to optimize the storage space here in a way that leaves
open questions about future time_t compatibility, and having to
introduce the first use of unsigned 32 bit timestamps to git's codebase?

Yes, this is its own self-contained format, so we don't *need* time_t
here, but it's also really handy if we can eventually consistently use
64-bit time_t everywhere and not worry about any compatibility issues, or
unsigned vs. signed, or to create our own little ext4-like signed 32
bit timestamp format.

Once we hit 2038 (or near that date) this would be the only part of our
codebase & on-disk formats that I'm aware of that would differ from
time_t's signedness, but perhaps there's some I've missed.

If there isn't a demonstrable reason (as in some real numbers, or
accompanying benchmark etc.) to special-snowflake this I really think we
should just go for signed 64 bit here, i.e. matching time_t on 64 bit
systems.

If we really are trying to micro-optimize storage space here I'm willing
to bet that this is still a bad/premature optimization. There are much
better ways to store this sort of data in a compact way if that's the
concern. E.g. you'd store a 64 bit "base" timestamp in the header for
the first entry, and have smaller (signed) "delta" timestamps storing
offsets from that "base" timestamp.

This would take advantage of the fact that when we find loose objects
we're vanishingly unlikely to have them splayed over more than a few
days/weeks/months, or in the worst case a small number of years, from
the "base" (and if we ever do we could simply shrug and leave such
objects out of the pack entirely).

We could thus keep the 32 bit second-resolution timestamps you have
here, they'd just be signed deltas to the 64 bit signed "base" in a
header.
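
A rough sketch of what I mean. The struct and function names are
hypothetical, nothing here exists in the series, and the subtraction
assumes mtime and base are close enough that it cannot overflow int64_t.

```c
#include <stdint.h>

/* Hypothetical header: one signed 64-bit base for the whole file. */
struct mtimes_delta_sketch {
	int64_t base;
};

/*
 * Encode mtime as a signed 32-bit offset from the base. Returns 0 on
 * success, or -1 if the object is too far from the base to represent
 * (the "shrug and leave it out" case).
 */
static int encode_delta(const struct mtimes_delta_sketch *hdr,
			int64_t mtime, int32_t *delta)
{
	int64_t d = mtime - hdr->base;

	if (d < INT32_MIN || d > INT32_MAX)
		return -1;
	*delta = (int32_t)d;
	return 0;
}

/* Recover the full 64-bit timestamp from a stored delta. */
static int64_t decode_delta(const struct mtimes_delta_sketch *hdr,
			    int32_t delta)
{
	return hdr->base + delta;
}
```

The per-object cost stays at four bytes, and the representable window
is roughly 68 years on either side of whatever base the writer picks.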

Even better (again, if micro-optimizing this is really needed) would be
to store a 64 bit signed base and a table of 16 bit signed offsets.

We'd simply declare that for our expiry times we'd "snap" any such
values to the next day. Our current GC config exposes down-to-the-second
expiry times, but in practice nobody needs that. A 16 bit signed "day
offset" would give you 2^15/365 = 89 years +/- of day-resolution expiry
for objects. To avoid thundering herds we could even fake up an exact
down-to-the-second expiry on the computed day by combining the expiry
time & the first few bits of the OID.
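
Sketching the day-offset variant under the same caveats (hypothetical
names, no overflow handling on the subtraction). Rounding is toward the
later day, so a snapped time is never earlier than the real mtime.

```c
#include <stdint.h>

#define SECONDS_PER_DAY 86400

/*
 * Encode mtime as a signed 16-bit count of days from base, rounding
 * up to the next day boundary. Returns 0 on success, -1 if the
 * offset does not fit in 16 bits.
 */
static int encode_day_offset(int64_t base, int64_t mtime, int16_t *days)
{
	int64_t diff = mtime - base;
	int64_t d;

	/* Ceiling division that also behaves for negative offsets. */
	if (diff >= 0)
		d = (diff + SECONDS_PER_DAY - 1) / SECONDS_PER_DAY;
	else
		d = -((-diff) / SECONDS_PER_DAY);

	if (d < INT16_MIN || d > INT16_MAX)
		return -1;
	*days = (int16_t)d;
	return 0;
}

/* Recover the snapped timestamp (a whole number of days from base). */
static int64_t decode_day_offset(int64_t base, int16_t days)
{
	return base + (int64_t)days * SECONDS_PER_DAY;
}
```

The range works out to 32768 days, a little under 90 years, in each
direction from the base.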

== BREAK

Aside about time_t being signed vs. unsigned. This is edited from an
older off-list E-Mail of mine (from git-security): For time_t itself no
standard says that time_t must be signed, but in practice it's
ubiquitous.

This thread is informative
http://mm.icann.org/pipermail/tz/2004-July/012503.html it continues the
month after: http://mm.icann.org/pipermail/tz/2004-August/thread.html

Summary: Yeah it can be unsigned in theory, but it seems like nobody's
been crazy enough to try it, so it's de-facto standardized to
signed. Everyone has a Y2038 problem, nobody has a Y2106 problem. Well,
not with time_t, anyway; e.g. Linux filesystems tend to use unsigned 32
bit epochs: https://kernelnewbies.org/y2038/vfs


* Re: adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)
  2022-05-25  9:11               ` adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files) Ævar Arnfjörð Bjarmason
@ 2022-05-25 13:30                 ` Derrick Stolee
  2022-05-25 21:13                   ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-05-25 13:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Taylor Blau
  Cc: rsbecker, 'Jonathan Nieder', git, gitster, larsxschneider,
	tytso

On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
> I must say that I really don't like this part of the format. Is it
> really necessary to optimize the storage space here in a way that leaves
> open questions about future time_t compatibility, and having to
> introduce the first use of unsigned 32 bit timestamps to git's codebase?

The commit-graph file format uses unsigned 34-bit timestamps (packed
with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
signed timestamps" thing is something we've done before.
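
As a sketch of that packing, going by my reading of the commit-graph
format documentation (so treat the exact layout as an assumption): the
first 32-bit word holds the 30-bit level in its upper bits and bits 33
and 34 of the commit time in its lowest two bits, and the second word
holds the low 32 bits of the commit time. The function names are made
up, and the words are kept in host order here rather than the network
byte order used on disk.

```c
#include <stdint.h>

/* Pack a 30-bit level and a 34-bit timestamp into two 32-bit words. */
static void pack_cdat_sketch(uint32_t level, uint64_t commit_time,
			     uint32_t out[2])
{
	/* Caller must ensure level < 2^30 and commit_time < 2^34. */
	out[0] = (level << 2) | (uint32_t)((commit_time >> 32) & 0x3);
	out[1] = (uint32_t)(commit_time & 0xffffffffu);
}

/* Reassemble the 34-bit timestamp from the two words. */
static uint64_t unpack_time_sketch(const uint32_t in[2])
{
	return ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}

/* Extract the 30-bit level from the first word. */
static uint32_t unpack_level_sketch(const uint32_t in[2])
{
	return in[0] >> 2;
}
```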
 
> Yes, this is its own self-contained format, so we don't *need* time_t
> here, but it's also really handy if we can eventually consistently use
> 64 time_t everywhere and not worry about any compatibility issues, or
> unsigned v.s. signed, or to create our own little ext4-like signed 32
> bit timestamp format.

We can also use a new file format version when it is necessary. We
have a lot of time to add that detail without overly complicating the
format right now.

> If we really are trying to micro-optimize storage space here I'm willing
> to bet that this is still a bad/premature optimization. There's much
> better ways to store this sort of data in a compact way if that's the
> concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> the first entry, and have smaller (signed) "delta" timestamps storing
> offsets from that "base" timestamp.

This is a good idea for a v2 format when that is necessary.

Thanks,
-Stolee


* Re: [PATCH v5 00/17] cruft packs
  2022-05-25  7:53             ` Jonathan Nieder
@ 2022-05-25 19:59               ` Derrick Stolee
  2022-05-25 21:09                 ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Derrick Stolee @ 2022-05-25 19:59 UTC (permalink / raw)
  To: Jonathan Nieder, Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, gitster,
	larsxschneider, tytso

On 5/25/2022 3:53 AM, Jonathan Nieder wrote:
> Taylor Blau wrote:
>> On Tue, May 24, 2022 at 11:55:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>>>> Moreover, I can't seem to find any formats that _don't_ use that
>>>> convention.
>>>
>>> It's used in the reftable format.

The use in reftable is the only one I can find and that implementation
is not idiomatic. Specifically, the way the four-byte header was
implemented is not easy to extract and share in other formats.

This series does the good work of extracting oid_version() as a
common method across these formats so it is easier to share.

> It's also used in the formats described in
> Documentation/technical/hash-function-transition.

It documents things that have not been implemented, such as the v3
pack-index format:

  Pack index (.idx) files use a new v3 format that supports multiple
  hash functions. They have the following format (all integers are in
  network byte order):
(...)
  * 4-byte number of object formats in this pack index: 2
  * For each object format:
    ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
    ** 4-byte length in bytes of shortened object names. This is the
      shortest possible length needed to make names in the shortened
      object name table unambiguous.
    ** 4-byte integer, recording where tables relating to this format
      are stored in this index file, as an offset from the beginning.

This was added in your 752414ae431 (technical doc: add a design doc
for hash function transition, 2017-09-27), but has not been acted upon
yet.

> [...]
>> Sounds good. Unless others have a very strong opinion, let's leave it as
>> is.
> 
> File formats are one of those things where a little time early can save
> a lot of work later.  If there were a strong reason to use "1" and "2"
> here then I'd be okay with living with it --- I'm a pragmatic person.
> But in general, using the magic numbers instead of a sequential value is
> really helpful both in making the file formats more self-explanatory and
> in making it possible to experiment with multiple new hash_algos at the
> same time.
> 
> The main argument I'm hearing for using "1" and "2" is "because some
> other formats got that wrong".  That reason is the opposite of
> compelling to me: it makes me suspect that as a project we should more
> eagerly break the old bad habits and form new ones.  I guess this
> qualifies as a very strong opinion.

Either way, these are magic numbers. One happens to somewhat spell
out something when looking at the file in a hex editor with ASCII
previews, but that doesn't change the fact that it is most important
that the hash function is correctly indicated by the file format and
parsed by the Git executable (not a human).

I'd much rather have a consistent and proven way of specifying the
hash value (using the oid_version() helper) than to try and make a
new mechanism.

Thanks,
-Stolee


* Re: [PATCH v5 00/17] cruft packs
  2022-05-25 19:59               ` Derrick Stolee
@ 2022-05-25 21:09                 ` Taylor Blau
  2022-05-26  0:06                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-25 21:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jonathan Nieder, Ævar Arnfjörð Bjarmason, git,
	gitster, larsxschneider, tytso

On Wed, May 25, 2022 at 03:59:24PM -0400, Derrick Stolee wrote:
> I'd much rather have a consistent and proven way of specifying the
> hash value (using the oid_version() helper) than to try and make a
> new mechanism.

To be clear, I absolutely don't think any of us should have the attitude
of repeating past bad decisions for the sake of consistency.

As best I can tell, the disagreement between Jonathan and me is on
whether using "1" and "2" to identify which hash function is used by the
.mtimes file is OK or not. I happen to think that it is acceptable, so
the choice to continue to adopt this pattern was motivated by being
consistent with a pattern that is good and works.

Thanks,
Taylor


* Re: adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)
  2022-05-25 13:30                 ` Derrick Stolee
@ 2022-05-25 21:13                   ` Taylor Blau
  2022-05-26  0:02                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-25 21:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, rsbecker,
	'Jonathan Nieder', git, gitster, larsxschneider, tytso

On Wed, May 25, 2022 at 09:30:55AM -0400, Derrick Stolee wrote:
> On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
> > I must say that I really don't like this part of the format. Is it
> > really necessary to optimize the storage space here in a way that leaves
> > open questions about future time_t compatibility, and having to
> > introduce the first use of unsigned 32 bit timestamps to git's codebase?
>
> The commit-graph file format uses unsigned 34-bit timestamps (packed
> with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
> signed timestamps" thing is something we've done before.
>
> > Yes, this is its own self-contained format, so we don't *need* time_t
> > here, but it's also really handy if we can eventually consistently use
> > 64 time_t everywhere and not worry about any compatibility issues, or
> > unsigned v.s. signed, or to create our own little ext4-like signed 32
> > bit timestamp format.
>
> We can also use a new file format version when it is necessary. We
> have a lot of time to add that detail without overly complicating the
> format right now.
>
> > If we really are trying to micro-optimize storage space here I'm willing
> > to bet that this is still a bad/premature optimization. There's much
> > better ways to store this sort of data in a compact way if that's the
> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> > the first entry, and have smaller (signed) "delta" timestamps storing
> > offsets from that "base" timestamp.
>
> This is a good idea for a v2 format when that is necessary.

I agree here.

I'm not opposed to such a change (or even being the one to work on it!),
but I would encourage us to pursue that change outside of this series,
since it can easily be done on top.

Of course, if we ever did decide to implement 64-bit mtimes, we would
have to maintain support for reading both the 32-bit and 64-bit values.
But I think the code is well-equipped to do that, and it could be done
on top without significant additional complexity.

Thanks,
Taylor


* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-25  7:48         ` Jonathan Nieder
@ 2022-05-25 21:36           ` Taylor Blau
  2022-05-25 21:58             ` rsbecker
  0 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-25 21:36 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, avarab, derrickstolee, gitster, larsxschneider, tytso

On Wed, May 25, 2022 at 12:48:54AM -0700, Jonathan Nieder wrote:
> >> What does mtimes_map contain?  A comment would help.
> >
> > It contains a pointer at the beginning of the mmapped region of the
> > .mtimes file, similar to revindex_map above it.
>
> To be clear, in cases like this by "comment" I mean "in-code comment".
> I.e., my interest is not that _I_ find out the answer but that the
> code becomes more maintainable via the answer becoming easier to find.

OK. I'll add a comment in the fixup! patch which I'm about to send.

> [...]
> >> This seems simple enough that it's not obvious we need more code
> >> sharing.  Do you agree?  If so, I'd suggest just removing the
> >> NEEDSWORK comment.
> >
> > Yeah, it is conceptually simple, though it feels like the sort of thing
> > that could benefit from not having to be written once for each
> > extension (hence the comment).
>
> The reason I asked is that the NEEDSWORK here actually got in the way
> of comprehension for me --- it made me wonder "is there some
> complexity here I'm missing?"
>
> That's why I'd suggest one of
> - removing the NEEDSWORK comment
> - going ahead and implementing the code sharing you mean, or
> - fleshing out the NEEDSWORK comment so the reader can wonder less

I am a little sad to remove it, since I thought it was useful as-is. But
I can just as easily remember to come back to this myself in the future,
so if it is distracting to you in the meantime, then I don't mind
holding onto it in my own head.

> >>> +
> >>> +#define MTIMES_HEADER_SIZE (12)
> >>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
> >>
> >> Hm, the all-caps name makes this feel like a compile-time constant but
> >> it contains a reference to the_hash_algo.  Could it be an inline
> >> function instead?
> >
> > Yes, it could be an inline function, but I don't think there is
> > necessarily anything wrong with it being a #define'd macro. There are
> > some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
> > GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which also use
> > the_hash_algo on the right-hand side of a `#define`.
>
> Those are due to an incomplete migration from use of the true constant
> GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
> In other words, "other examples do it wrong" doesn't feel like a great
> justification for making it worse in new code.

Fair point. I can imagine reasons for the existing pattern, but updating
it to handle the variable rawsz is easy to do (and it probably should
have been that way since the beginning).

> [...]
> >>> +static int load_pack_mtimes_file(char *mtimes_file,
> >>> +				 uint32_t num_objects,
> >>> +				 const uint32_t **data_p, size_t *len_p)
> >>
> >> What does this function do?  A comment would help.
> >
> > I know that I'm biased as the author of this code, but I think the
> > signature is clear here. At least, I'm not sure what information a
> > comment would add that the function name and its arguments don't already
> > convey.
>
> Ah, thanks for this point of clarification.  What isn't clear from the
> signature is
> - when should I call this function?
> - what does its return value represent?
> - how does it handle errors?
>
> I agree that the parameters are self-explanatory.

I'm hesitant to over-document a static function with a single caller,
but when looking at this, I think there is an opportunity to document
_its_ caller (`load_pack_mtimes()`) which isn't static, but was also
missing documentation.

> >>> +cleanup:
> >>> +	if (ret) {
> >>> +		if (data)
> >>> +			munmap(data, mtimes_size);
> >>> +	} else {
> >>> +		*len_p = mtimes_size;
> >>> +		*data_p = (const uint32_t *)data;
> >>
> >> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
> >> function could make that more obvious.
> >
> > `data` is definitely uint32_t aligned, but this is a tradeoff, since if
> > we wrote:
> >
> >     uint32_t *data = xmmap(...);
> >
> > then I think we would have to change the case where ret is non-zero to be:
> >
> >     if (data)
> >         munmap((void*)data, ...);
> >
> > and likewise, data_p is const.
>
> Doing it that way sounds great to me.  That way, the type contains the
> information we need up-front and the safety of the cast is obvious in
> the place where the cast is needed.
>
> (Although my understanding is also that in C it's fine to pass a
> uint32_t* to a function expecting a void*, so the second cast would
> also not be needed.)
>
> [...]

Done, thanks for the suggestion.

Thanks,
Taylor


* RE: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-25 21:36           ` Taylor Blau
@ 2022-05-25 21:58             ` rsbecker
  2022-05-25 22:59               ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: rsbecker @ 2022-05-25 21:58 UTC (permalink / raw)
  To: 'Taylor Blau', 'Jonathan Nieder'
  Cc: git, avarab, derrickstolee, gitster, larsxschneider, tytso

On May 25, 2022 5:37 PM, Taylor Blau wrote:
>On Wed, May 25, 2022 at 12:48:54AM -0700, Jonathan Nieder wrote:
>> >> What does mtimes_map contain?  A comment would help.
>> >
>> > It contains a pointer at the beginning of the mmapped region of the
>> > .mtimes file, similar to revindex_map above it.
>>
>> To be clear, in cases like this by "comment" I mean "in-code comment".
>> I.e., my interest is not that _I_ find out the answer but that the
>> code becomes more maintainable via the answer becoming easier to find.
>
>OK. I'll add a comment in the fixup! patch which I'm about to send.
>
>> [...]
>> >> This seems simple enough that it's not obvious we need more code
>> >> sharing.  Do you agree?  If so, I'd suggest just removing the
>> >> NEEDSWORK comment.
>> >
>> > Yeah, it is conceptually simple, though it feels like the sort of
>> > thing that could benefit from not having to be written once for each
>> > extension (hence the comment).
>>
>> The reason I asked is that the NEEDSWORK here actually got in the way
>> of comprehension for me --- it made me wonder "is there some
>> complexity here I'm missing?"
>>
>> That's why I'd suggest one of
>> - removing the NEEDSWORK comment
>> - going ahead and implementing the code sharing you mean, or
>> - fleshing out the NEEDSWORK comment so the reader can wonder less
>
>I am a little sad to remove it, since I thought it was useful as-is. But I can just as
>easily remember to come back to this myself in the future, so if it is distracting to
>you in the meantime, then I don't mind holding onto it in my own head.
>
>> >>> +
>> >>> +#define MTIMES_HEADER_SIZE (12)
>> >>> +#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>> >>
>> >> Hm, the all-caps name makes this feel like a compile-time constant
>> >> but it contains a reference to the_hash_algo.  Could it be an
>> >> inline function instead?
>> >
>> > Yes, it could be an inline function, but I don't think there is
>> > necessarily anything wrong with it being a #define'd macro. There
>> > are some other examples, e.g., RIDX_MIN_SIZE, MIDX_MIN_SIZE,
>> > GRAPH_DATA_WIDTH, and PACK_SIZE_THRESHOLD (to name a few) which
>> > also use the_hash_algo on the right-hand side of a `#define`.
>>
>> Those are due to an incomplete migration from use of the true constant
>> GIT_SHA1_RAWSZ to use of the dynamic value the_hash_algo->rawsz, no?
>> In other words, "other examples do it wrong" doesn't feel like a great
>> justification for making it worse in new code.
>
>Fair point. I can imagine reasons for the existing pattern, but updating it to handle
>the variable rawsz is easy to do (and it probably should have been that way since
>the beginning).
>
>> [...]
>> >>> +static int load_pack_mtimes_file(char *mtimes_file,
>> >>> +				 uint32_t num_objects,
>> >>> +				 const uint32_t **data_p, size_t *len_p)
>> >>
>> >> What does this function do?  A comment would help.
>> >
>> > I know that I'm biased as the author of this code, but I think the
>> > signature is clear here. At least, I'm not sure what information a
>> > comment would add that the function name and its arguments don't
>> > already convey.
>>
>> Ah, thanks for this point of clarification.  What isn't clear from the
>> signature is
>> - when should I call this function?
>> - what does its return value represent?
>> - how does it handle errors?
>>
>> I agree that the parameters are self-explanatory.
>
>I'm hesitant to over-document a static function with a single caller, but when
>looking at this, I think there is an opportunity to document _its_ caller
>(`load_pack_mtimes()`) which isn't static, but was also missing documentation.
>
>> >>> +cleanup:
>> >>> +	if (ret) {
>> >>> +		if (data)
>> >>> +			munmap(data, mtimes_size);
>> >>> +	} else {
>> >>> +		*len_p = mtimes_size;
>> >>> +		*data_p = (const uint32_t *)data;
>> >>
>> >> Do we know that 'data' is uint32_t aligned?  Casting earlier in the
>> >> function could make that more obvious.
>> >
>> > `data` is definitely uint32_t aligned, but this is a tradeoff, since
>> > if we wrote:
>> >
>> >     uint32_t *data = xmmap(...);
>> >
>> > then I think we would have to change the case where ret is non-zero to be:
>> >
>> >     if (data)
>> >         munmap((void*)data, ...);
>> >
>> > and likewise, data_p is const.
>>
>> Doing it that way sounds great to me.  That way, the type contains the
>> information we need up-front and the safety of the cast is obvious in
>> the place where the cast is needed.
>>
>> (Although my understanding is also that in C it's fine to pass a
>> uint32_t* to a function expecting a void*, so the second cast would
>> also not be needed.)

I do not think C99 allows this in 100% of cases, specifically if there is a const void * involved. gcc does not care. I do not think C89 cares either. I will watch out for it when this is merged.
--Randall


* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-25 21:58             ` rsbecker
@ 2022-05-25 22:59               ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-25 22:59 UTC (permalink / raw)
  To: rsbecker
  Cc: 'Jonathan Nieder', git, avarab, derrickstolee, gitster,
	larsxschneider, tytso

On Wed, May 25, 2022 at 05:58:49PM -0400, rsbecker@nexbridge.com wrote:
> >> > `data` is definitely uint32_t aligned, but this is a tradeoff, since
> >> > if we wrote:
> >> >
> >> >     uint32_t *data = xmmap(...);
> >> >
> >> > then I think we would have to change the case where ret is non-zero to be:
> >> >
> >> >     if (data)
> >> >         munmap((void*)data, ...);
> >> >
> >> > and likewise, data_p is const.
> >>
> >> Doing it that way sounds great to me.  That way, the type contains the
> >> information we need up-front and the safety of the cast is obvious in
> >> the place where the cast is needed.
> >>
> >> (Although my understanding is also that in C it's fine to pass a
> >> uint32_t* to a function expecting a void*, so the second cast would
> >> also not be needed.)
>
> I do not think C99 allows this in 100% of cases, specifically if
> there is a const void * involved. gcc does not care. I do not think C89
> cares either. I will watch out for it when this is merged.

Thanks for the heads up. I looked through the results of "git grep '=
xmmap'" to see if we had contemporary examples of either assigning to a
non-'void *', or passing a non-'void *' variable to munmap.

Luckily, we have both, so this shouldn't cause a problem. fixup! patch
incoming shortly...

Thanks,
Taylor
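
For readers following the cast discussion, here is a minimal sketch of
the conversion rules involved. It is not taken from git's source:
release_region() is a hypothetical stand-in for munmap(), which likewise
takes a plain 'void *'. In standard C a pointer to a non-const object
type converts to 'void *' implicitly, so the first call needs no cast.
Only the pointer-to-const needs one, because the conversion would
otherwise drop the qualifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for munmap(): takes a plain 'void *'. */
static int release_region(void *addr, size_t len)
{
	assert(addr != NULL);
	return len ? 0 : -1;
}

/* Returns 0 if both calls succeed. */
static int conversion_demo(void)
{
	static uint32_t storage[3];
	uint32_t *data = storage;
	const uint32_t *view = data;	/* adding 'const' is implicit */
	int ret = 0;

	/* uint32_t * -> void *: implicit, no cast needed */
	ret |= release_region(data, sizeof(storage));

	/*
	 * const uint32_t * -> void *: this would discard 'const', so an
	 * explicit cast is required; compilers diagnose the call
	 * without it.
	 */
	ret |= release_region((void *)view, sizeof(storage));

	return ret;
}
```

This matches the shape of the fixup: keeping the local variable as a
non-const 'uint32_t *' lets it be passed to munmap() as-is, and assigning
it to the 'const uint32_t *' out-parameter only adds a qualifier.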

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
  2022-05-24 19:32     ` Jonathan Nieder
@ 2022-05-25 23:02     ` Taylor Blau
  2022-05-26  0:30       ` Junio C Hamano
  2023-06-01 13:01     ` Andreas Schwab
  2 siblings, 1 reply; 201+ messages in thread
From: Taylor Blau @ 2022-05-25 23:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, avarab, derrickstolee, gitster, jrnieder, larsxschneider,
	tytso

Junio,

On Fri, May 20, 2022 at 07:17:35PM -0400, Taylor Blau wrote:
> To store the individual mtimes of objects in a cruft pack, introduce a
> new `.mtimes` format that can optionally accompany a single pack in the
> repository.

Like I mentioned in this sub-thread, here is a small fixup! to apply on
top of this patch when queueing. I'm hoping this will be easier than
reapplying the dozen+ or so patches in this series (the rest of which
are unchanged). But if it isn't, please let me know and I can send you a
reroll of the whole thing.

In the meantime, here's the fixup...

--- 8< ---
Subject: [PATCH] fixup! pack-mtimes: support reading .mtimes files

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-store.h |  5 +++++
 pack-mtimes.c  | 35 +++++++++++++++++++----------------
 pack-mtimes.h  | 11 +++++++++++
 3 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/object-store.h b/object-store.h
index 2c4671ed7a..05cc9a33ed 100644
--- a/object-store.h
+++ b/object-store.h
@@ -122,6 +122,11 @@ struct packed_git {
 	const uint32_t *revindex_data;
 	const uint32_t *revindex_map;
 	size_t revindex_size;
+	/*
+	 * mtimes_map points at the beginning of the memory mapped region of
+	 * this pack's corresponding .mtimes file, and mtimes_size is the size
+	 * of that .mtimes file
+	 */
 	const uint32_t *mtimes_map;
 	size_t mtimes_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
diff --git a/pack-mtimes.c b/pack-mtimes.c
index 46ad584af1..0e0aafdcb0 100644
--- a/pack-mtimes.c
+++ b/pack-mtimes.c
@@ -1,3 +1,4 @@
+#include "git-compat-util.h"
 #include "pack-mtimes.h"
 #include "object-store.h"
 #include "packfile.h"
@@ -7,12 +8,10 @@ static char *pack_mtimes_filename(struct packed_git *p)
 	size_t len;
 	if (!strip_suffix(p->pack_name, ".pack", &len))
 		BUG("pack_name does not end in .pack");
-	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
 	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
 }

 #define MTIMES_HEADER_SIZE (12)
-#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))

 struct mtimes_header {
 	uint32_t signature;
@@ -26,10 +25,9 @@ static int load_pack_mtimes_file(char *mtimes_file,
 {
 	int fd, ret = 0;
 	struct stat st;
-	void *data = NULL;
-	size_t mtimes_size;
+	uint32_t *data = NULL;
+	size_t mtimes_size, expected_size;
 	struct mtimes_header header;
-	uint32_t *hdr;

 	fd = git_open(mtimes_file);

@@ -44,21 +42,16 @@ static int load_pack_mtimes_file(char *mtimes_file,

 	mtimes_size = xsize_t(st.st_size);

-	if (mtimes_size < MTIMES_MIN_SIZE) {
+	if (mtimes_size < MTIMES_HEADER_SIZE) {
 		ret = error(_("mtimes file %s is too small"), mtimes_file);
 		goto cleanup;
 	}

-	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
-		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
-		goto cleanup;
-	}
+	data = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);

-	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
-
-	header.signature = ntohl(hdr[0]);
-	header.version = ntohl(hdr[1]);
-	header.hash_id = ntohl(hdr[2]);
+	header.signature = ntohl(data[0]);
+	header.version = ntohl(data[1]);
+	header.hash_id = ntohl(data[2]);

 	if (header.signature != MTIMES_SIGNATURE) {
 		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
@@ -77,13 +70,23 @@ static int load_pack_mtimes_file(char *mtimes_file,
 		goto cleanup;
 	}

+
+	expected_size = MTIMES_HEADER_SIZE;
+	expected_size = st_add(expected_size, st_mult(sizeof(uint32_t), num_objects));
+	expected_size = st_add(expected_size, 2 * (header.hash_id == 1 ? GIT_SHA1_RAWSZ : GIT_SHA256_RAWSZ));
+
+	if (mtimes_size != expected_size) {
+		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
+		goto cleanup;
+	}
+
 cleanup:
 	if (ret) {
 		if (data)
 			munmap(data, mtimes_size);
 	} else {
 		*len_p = mtimes_size;
-		*data_p = (const uint32_t *)data;
+		*data_p = data;
 	}

 	close(fd);
diff --git a/pack-mtimes.h b/pack-mtimes.h
index 38ddb9f893..cc957b3e85 100644
--- a/pack-mtimes.h
+++ b/pack-mtimes.h
@@ -8,8 +8,19 @@

 struct packed_git;

+/*
+ * Loads the .mtimes file corresponding to "p", if any, returning zero
+ * on success.
+ */
 int load_pack_mtimes(struct packed_git *p);

+/* Returns the mtime associated with the object at position "pos" (in
+ * lexicographic/index order) in pack "p".
+ *
+ * Note that it is a BUG() to call this function if either (a) "p" does
+ * not have a corresponding .mtimes file, or (b) it does, but it hasn't
+ * been loaded
+ */
 uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);

 #endif
--
2.36.1.94.gb0d54bedca

--- >8 ---
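
The size check in the fixup above can be sketched as a standalone
helper. This is only an illustration under stated assumptions, not the
code in pack-mtimes.c: expected_mtimes_size() is a hypothetical name,
the raw hash sizes (20 bytes for SHA-1, 32 for SHA-256) are written out
as literals, and hash_id values 1 and 2 are taken to mean SHA-1 and
SHA-256 as discussed elsewhere in this thread. Unlike the ternary in
the patch, the sketch rejects any other hash_id instead of treating it
as SHA-256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 12	/* signature, version, hash_id: 3 x 4 bytes */

/*
 * Returns the expected size of the file in bytes, or 0 if hash_id is
 * unknown or the arithmetic would overflow size_t.
 */
static size_t expected_mtimes_size(uint32_t num_objects, uint32_t hash_id)
{
	size_t rawsz, fixed;

	assert(HEADER_SIZE == 3 * sizeof(uint32_t));

	if (hash_id == 1)
		rawsz = 20;	/* SHA-1 */
	else if (hash_id == 2)
		rawsz = 32;	/* SHA-256 */
	else
		return 0;

	/* header plus two hash-sized trailer fields */
	fixed = HEADER_SIZE + 2 * rawsz;

	/* one 4-byte mtime per object, guarding against overflow */
	if (num_objects > (SIZE_MAX - fixed) / sizeof(uint32_t))
		return 0;

	return fixed + (size_t)num_objects * sizeof(uint32_t);
}
```

For example, a SHA-1 file covering three objects would be expected to
occupy 12 + 3 * 4 + 2 * 20 = 64 bytes. Because the trailer size depends
on hash_id, the header has to be read before this check can run, which
is why the patch moves the check after the header parsing.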

* Re: adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)
  2022-05-25 21:13                   ` Taylor Blau
@ 2022-05-26  0:02                     ` Ævar Arnfjörð Bjarmason
  2022-05-26  0:12                       ` Taylor Blau
  0 siblings, 1 reply; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-26  0:02 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee, rsbecker, 'Jonathan Nieder', git, gitster,
	larsxschneider, tytso


On Wed, May 25 2022, Taylor Blau wrote:

> On Wed, May 25, 2022 at 09:30:55AM -0400, Derrick Stolee wrote:
>> On 5/25/2022 5:11 AM, Ævar Arnfjörð Bjarmason wrote:
>> > I must say that I really don't like this part of the format. Is it
>> > really necessary to optimize the storage space here in a way that leaves
>> > open questions about future time_t compatibility, and having to
>> > introduce the first use of unsigned 32 bit timestamps to git's codebase?
>>
>> The commit-graph file format uses unsigned 34-bit timestamps (packed
>> with 30-bit topological levels in the CDAT chunk), so this "not-64-bit
>> signed timestamps" thing is something we've done before.
>>
>> > Yes, this is its own self-contained format, so we don't *need* time_t
>> > here, but it's also really handy if we can eventually consistently use
>> > 64-bit time_t everywhere and not worry about any compatibility issues, or
>> > unsigned vs. signed, or to create our own little ext4-like signed 32
>> > bit timestamp format.
>>
>> We can also use a new file format version when it is necessary. We
>> have a lot of time to add that detail without overly complicating the
>> format right now.
>>
>> > If we really are trying to micro-optimize storage space here I'm willing
>> > to bet that this is still a bad/premature optimization. There are much
>> > better ways to store this sort of data in a compact way if that's the
>> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
>> > the first entry, and have smaller (signed) "delta" timestamps storing
>> > offsets from that "base" timestamp.
>>
>> This is a good idea for a v2 format when that is necessary.
>
> I agree here.
>
> I'm not opposed to such a change (or even being the one to work on it!),
> but I would encourage us to pursue that change outside of this series,
> since it can easily be done on top.
>
> Of course, if we ever did decide to implement 64-bit mtimes, we would
> have to maintain support for reading both the 32-bit and 64-bit values.
> But I think the code is well-equipped to do that, and it could be done
> on top without significant additional complexity.

Do you mean "on top" in the sense that we'd expect that before the next
release, so that we wouldn't need to deal with bumping the format, and
have some phase-out period for the older version etc.?

Or that we would need to treat what's landing here as something we'll
need to support going forward?

I think if a format change is worthwhile doing at all that it's worth
just doing it now if it's going to be the latter of those, as changing
file formats before they're in the wild is easy, but after that it's at
best a bit tedious. E.g. we'll need testing to see how we deal with
mixed new/old format files etc. etc.

* Re: [PATCH v5 00/17] cruft packs
  2022-05-25 21:09                 ` Taylor Blau
@ 2022-05-26  0:06                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 201+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-05-26  0:06 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee, Jonathan Nieder, git, gitster, larsxschneider,
	tytso


On Wed, May 25 2022, Taylor Blau wrote:

> On Wed, May 25, 2022 at 03:59:24PM -0400, Derrick Stolee wrote:
>> I'd much rather have a consistent and proven way of specifying the
>> hash value (using the oid_version() helper) than to try and make a
>> new mechanism.
>
> To be clear, I absolutely don't think any of us should have the attitude
> of repeating past bad decisions for the sake of consistency.
>
> As best I can tell, our (Jonathan's and my) disagreement is on whether
> using "1" and "2" to identify which hash function is used by the .mtimes
> file is OK or not. I happen to think that it is acceptable, so the
> choice to continue to adopt this pattern was motivated by being
> consistent with a pattern that is good and works.

I don't have a strong opinion on whether we "bless" that or not, and say
that we should just use 1, 2 etc. going forward or not.

But I do think that us doing so initially wasn't intentional, and has
been in opposition to a strongly worded claim in a comment in hash.h
(which I modified in my earlier related RFC series).

So maybe not part of this series, but it seems prudent if you feel
strongly about using this for new formats over what hash.h is currently
recommending that we have some patch sooner than later to update it
accordingly.

* Re: adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files)
  2022-05-26  0:02                     ` Ævar Arnfjörð Bjarmason
@ 2022-05-26  0:12                       ` Taylor Blau
  0 siblings, 0 replies; 201+ messages in thread
From: Taylor Blau @ 2022-05-26  0:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Derrick Stolee, rsbecker, 'Jonathan Nieder',
	git, gitster, larsxschneider, tytso

On Thu, May 26, 2022 at 02:02:39AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> > If we really are trying to micro-optimize storage space here I'm willing
> >> > to bet that this is still a bad/premature optimization. There are much
> >> > better ways to store this sort of data in a compact way if that's the
> >> > concern. E.g. you'd store a 64 bit "base" timestamp in the header for
> >> > the first entry, and have smaller (signed) "delta" timestamps storing
> >> > offsets from that "base" timestamp.
> >>
> >> This is a good idea for a v2 format when that is necessary.
> >
> > I agree here.
> >
> > I'm not opposed to such a change (or even being the one to work on it!),
> > but I would encourage us to pursue that change outside of this series,
> > since it can easily be done on top.
> >
> > Of course, if we ever did decide to implement 64-bit mtimes, we would
> > have to maintain support for reading both the 32-bit and 64-bit values.
> > But I think the code is well-equipped to do that, and it could be done
> > on top without significant additional complexity.
>
> Do you mean "on top" in the sense that we'd expect that before the next
> release, so that we wouldn't need to deal with bumping the format, and
> have some phase-out period for the older version etc.?
>
> Or that we would need to treat what's landing here as something we'll
> need to support going forward?

My plan is to treat what will hopefully land here as something we're
going to support.

I meant "on top" in the sense that the format implemented here does not
restrict us against making changes (like adding support for wider
records) in the future. IOW, I did not mean to suggest that we should
expect more patches from me in this cycle to deprecate parts of the v1
format.

In other words (again ;-)), I would like to see us ship this format with
the existing 32-bit records.

> I think if a format change is worthwhile doing at all that it's worth
> just doing it now if it's going to be the latter of those, as changing
> file formats before they're in the wild is easy, but after that it's at
> best a bit tedious. E.g. we'll need testing to see how we deal with
> mixed new/old format files etc. etc.

I can understand where you're coming from, though as I noted earlier in
the thread, I don't think changing the format in the manner you suggest
would be that difficult in practice.

But in the meantime, the existing format is useful and works, and I
don't think we should go back to the drawing board for something that we
can do later if we decide to.

Thanks,
Taylor
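
To make the "base plus delta" idea from earlier in this sub-thread
concrete, here is a rough sketch of what such a layout could look like.
It is purely hypothetical and not part of this series or of any
existing .mtimes version: struct delta_mtimes, delta_mtime_at() and
encode_delta() are invented names, and the sketch assumes timestamps
stay in a range where the 64-bit subtraction cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: one 64-bit base timestamp plus a signed 32-bit
 * offset per object.
 */
struct delta_mtimes {
	int64_t base;
	const int32_t *deltas;
	uint32_t nr;
};

/* Reconstruct the full 64-bit mtime for the object at "pos". */
static int64_t delta_mtime_at(const struct delta_mtimes *m, uint32_t pos)
{
	assert(pos < m->nr);
	return m->base + m->deltas[pos];
}

/*
 * Stores the offset of "mtime" from "base" in "out" and returns 0, or
 * returns -1 if the offset does not fit in 32 bits.
 */
static int encode_delta(int64_t base, int64_t mtime, int32_t *out)
{
	int64_t diff = mtime - base;

	if (diff < INT32_MIN || diff > INT32_MAX)
		return -1;
	*out = (int32_t)diff;
	return 0;
}
```

Each record stays four bytes wide, as in the format proposed here, but
a writer has to pick a base and handle objects whose mtime lies more
than roughly 68 years away from it, which is the kind of added
complexity the v1 format avoids.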

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-25 23:02     ` Taylor Blau
@ 2022-05-26  0:30       ` Junio C Hamano
  0 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-05-26  0:30 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, avarab, derrickstolee, jrnieder, larsxschneider, tytso

Taylor Blau <me@ttaylorr.com> writes:

> diff --git a/pack-mtimes.c b/pack-mtimes.c
> index 46ad584af1..0e0aafdcb0 100644
> --- a/pack-mtimes.c
> +++ b/pack-mtimes.c
> @@ -1,3 +1,4 @@
> +#include "git-compat-util.h"
>  #include "pack-mtimes.h"
>  #include "object-store.h"
>  #include "packfile.h"
> @@ -7,12 +8,10 @@ static char *pack_mtimes_filename(struct packed_git *p)
>  	size_t len;
>  	if (!strip_suffix(p->pack_name, ".pack", &len))
>  		BUG("pack_name does not end in .pack");
> -	/* NEEDSWORK: this could reuse code from pack-revindex.c. */
>  	return xstrfmt("%.*s.mtimes", (int)len, p->pack_name);
>  }
>
>  #define MTIMES_HEADER_SIZE (12)
> -#define MTIMES_MIN_SIZE (MTIMES_HEADER_SIZE + (2 * the_hash_algo->rawsz))
>
>  struct mtimes_header {
>  	uint32_t signature;
> @@ -26,10 +25,9 @@ static int load_pack_mtimes_file(char *mtimes_file,
>  {
>  	int fd, ret = 0;
>  	struct stat st;
> -	void *data = NULL;
> -	size_t mtimes_size;
> +	uint32_t *data = NULL;
> +	size_t mtimes_size, expected_size;
>  	struct mtimes_header header;
> -	uint32_t *hdr;
>
>  	fd = git_open(mtimes_file);
>
> @@ -44,21 +42,16 @@ static int load_pack_mtimes_file(char *mtimes_file,
>
>  	mtimes_size = xsize_t(st.st_size);
>
> -	if (mtimes_size < MTIMES_MIN_SIZE) {
> +	if (mtimes_size < MTIMES_HEADER_SIZE) {
>  		ret = error(_("mtimes file %s is too small"), mtimes_file);
>  		goto cleanup;
>  	}
>
> -	if (mtimes_size - MTIMES_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> -		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
> -		goto cleanup;
> -	}
> +	data = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
>
> -	data = hdr = xmmap(NULL, mtimes_size, PROT_READ, MAP_PRIVATE, fd, 0);
> -
> -	header.signature = ntohl(hdr[0]);
> -	header.version = ntohl(hdr[1]);
> -	header.hash_id = ntohl(hdr[2]);
> +	header.signature = ntohl(data[0]);
> +	header.version = ntohl(data[1]);
> +	header.hash_id = ntohl(data[2]);

So, instead of assuming that the size of the file is cast in stone
to match the size the current implementation happens to give and
reject a file from a future version, we check the header first to
give a more readable error when we see a version of the file that
we do not understand.

Makes sense.

At least, "here is a small fixup!" should have been accompanied by a
brief explanation to say something like that, i.e. why a fixup is
needed, what shortcoming in the original it is meant to address,
etc.

Will queue between 2/17 and 3/17 without squashing (yet).

Thanks.

>  	if (header.signature != MTIMES_SIGNATURE) {
>  		ret = error(_("mtimes file %s has unknown signature"), mtimes_file);
> @@ -77,13 +70,23 @@ static int load_pack_mtimes_file(char *mtimes_file,
>  		goto cleanup;
>  	}
>
> +
> +	expected_size = MTIMES_HEADER_SIZE;
> +	expected_size = st_add(expected_size, st_mult(sizeof(uint32_t), num_objects));
> +	expected_size = st_add(expected_size, 2 * (header.hash_id == 1 ? GIT_SHA1_RAWSZ : GIT_SHA256_RAWSZ));
> +
> +	if (mtimes_size != expected_size) {
> +		ret = error(_("mtimes file %s is corrupt"), mtimes_file);
> +		goto cleanup;
> +	}
> +
>  cleanup:
>  	if (ret) {
>  		if (data)
>  			munmap(data, mtimes_size);
>  	} else {
>  		*len_p = mtimes_size;
> -		*data_p = (const uint32_t *)data;
> +		*data_p = data;
>  	}
>
>  	close(fd);
> diff --git a/pack-mtimes.h b/pack-mtimes.h
> index 38ddb9f893..cc957b3e85 100644
> --- a/pack-mtimes.h
> +++ b/pack-mtimes.h
> @@ -8,8 +8,19 @@
>
>  struct packed_git;
>
> +/*
> + * Loads the .mtimes file corresponding to "p", if any, returning zero
> + * on success.
> + */
>  int load_pack_mtimes(struct packed_git *p);
>
> +/* Returns the mtime associated with the object at position "pos" (in
> + * lexicographic/index order) in pack "p".
> + *
> + * Note that it is a BUG() to call this function if either (a) "p" does
> + * not have a corresponding .mtimes file, or (b) it does, but it hasn't
> + * been loaded
> + */
>  uint32_t nth_packed_mtime(struct packed_git *p, uint32_t pos);
>
>  #endif
> --
> 2.36.1.94.gb0d54bedca
>
> --- >8 ---

* Re: [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-05-20 23:18   ` [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
@ 2022-06-19  5:38     ` René Scharfe
  2022-06-21 15:58       ` Junio C Hamano
  0 siblings, 1 reply; 201+ messages in thread
From: René Scharfe @ 2022-06-19  5:38 UTC (permalink / raw)
  To: Taylor Blau, git
  Cc: avarab, derrickstolee, gitster, jrnieder, larsxschneider, tytso

On 21.05.22 at 01:18, Taylor Blau wrote:
> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index 853967dea0..ba4e67700e 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
>  be performed as well.
>
>
> +--cruft::
> +	When expiring unreachable objects, pack them separately into a
> +	cruft pack instead of storing the loose objects as loose
> +	objects.

The last part looks tautological.  How about:

--- >8 ---
Subject: [PATCH] gc: simplify --cruft description

Remove duplicate "loose objects".

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-gc.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index ba4e67700e..0af7540a0c 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -56,8 +56,7 @@ be performed as well.

 --cruft::
 	When expiring unreachable objects, pack them separately into a
-	cruft pack instead of storing the loose objects as loose
-	objects.
+	cruft pack instead of storing them as loose objects.

 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
--
2.36.1

* Re: [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose
  2022-06-19  5:38     ` René Scharfe
@ 2022-06-21 15:58       ` Junio C Hamano
  0 siblings, 0 replies; 201+ messages in thread
From: Junio C Hamano @ 2022-06-21 15:58 UTC (permalink / raw)
  To: René Scharfe
  Cc: Taylor Blau, git, avarab, derrickstolee, jrnieder, larsxschneider,
	tytso

René Scharfe <l.s.r@web.de> writes:

> On 21.05.22 at 01:18, Taylor Blau wrote:
>> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
>> index 853967dea0..ba4e67700e 100644
>> --- a/Documentation/git-gc.txt
>> +++ b/Documentation/git-gc.txt
>> @@ -54,6 +54,11 @@ other housekeeping tasks (e.g. rerere, working trees, reflog...) will
>>  be performed as well.
>>
>>
>> +--cruft::
>> +	When expiring unreachable objects, pack them separately into a
>> +	cruft pack instead of storing the loose objects as loose
>> +	objects.
>
> The last part looks tautological.  How about:
>
> --- >8 ---
> Subject: [PATCH] gc: simplify --cruft description
>
> Remove duplicate "loose objects".
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---

Sounds good.  Will apply.

Thanks.

>  Documentation/git-gc.txt | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index ba4e67700e..0af7540a0c 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -56,8 +56,7 @@ be performed as well.
>
>  --cruft::
>  	When expiring unreachable objects, pack them separately into a
> -	cruft pack instead of storing the loose objects as loose
> -	objects.
> +	cruft pack instead of storing them as loose objects.
>
>  --prune=<date>::
>  	Prune loose objects older than date (default is 2 weeks ago,
> --
> 2.36.1

* Re: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files
  2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
  2022-05-24 19:32     ` Jonathan Nieder
  2022-05-25 23:02     ` Taylor Blau
@ 2023-06-01 13:01     ` Andreas Schwab
  2 siblings, 0 replies; 201+ messages in thread
From: Andreas Schwab @ 2023-06-01 13:01 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, avarab, derrickstolee, gitster, jrnieder, larsxschneider,
	tytso

On May 20 2022, Taylor Blau wrote:

> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 6d3efb7d16..b520aa9c45 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -294,6 +294,25 @@ Pack file entry: <+
>  
>  All 4-byte numbers are in network order.
>  
> +== pack-*.mtimes files have the format:
> +
> +All 4-byte numbers are in network byte order.
> +
> +  - A 4-byte magic number '0x4d544d45' ('MTME').

This is identified by file(1) as "Multitracker Version 4.05". ;-)

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

end of thread, other threads:[~2023-06-01 13:01 UTC | newest]

Thread overview: 201+ messages
2021-11-29 22:25 [PATCH 00/17] cruft packs Taylor Blau
2021-11-29 22:25 ` [PATCH 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
2021-12-02 14:33   ` Derrick Stolee
2021-12-03 21:53     ` Taylor Blau
2021-12-04 22:20   ` Elijah Newren
2021-12-04 23:32     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2021-12-02 15:06   ` Derrick Stolee
2021-12-02 22:32     ` brian m. carlson
2021-12-03 22:24     ` Taylor Blau
2022-01-07 19:41       ` Taylor Blau
2021-11-29 22:25 ` [PATCH 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
2021-11-29 22:25 ` [PATCH 04/17] chunk-format.h: extract oid_version() Taylor Blau
2021-12-02 15:22   ` Derrick Stolee
2021-12-03 22:40     ` Taylor Blau
2021-12-06 17:33       ` Derrick Stolee
2021-11-29 22:25 ` [PATCH 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
2021-12-02 15:36   ` Derrick Stolee
2021-12-03 23:04     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
2021-12-06 21:16   ` Derrick Stolee
2022-02-23 22:24     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
2021-11-29 22:25 ` [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
2021-12-06 21:44   ` Derrick Stolee
2022-03-01  2:48     ` Taylor Blau
2021-12-07 15:17   ` Derrick Stolee
2022-02-23 23:34     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
2021-11-29 22:25 ` [PATCH 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
2021-11-29 22:25 ` [PATCH 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
2021-12-07 15:30   ` Derrick Stolee
2022-02-23 23:35     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
2021-12-05 20:46   ` Junio C Hamano
2022-03-01  2:00     ` Taylor Blau
2021-12-07 15:38   ` Derrick Stolee
2022-02-23 23:37     ` Taylor Blau
2021-11-29 22:25 ` [PATCH 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
2021-11-29 22:25 ` [PATCH 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
2021-11-29 22:25 ` [PATCH 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
2021-11-29 22:25 ` [PATCH 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
2021-11-29 22:25 ` [PATCH 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
2021-12-03 19:51 ` [PATCH 00/17] " Junio C Hamano
2021-12-03 20:08   ` Taylor Blau
2021-12-03 20:47     ` Taylor Blau
2022-03-02  0:57 ` [PATCH v2 " Taylor Blau
2022-03-02  0:58   ` [PATCH v2 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
2022-03-02  0:58   ` [PATCH v2 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2022-03-02 20:22     ` Derrick Stolee
2022-03-02 21:33       ` Taylor Blau
2022-03-02  0:58   ` [PATCH v2 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
2022-03-02  0:58   ` [PATCH v2 04/17] chunk-format.h: extract oid_version() Taylor Blau
2022-03-02  0:58   ` [PATCH v2 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
2022-03-02  0:58   ` [PATCH v2 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
2022-03-02  0:58   ` [PATCH v2 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
2022-03-02  0:58   ` [PATCH v2 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
2022-03-02  0:58   ` [PATCH v2 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
2022-03-02 20:19     ` Derrick Stolee
2022-03-02 21:28       ` Taylor Blau
2022-03-02  0:58   ` [PATCH v2 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
2022-03-02  0:58   ` [PATCH v2 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
2022-03-02  7:42     ` Junio C Hamano
2022-03-02 15:54       ` Taylor Blau
2022-03-02 19:57         ` Derrick Stolee
2022-03-02  0:58   ` [PATCH v2 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
2022-03-02  0:58   ` [PATCH v2 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
2022-03-02  0:58   ` [PATCH v2 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
2022-03-02  0:58   ` [PATCH v2 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
2022-03-02  0:58   ` [PATCH v2 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
2022-03-02  0:58   ` [PATCH v2 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
2022-03-02 20:23   ` [PATCH v2 00/17] " Derrick Stolee
2022-03-02 21:36     ` Taylor Blau
2022-03-03  0:20 ` [PATCH v3 " Taylor Blau
2022-03-03  0:20   ` [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
2022-03-07 18:03     ` Jonathan Nieder
2022-03-22  1:16       ` Taylor Blau
2022-03-22 21:45         ` Jonathan Nieder
2022-03-22 22:02           ` Taylor Blau
2022-03-22 23:04             ` Jonathan Nieder
2022-03-23  1:01               ` Taylor Blau
2022-03-28 18:46                 ` Taylor Blau
2022-03-28 20:55                   ` Junio C Hamano
2022-03-28 21:21                     ` Taylor Blau
2022-03-29 15:59                       ` Junio C Hamano
2022-03-30  2:23                         ` Taylor Blau
2022-03-30 13:37                           ` Junio C Hamano
2022-03-30 17:30                             ` Taylor Blau
2022-03-03  0:20   ` [PATCH v3 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2022-03-03  0:20   ` [PATCH v3 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
2022-03-03  0:20   ` [PATCH v3 04/17] chunk-format.h: extract oid_version() Taylor Blau
2022-03-03 16:30     ` Ævar Arnfjörð Bjarmason
2022-03-03 23:32       ` Taylor Blau
2022-03-04  0:16         ` Junio C Hamano
2022-03-03  0:20   ` [PATCH v3 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
2022-03-03 16:45     ` Ævar Arnfjörð Bjarmason
2022-03-03 23:35       ` Taylor Blau
2022-03-04 10:40         ` Ævar Arnfjörð Bjarmason
2022-03-03  0:20   ` [PATCH v3 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
2022-03-03  0:21   ` [PATCH v3 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
2022-03-03  0:21   ` [PATCH v3 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
2022-03-03  0:21   ` [PATCH v3 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
2022-03-03  0:21   ` [PATCH v3 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
2022-03-03  0:21   ` [PATCH v3 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
2022-03-03  0:21   ` [PATCH v3 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
2022-03-03  0:21   ` [PATCH v3 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
2022-03-03  0:21   ` [PATCH v3 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
2022-03-03  0:21   ` [PATCH v3 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
2022-03-03  0:21   ` [PATCH v3 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
2022-03-03  0:21   ` [PATCH v3 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
2022-03-03  1:29   ` [PATCH v3 00/17] " Derrick Stolee
2022-05-18 23:10 ` [PATCH v4 " Taylor Blau
2022-05-18 23:10   ` [PATCH v4 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
2022-05-19 14:04     ` Junio C Hamano
2022-05-18 23:10   ` [PATCH v4 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2022-05-19 10:40     ` Ævar Arnfjörð Bjarmason
2022-05-19 15:21       ` Junio C Hamano
2022-05-20  7:32         ` Ævar Arnfjörð Bjarmason
2022-05-20 22:37           ` Taylor Blau
2022-05-18 23:10   ` [PATCH v4 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
2022-05-18 23:11   ` [PATCH v4 04/17] chunk-format.h: extract oid_version() Taylor Blau
2022-05-19 11:44     ` Ævar Arnfjörð Bjarmason
2022-05-18 23:11   ` [PATCH v4 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
2022-05-18 23:11   ` [PATCH v4 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
2022-05-18 23:11   ` [PATCH v4 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
2022-05-18 23:11   ` [PATCH v4 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
2022-05-19 10:04     ` Junio C Hamano
2022-05-19 15:16       ` Junio C Hamano
2022-05-20 22:52         ` Taylor Blau
2022-05-18 23:11   ` [PATCH v4 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
2022-05-18 23:11   ` [PATCH v4 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
2022-05-18 23:11   ` [PATCH v4 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
2022-05-18 23:11   ` [PATCH v4 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
2022-05-19 11:29     ` Ævar Arnfjörð Bjarmason
2022-05-20 22:39       ` Taylor Blau
2022-05-18 23:11   ` [PATCH v4 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
2022-05-18 23:11   ` [PATCH v4 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
2022-05-18 23:11   ` [PATCH v4 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
2022-05-19 11:32     ` Ævar Arnfjörð Bjarmason
2022-05-20 22:42       ` Taylor Blau
2022-05-18 23:11   ` [PATCH v4 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
2022-05-18 23:11   ` [PATCH v4 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
2022-05-18 23:48   ` [PATCH v4 00/17] " Derrick Stolee
2022-05-20 23:19     ` Junio C Hamano
2022-05-20 23:30       ` Taylor Blau
2022-05-19 11:42   ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Ævar Arnfjörð Bjarmason
2022-05-19 11:42     ` [RFC PATCH 1/2] packfile API: add and use a pack_name_to_ext() utility function Ævar Arnfjörð Bjarmason
2022-05-19 15:40       ` Junio C Hamano
2022-05-19 11:42     ` [RFC PATCH 2/2] hash API: add and use a hash_short_id_by_algo() function Ævar Arnfjörð Bjarmason
2022-05-19 15:50       ` Junio C Hamano
2022-05-19 19:07         ` Ævar Arnfjörð Bjarmason
2022-05-19 15:31     ` [RFC PATCH 0/2] Utility functions for duplicated pack(write) code Junio C Hamano
2022-05-19 11:54   ` [PATCH v4 00/17] cruft packs Ævar Arnfjörð Bjarmason
2022-05-20 23:17 ` [PATCH v5 " Taylor Blau
2022-05-20 23:17   ` [PATCH v5 01/17] Documentation/technical: add cruft-packs.txt Taylor Blau
2022-05-20 23:17   ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2022-05-24 19:32     ` Jonathan Nieder
2022-05-24 19:44       ` rsbecker
2022-05-24 22:25         ` Taylor Blau
2022-05-24 23:24           ` rsbecker
2022-05-25  0:07             ` Taylor Blau
2022-05-25  0:20               ` rsbecker
2022-05-25  9:11               ` adding new 32-bit on-disk (unsigned) timestamp formats (was: [PATCH v5 02/17] pack-mtimes: support reading .mtimes files) Ævar Arnfjörð Bjarmason
2022-05-25 13:30                 ` Derrick Stolee
2022-05-25 21:13                   ` Taylor Blau
2022-05-26  0:02                     ` Ævar Arnfjörð Bjarmason
2022-05-26  0:12                       ` Taylor Blau
2022-05-24 22:21       ` [PATCH v5 02/17] pack-mtimes: support reading .mtimes files Taylor Blau
2022-05-25  7:48         ` Jonathan Nieder
2022-05-25 21:36           ` Taylor Blau
2022-05-25 21:58             ` rsbecker
2022-05-25 22:59               ` Taylor Blau
2022-05-25 23:02     ` Taylor Blau
2022-05-26  0:30       ` Junio C Hamano
2023-06-01 13:01     ` Andreas Schwab
2022-05-20 23:17   ` [PATCH v5 03/17] pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' Taylor Blau
2022-05-20 23:17   ` [PATCH v5 04/17] chunk-format.h: extract oid_version() Taylor Blau
2022-05-20 23:17   ` [PATCH v5 05/17] pack-mtimes: support writing pack .mtimes files Taylor Blau
2022-05-20 23:17   ` [PATCH v5 06/17] t/helper: add 'pack-mtimes' test-tool Taylor Blau
2022-05-20 23:17   ` [PATCH v5 07/17] builtin/pack-objects.c: return from create_object_entry() Taylor Blau
2022-05-20 23:17   ` [PATCH v5 08/17] builtin/pack-objects.c: --cruft without expiration Taylor Blau
2022-05-20 23:17   ` [PATCH v5 09/17] reachable: add options to add_unseen_recent_objects_to_traversal Taylor Blau
2022-05-20 23:17   ` [PATCH v5 10/17] reachable: report precise timestamps from objects in cruft packs Taylor Blau
2022-05-20 23:18   ` [PATCH v5 11/17] builtin/pack-objects.c: --cruft with expiration Taylor Blau
2022-05-20 23:18   ` [PATCH v5 12/17] builtin/repack.c: support generating a cruft pack Taylor Blau
2022-05-20 23:18   ` [PATCH v5 13/17] builtin/repack.c: allow configuring cruft pack generation Taylor Blau
2022-05-20 23:18   ` [PATCH v5 14/17] builtin/repack.c: use named flags for existing_packs Taylor Blau
2022-05-20 23:18   ` [PATCH v5 15/17] builtin/repack.c: add cruft packs to MIDX during geometric repack Taylor Blau
2022-05-20 23:18   ` [PATCH v5 16/17] builtin/gc.c: conditionally avoid pruning objects via loose Taylor Blau
2022-06-19  5:38     ` René Scharfe
2022-06-21 15:58       ` Junio C Hamano
2022-05-20 23:18   ` [PATCH v5 17/17] sha1-file.c: don't freshen cruft packs Taylor Blau
2022-05-21 11:17   ` [PATCH v5 00/17] " Ævar Arnfjörð Bjarmason
2022-05-24 19:39     ` Jonathan Nieder
2022-05-24 21:50       ` Taylor Blau
2022-05-24 21:55         ` Ævar Arnfjörð Bjarmason
2022-05-24 22:12           ` Taylor Blau
2022-05-25  7:53             ` Jonathan Nieder
2022-05-25 19:59               ` Derrick Stolee
2022-05-25 21:09                 ` Taylor Blau
2022-05-26  0:06                   ` Ævar Arnfjörð Bjarmason
