* [WIP 0/2] Modifying pack-objects to support --blob-size-limit
@ 2017-06-02  0:14 Jonathan Tan
  2017-06-02  0:14 ` [WIP 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02  0:14 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, git

I took a look at modifying builtin/pack-objects.c to support excluding
oversized blobs and reporting the exclusions that it has performed.
Here's a work in progress - it might be useful to others who are
working on a similar feature and/or are modifying pack-objects for
something else.

The way I obtain a blob's size seems inefficient and redundant with
check_object() - if you have a suggestion for improving this, I'm
interested.

This is similar to [1] except that it reports blobs and their sizes
(after the packfile is printed), and is slightly more comprehensive in
that the read_object_list_from_stdin() codepath is also covered in
addition to the get_object_list() codepath. (Although, to be clear,
upload-pack always passes "--revs" and thus only needs the
get_object_list() codepath.)

[1] https://public-inbox.org/git/1488994685-37403-3-git-send-email-jeffhost@microsoft.com/
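
To make the output shape concrete: after the packfile, pack-objects
prints an 8-byte count of excluded blobs, then, for each excluded blob
in hash order, its 20-byte hash and 8-byte size, all in network byte
order (see patch 2 for details). A minimal sketch of what a reader of
this trailer might look like - error handling omitted, and the reading
side is not part of this series:

	uint64_t count_be, size_be, i;
	unsigned char sha1[20];

	/* fd has already been drained of the packfile itself */
	read_in_full(fd, &count_be, sizeof(count_be));
	for (i = 0; i < ntohll(count_be); i++) {
		read_in_full(fd, sha1, sizeof(sha1));	/* excluded blob */
		read_in_full(fd, &size_be, sizeof(size_be));
		/* its size is ntohll(size_be) */
	}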

Jonathan Tan (2):
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-size-limit

 Documentation/git-pack-objects.txt |  19 ++++-
 builtin/pack-objects.c             | 163 ++++++++++++++++++++++++++++++-------
 t/t5300-pack-object.sh             |  71 ++++++++++++++++
 3 files changed, 223 insertions(+), 30 deletions(-)

-- 
2.13.0.506.g27d5fe0cd-goog



* [WIP 1/2] pack-objects: rename want_.* to ignore_.*
  2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
@ 2017-06-02  0:14 ` Jonathan Tan
  2017-06-02  3:45   ` Junio C Hamano
  2017-06-02  0:14 ` [WIP 2/2] pack-objects: support --blob-size-limit Jonathan Tan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02  0:14 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, git

Currently, pack-objects conflates the concepts of "ignoring an object"
and "including it in to_pack". This is fine for now, but a subsequent
commit will introduce the concept of an object that cannot be completely
ignored, but should not be included in to_pack either. To separate these
concepts, restrict want_found_object() and want_object_in_pack() to only
indicate if the object is to be ignored. This is done by renaming these
methods and swapping the meanings of the return values 0 and 1.

We also take the opportunity to use the terminology "preferred_base"
instead of "excluded" in these methods. It is true that preferred bases
are not included in the final packfile generation, but at this point in
the code, there is no exclusion taking place - on the contrary, if
something is "excluded", it is in fact guaranteed to be in to_pack.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80439047a..1062d8fe2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -946,12 +946,12 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int ignore_found_object(int preferred_base, struct packed_git *p)
 {
-	if (exclude)
-		return 1;
-	if (incremental)
+	if (preferred_base)
 		return 0;
+	if (incremental)
+		return 1;
 
 	/*
 	 * When asked to do --local (do not include an object that appears in a
@@ -969,19 +969,19 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 */
 	if (!ignore_packed_keep &&
 	    (!local || !have_non_local_packs))
-		return 1;
+		return 0;
 
 	if (local && !p->pack_local)
-		return 0;
+		return 1;
 	if (ignore_packed_keep && p->pack_local && p->pack_keep)
-		return 0;
+		return 1;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
 /*
- * Check whether we want the object in the pack (e.g., we do not want
+ * Check whether we should ignore this object (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
  * If the caller already knows an existing pack it wants to take the object
@@ -989,16 +989,16 @@ static int want_found_object(int exclude, struct packed_git *p)
  * function finds if there is any pack that has the object and returns the pack
  * and its offset in these variables.
  */
-static int want_object_in_pack(const unsigned char *sha1,
-			       int exclude,
-			       struct packed_git **found_pack,
-			       off_t *found_offset)
+static int ignore_object(const unsigned char *sha1,
+			 int preferred_base,
+			 struct packed_git **found_pack,
+			 off_t *found_offset)
 {
 	struct mru_entry *entry;
-	int want;
+	int ignore;
 
-	if (!exclude && local && has_loose_object_nonlocal(sha1))
-		return 0;
+	if (!preferred_base && local && has_loose_object_nonlocal(sha1))
+		return 1;
 
 	/*
 	 * If we already know the pack object lives in, start checks from that
@@ -1006,9 +1006,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
-		if (want != -1)
-			return want;
+		ignore = ignore_found_object(preferred_base, *found_pack);
+		if (ignore != -1)
+			return ignore;
 	}
 
 	for (entry = packed_git_mru->head; entry; entry = entry->next) {
@@ -1027,15 +1027,15 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
+			ignore = ignore_found_object(preferred_base, p);
+			if (!preferred_base && ignore > 0)
 				mru_mark(packed_git_mru, entry);
-			if (want != -1)
-				return want;
+			if (ignore != -1)
+				return ignore;
 		}
 	}
 
-	return 1;
+	return 0;
 }
 
 static void create_object_entry(const unsigned char *sha1,
@@ -1079,7 +1079,7 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 	if (have_duplicate_entry(sha1, exclude, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, exclude, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, exclude, &found_pack, &found_offset)) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -1106,7 +1106,7 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+	if (ignore_object(sha1, 0, &pack, &offset))
 		return 0;
 
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
-- 
2.13.0.506.g27d5fe0cd-goog



* [WIP 2/2] pack-objects: support --blob-size-limit
  2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
  2017-06-02  0:14 ` [WIP 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
@ 2017-06-02  0:14 ` Jonathan Tan
  2017-06-02  4:48   ` Junio C Hamano
  2017-06-02 19:38 ` [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes Jonathan Tan
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02  0:14 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, git

As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-size-limit, packing only
blobs below that size limit unless the blob corresponds to a file whose
name starts with ".git". upload-pack will eventually be taught to use
this new parameter if needed to exclude certain blobs during a fetch or
clone, potentially drastically reducing network consumption when serving
these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
preventing such oversized blobs from ever going into the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.) Any such exclusions are recorded and sent after
the packfile.

I have taken the opportunity to rename the "exclude" parameter to
"preferred_base" to avoid confusion between the concept of "excluding an
object from to_pack" and "including an object in to_pack which will be
excluded in the final pack generation".

The other method in which objects are added to "to_pack",
add_object_entry_from_bitmap(), has not been modified - thus packing in
the presence of bitmaps still packs all blobs regardless of size. See
the documentation update in this commit for the rationale.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-pack-objects.txt |  19 +++++-
 builtin/pack-objects.c             | 115 +++++++++++++++++++++++++++++++++++--
 t/t5300-pack-object.sh             |  71 +++++++++++++++++++++++
 3 files changed, 199 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 8973510a4..c4257cacc 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--blob-size-limit=<n>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
 
 
@@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	With this option, parents that are hidden by grafts are packed
 	nevertheless.
 
+--blob-size-limit=<n>::
+	This option can only be used with --stdout. If specified, a blob
+	of at least this size will not be packed unless a to-be-packed
+	tree has that blob with a filename beginning with ".git".
+	The size can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print the
+count of excluded blobs (8 bytes). Subsequently, for each excluded blob
+in hash order, pack-objects will print its hash (20 bytes) and its size
+(8 bytes). All numbers are in network byte order.
++
+If pack-objects cannot efficiently determine, for an arbitrary blob,
+that no to-be-packed tree has that blob with a filename beginning with
+".git" (for example, if bitmaps are used to calculate the objects to be
+packed), it will pack all blobs regardless of size.
+
 SEE ALSO
 --------
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1062d8fe2..aaf7e92b0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_size_limit = -1;
+
 /*
  * stats
  */
@@ -776,6 +778,50 @@ static const char no_split_warning[] = N_(
 "disabling bitmap writing, packs are split due to pack.packSizeLimit"
 );
 
+struct oversized_blob {
+	struct hashmap_entry entry;
+	struct object_id oid;
+	unsigned long size;
+};
+struct hashmap oversized_blobs;
+
+static int oversized_blob_cmp_fn(const void *b1, const void *b2,
+				 const void *keydata)
+{
+	const struct object_id *oid2 = keydata ? keydata : 
+		&((const struct oversized_blob *) b2)->oid;
+	return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
+}
+
+static int oversized_blob_cmp(const void *b1, const void *b2)
+{
+	return oidcmp(&(*(const struct oversized_blob **) b1)->oid,
+		      &(*(const struct oversized_blob **) b2)->oid);
+}
+
+static void write_oversized_info(void)
+{
+	struct oversized_blob **obs = xmalloc(oversized_blobs.size *
+					      sizeof(*obs));
+	struct hashmap_iter iter;
+	struct oversized_blob *ob;
+	int i = 0;
+	uint64_t size_network;
+
+	for (ob = hashmap_iter_first(&oversized_blobs, &iter);
+	     ob;
+	     ob = hashmap_iter_next(&iter))
+		obs[i++] = ob;
+	QSORT(obs, oversized_blobs.size, oversized_blob_cmp);
+	size_network = htonll(oversized_blobs.size);
+	write_in_full(1, &size_network, sizeof(size_network));
+	for (i = 0; i < oversized_blobs.size; i++) {
+		write_in_full(1, &obs[i]->oid, sizeof(obs[i]->oid));
+		size_network = htonll(obs[i]->size);
+		write_in_full(1, &size_network, sizeof(size_network));
+	}
+}
+
 static void write_pack_file(void)
 {
 	uint32_t i = 0, j;
@@ -822,7 +868,11 @@ static void write_pack_file(void)
 		 * If so, rewrite it like in fast-import
 		 */
 		if (pack_to_stdout) {
-			sha1close(f, sha1, CSUM_CLOSE);
+			sha1close(f, sha1, 0);
+			write_in_full(1, sha1, sizeof(sha1));
+			if (blob_size_limit >= 0) {
+				write_oversized_info();
+			}
 		} else if (nr_written == nr_remaining) {
 			sha1close(f, sha1, CSUM_FSYNC);
 		} else {
@@ -1069,17 +1119,58 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+/*
+ * Returns 1 and records this blob in oversized_blobs if the given blob does
+ * not meet any defined blob size limits.  Blobs that appear as a tree entry
+ * whose basename begins with ".git" are never considered oversized.
+ *
+ * If the tree entry corresponding to the given blob is unknown, pass NULL as
+ * name. In this case, this function will always return 0 to avoid false
+ * positives.
+ */
+static int oversized(const unsigned char *sha1, const char *name) {
+	const char *last_slash, *basename;
+	unsigned long size;
+	unsigned int hash;
+
+	if (blob_size_limit < 0 || !name)
+		return 0;
+
+	last_slash = strrchr(name, '/');
+	basename = last_slash ? last_slash + 1 : name;
+	if (starts_with(basename, ".git"))
+		return 0;
+
+	sha1_object_info(sha1, &size);
+	if (size < blob_size_limit)
+		return 0;
+	
+	if (!oversized_blobs.tablesize)
+		hashmap_init(&oversized_blobs, oversized_blob_cmp_fn, 0);
+	hash = sha1hash(sha1);
+	if (!hashmap_get_from_hash(&oversized_blobs, hash, sha1)) {
+		struct oversized_blob *ob = xmalloc(sizeof(*ob));
+		hashmap_entry_init(ob, hash);
+		hashcpy(ob->oid.hash, sha1);
+		ob->size = size;
+		hashmap_add(&oversized_blobs, ob);
+	}
+	return 1;
+}
+
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
-			    const char *name, int exclude)
+			    const char *name, int preferred_base)
 {
 	struct packed_git *found_pack = NULL;
 	off_t found_offset = 0;
 	uint32_t index_pos;
+	struct hashmap_entry entry;
 
-	if (have_duplicate_entry(sha1, exclude, &index_pos))
+	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
 		return 0;
 
-	if (ignore_object(sha1, exclude, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset) ||
+	    (!preferred_base && type == OBJ_BLOB && oversized(sha1, name))) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -1088,8 +1179,17 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 		return 0;
 	}
 
+	/*
+	 * Ensure that we do not report this blob as oversized if we are going
+	 * to use it, be it in the returned pack or as a preferred base
+	 */
+	if (oversized_blobs.tablesize) {
+		hashmap_entry_init(&entry, sha1hash(sha1));
+		free(hashmap_remove(&oversized_blobs, &entry, sha1));
+	}
+
 	create_object_entry(sha1, type, pack_name_hash(name),
-			    exclude, name && no_try_delta(name),
+			    preferred_base, name && no_try_delta(name),
 			    index_pos, found_pack, found_offset);
 
 	display_progress(progress_state, nr_result);
@@ -2947,6 +3047,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+		OPT_MAGNITUDE(0, "blob-size-limit", &blob_size_limit,
+			      N_("exclude and report blobs not smaller than this limit")),
 		OPT_END(),
 	};
 
@@ -3022,6 +3124,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		die("--keep-unreachable and --unpack-unreachable are incompatible.");
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
+	
+	if (blob_size_limit >= 0 && !pack_to_stdout)
+		die("if --blob-size-limit is specified, --stdout must also be specified");
 
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 43a672c34..d12280cbf 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -421,6 +421,77 @@ test_expect_success 'index-pack <pack> works in non-repo' '
 	test_path_is_file foo.idx
 '
 
+unpack() {
+	perl -e '$/ = undef; $input = <>; print unpack("H2" x length($input), $input)'
+}
+
+rcut() {
+	perl -e '$/ = undef; $_ = <>; s/.{'$1'}$//s; print $_'
+}
+
+test_expect_success '--blob-size-limit works with multiple excluded' '
+	rm -rf server &&
+	git init server &&
+	printf a > server/a &&
+	printf b > server/b &&
+	printf c-very-long-file > server/c &&
+	printf d-very-long-file > server/d &&
+	git -C server add a b c d &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-size-limit=10 <objects >out &&
+	unpack <out >actual &&
+
+	# Check that the excluded hashes are included, sorted, at the end
+	if [ $(git hash-object server/c) \< $(git hash-object server/d) ]
+	then
+		printf "0000000000000002%s0000000000000010%s0000000000000010$" \
+			$(git hash-object server/c) \
+			$(git hash-object server/d) >expect
+	else
+		printf "0000000000000002%s0000000000000010%s0000000000000010$" \
+			$(git hash-object server/d) \
+			$(git hash-object server/c) >expect
+	fi &&
+	grep $(cat expect) actual &&
+
+	# Ensure that only the small blobs are in the packfile
+	rcut 64 <out >my.pack &&
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist &&
+	grep $(git hash-object server/b) objectlist &&
+	! grep $(git hash-object server/c) objectlist &&
+	! grep $(git hash-object server/d) objectlist
+'
+
+test_expect_success '--blob-size-limit never excludes special files' '
+	rm -rf server &&
+	git init server &&
+	printf a-very-long-file > server/a &&
+	printf a-very-long-file > server/.git-a &&
+	printf b-very-long-file > server/b &&
+	git -C server add a .git-a b &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-size-limit=10 <objects >out &&
+	unpack <out >actual &&
+
+	# Check that the excluded hash is included at the end
+	printf "0000000000000001%s0000000000000010$" \
+		$(git hash-object server/b) >expect &&
+	grep $(cat expect) actual &&
+
+	# Ensure that the .git-a blob is in the packfile, despite also
+	# appearing as a non-.git file
+	rcut 36 <out >my.pack &&
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist
+'
+
 #
 # WARNING!
 #
-- 
2.13.0.506.g27d5fe0cd-goog



* Re: [WIP 1/2] pack-objects: rename want_.* to ignore_.*
  2017-06-02  0:14 ` [WIP 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
@ 2017-06-02  3:45   ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2017-06-02  3:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, git

Jonathan Tan <jonathantanmy@google.com> writes:

> Currently, pack-objects conflates the concepts of "ignoring an object"
> and "including it in to_pack".

Hmph, that statement is a bit hard to read and agree to.  I thought an
ignored object that is not going to be packed is one that won't hit
to_pack?  

I agree that "including to to_pack" and "actually appearing in the
resulting pack" are three different things, though.  Preferred base
objects that are used when constructing a thin pack are thrown into
to_pack because they need to participate in the delta base selection
computation, but by definition they shouldn't be contained in the
resulting pack (hence, nr_result is not incremented for them).

> This is fine for now, but a subsequent
> commit will introduce the concept of an object that cannot be completely
> ignored, but should not be included in to_pack either. To separate these
> concepts, restrict want_found_object() and want_object_in_pack() to only
> indicate if the object is to be ignored. This is done by renaming these
> methods and swapping the meanings of the return values 0 and 1.

I am a bit confused by your reasoning.  I guess it would become
clearer if I knew exactly what you meant by "ignoring".  It is not
like "pretend as if it didn't exist in the rev-list --objects output
we are working off of".

> We also take the opportunity to use the terminology "preferred_base"
> instead of "excluded" in these methods. It is true that preferred bases
> are not included in the final packfile generation, but at this point in
> the code, there is no exclusion taking place - on the contrary, if
> something is "excluded", it is in fact guaranteed to be in to_pack.

This one I can understand.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/pack-objects.c | 50 +++++++++++++++++++++++++-------------------------
>  1 file changed, 25 insertions(+), 25 deletions(-)

Without understanding why this change is desirable, I can tell that
overall this does not change the behaviour, which is good ;-).



* Re: [WIP 2/2] pack-objects: support --blob-size-limit
  2017-06-02  0:14 ` [WIP 2/2] pack-objects: support --blob-size-limit Jonathan Tan
@ 2017-06-02  4:48   ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2017-06-02  4:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, git

Jonathan Tan <jonathantanmy@google.com> writes:

> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index 8973510a4..c4257cacc 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -12,7 +12,8 @@ SYNOPSIS
>  'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
>  	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
>  	[--local] [--incremental] [--window=<n>] [--depth=<n>]
> -	[--revs [--unpacked | --all]] [--stdout | base-name]
> +	[--revs [--unpacked | --all]]
> +	[--stdout [--blob-size-limit=<n>] | base-name]
>  	[--shallow] [--keep-true-parents] < object-list
>  
>  
> @@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
>  	With this option, parents that are hidden by grafts are packed
>  	nevertheless.
>  
> +--blob-size-limit=<n>::
> +	This option can only be used with --stdout. If specified, a blob
> +	of at least this size will not be packed unless a to-be-packed
> +	tree has that blob with a filename beginning with ".git".
> +	The size can be suffixed with "k", "m", or "g", and may be "0".

It would be nice if the name of the parameter hinted that this is
counted in bytes, e.g. "--blob-max-bytes"; or at least s/<n>/<bytes>/.

> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.

Do we need to future-proof the output format so that we can later
use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
is hexadecimal text, and it may not be so bad to make this also
text, e.g. "<hash> SP <length> LF".  That way, you do not have to
worry about byte order, hash size, or length limited to uint64.
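
For illustration, each record would then be one line of text like

    <40-hex-object-name> SP <size-in-decimal> LF

e.g. "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0" for an empty blob.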

Right now, the readers of a pack stream like unpack-objects and
index-pack know to copy any "garbage" that follows a pack stream to
their standard output, so older readers piped downstream of this new
pack-objects will simply accept the (somewhat incomplete) pack and
ignore this trailer, which is probably what we want to happen.

> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.
> +

This is a bit hard to understand, at least to me.  Let me step-wise
deconstruct what I think you are saying.

 - A blob can appear in multiple trees, and each tree has a name
   (i.e. pathname component) for the blob.

 - Sometimes we may know _all_ trees that contain a particular
   blob and hence _all_ names these trees call this blob.  In such a
   case, we can _prove_ that the blob never is used as ".git<something>"
   in any tree.

 - We exclude a blob that is larger than the specified limit, but
   only when we can prove the above.  If it is possible that the
   blob might appear as ".git<something>" in some tree, the blob is
   not omitted no matter its size.

And this is a very conservative way to avoid omitting something that
could be one of those control files like ".gitignore".

Am I following your thought correctly?


> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 1062d8fe2..aaf7e92b0 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
>  
>  static unsigned long window_memory_limit = 0;
>  
> +static long blob_size_limit = -1;
> +

Which is the largest, long, ssize_t, or off_t?  Perhaps intmax_t
but it may be overkill?

In any case, not within the scope of this series; we assume that an
individual object's size fits in ulong around here.

> @@ -776,6 +778,50 @@ static const char no_split_warning[] = N_(
>  "disabling bitmap writing, packs are split due to pack.packSizeLimit"
>  );
>  
> +struct oversized_blob {
> +	struct hashmap_entry entry;
> +	struct object_id oid;
> +	unsigned long size;
> +};
> +struct hashmap oversized_blobs;
> +
> +static int oversized_blob_cmp_fn(const void *b1, const void *b2,
> +				 const void *keydata)
> +{
> +	const struct object_id *oid2 = keydata ? keydata : 
> +		&((const struct oversized_blob *) b2)->oid;
> +	return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
> +}
> +
> +static int oversized_blob_cmp(const void *b1, const void *b2)
> +{
> +	return oidcmp(&(*(const struct oversized_blob **) b1)->oid,
> +		      &(*(const struct oversized_blob **) b2)->oid);
> +}
> +
> +static void write_oversized_info(void)
> +{
> +	struct oversized_blob **obs = xmalloc(oversized_blobs.size *
> +					      sizeof(*obs));

Can this multiplication overflow (hint: st_mult)?
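
That is, perhaps something like:

	struct oversized_blob **obs =
		xmalloc(st_mult(oversized_blobs.size, sizeof(*obs)));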

> +	struct hashmap_iter iter;
> +	struct oversized_blob *ob;
> +	int i = 0;
> +	uint64_t size_network;
> +
> +	for (ob = hashmap_iter_first(&oversized_blobs, &iter);
> +	     ob;
> +	     ob = hashmap_iter_next(&iter))
> +		obs[i++] = ob;
> +	QSORT(obs, oversized_blobs.size, oversized_blob_cmp);

This sorting is a part of the external contract (i.e. "output in hash
order" is documented), but is it necessary?  Just wondering.

> +	size_network = htonll(oversized_blobs.size);
> +	write_in_full(1, &size_network, sizeof(size_network));
> +	for (i = 0; i < oversized_blobs.size; i++) {
> +		write_in_full(1, &obs[i]->oid, sizeof(obs[i]->oid));
> +		size_network = htonll(obs[i]->size);
> +		write_in_full(1, &size_network, sizeof(size_network));
> +	}
> +}

> @@ -822,7 +868,11 @@ static void write_pack_file(void)
>  		 * If so, rewrite it like in fast-import
>  		 */
>  		if (pack_to_stdout) {
> -			sha1close(f, sha1, CSUM_CLOSE);
> +			sha1close(f, sha1, 0);

You do want to keep the trailing data out of the checksum, but
CSUM_CLOSE would close the fd and that is why you pass 0 here.  OK.

> +			write_in_full(1, sha1, sizeof(sha1));

Even though we are in "pack_to_stdout" codepath, I'd prefer to
consistently use f->fd, not a hardcoded 1 here.  Of course,
sha1close() would have freed 'f' at this point, so we need to grab
the return value from the function to learn which fd is connected to
our original "stdout" at this point.

Likewise for write_oversized_info(), which probably should just take
"int fd" as parameter.

> +			if (blob_size_limit >= 0) {
> +				write_oversized_info();
> +			}

I do not necessarily think it is a bad idea to show this as trailing
data after the pack, but when we come to this function, do we have
enough information to make the "oversize" information appear before
the pack data if we wanted to?  I suspect it would be easier to
handle at the receiving end, after the capability exchange decides
to use this "omit oversized ones" extension, to receive this before
the packdata begins.  If it is at the end, the receiver needs to
spawn either unpack-objects or index-pack to cause it to write to
the disk, but when it does so, it has to have a FD open to read from
their standard output so that the receiver can grab the "remainder"
that these programs did not process.  "Write the pack stream data to
these programs up to the end of the pack, and process the remainder
ourselves" is much harder to arrange, I'd suspect.

>  		} else if (nr_written == nr_remaining) {
>  			sha1close(f, sha1, CSUM_FSYNC);
>  		} else {
> @@ -1069,17 +1119,58 @@ static const char no_closure_warning[] = N_(
>  "disabling bitmap writing, as some objects are not being packed"
>  );
>  
> +/*
> + * Returns 1 and records this blob in oversized_blobs if the given blob does
> + * not meet any defined blob size limits.  Blobs that appear as a tree entry
> + * whose basename begins with ".git" are never considered oversized.
> + *
> + * If the tree entry corresponding to the given blob is unknown, pass NULL as
> + * name. In this case, this function will always return 0 to avoid false
> + * positives.
> + */
> +static int oversized(const unsigned char *sha1, const char *name) {

Not object_id?

> +	const char *last_slash, *basename;
> +	unsigned long size;
> +	unsigned int hash;
> +
> +	if (blob_size_limit < 0 || !name)
> +		return 0;
> +
> +	last_slash = strrchr(name, '/');
> +	basename = last_slash ? last_slash + 1 : name;
> +	if (starts_with(basename, ".git"))
> +		return 0;

When we have two paths to a blob, and the blob is first fed to this
function with one name that does not begin with ".git", we would
register it to the hashmap.  And then the other name that begins
with ".git" is later discovered to point at the same blob, what
happens?  Would we need to unregister it from the hashmap elsewhere
in the code?

> +	sha1_object_info(sha1, &size);
> +	if (size < blob_size_limit)
> +		return 0;
> +	
> +	if (!oversized_blobs.tablesize)
> +		hashmap_init(&oversized_blobs, oversized_blob_cmp_fn, 0);
> +	hash = sha1hash(sha1);
> +	if (!hashmap_get_from_hash(&oversized_blobs, hash, sha1)) {
> +		struct oversized_blob *ob = xmalloc(sizeof(*ob));
> +		hashmap_entry_init(ob, hash);
> +		hashcpy(ob->oid.hash, sha1);
> +		ob->size = size;
> +		hashmap_add(&oversized_blobs, ob);
> +	}
> +	return 1;
> +}
> +
>  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> -			    const char *name, int exclude)
> +			    const char *name, int preferred_base)

This belongs to the previous step that renamed "exclude", no?

>  {
>  	struct packed_git *found_pack = NULL;
>  	off_t found_offset = 0;
>  	uint32_t index_pos;
> +	struct hashmap_entry entry;
>  
> -	if (have_duplicate_entry(sha1, exclude, &index_pos))
> +	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
>  		return 0;
>  
> -	if (ignore_object(sha1, exclude, &found_pack, &found_offset)) {
> +	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset) ||
> +	    (!preferred_base && type == OBJ_BLOB && oversized(sha1, name))) {
>  		/* The pack is missing an object, so it will not have closure */
>  		if (write_bitmap_index) {
>  			warning(_(no_closure_warning));
> @@ -1088,8 +1179,17 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
>  		return 0;
>  	}
>  
> +	/*
> +	 * Ensure that we do not report this blob as oversized if we are going
> +	 * to use it, be it in the returned pack or as a preferred base
> +	 */
> +	if (oversized_blobs.tablesize) {
> +		hashmap_entry_init(&entry, sha1hash(sha1));
> +		free(hashmap_remove(&oversized_blobs, &entry, sha1));
> +	}
> +

Ah, is this where the "unregistering" happens?

>  	create_object_entry(sha1, type, pack_name_hash(name),
> -			    exclude, name && no_try_delta(name),
> +			    preferred_base, name && no_try_delta(name),
>  			    index_pos, found_pack, found_offset);
>  
>  	display_progress(progress_state, nr_result);


* [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes
  2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
  2017-06-02  0:14 ` [WIP 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
  2017-06-02  0:14 ` [WIP 2/2] pack-objects: support --blob-size-limit Jonathan Tan
@ 2017-06-02 19:38 ` Jonathan Tan
  2017-06-02 22:16   ` Jeff King
  2017-06-02 19:38 ` [WIP v2 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
  2017-06-02 19:38 ` [WIP v2 2/2] pack-objects: support --blob-max-bytes Jonathan Tan
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02 19:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Here's a new version addressing Junio's comments.

> Hmph, that statement is a bit hard to read and agree to.  I thought an
> ignored object that is not going to be packed is one that won't hit
> to_pack?

That is true currently, but will not be the full truth once the 2nd
patch is applied.  I have expanded the commit message to explain the
plan in more detail.

> It would be nice if the name of the parameter hinted that this is
> counted in bytes, e.g. "--blob-max-bytes";

Done. Originally I wanted to avoid terms like "max" because it would
prevent the user from requesting that even 0-byte blobs not be sent,
but I realized that I probably shouldn't bother about that. :-)

> Do we need to future-proof the output format so that we can later
> use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> is hexadecimal text, and it may not be so bad to make this also
> text, e.g. "<hash> SP <length> LF".  That way, you do not have to
> worry about byte order, hash size, or length limited to uint64.

The reason for using binary is the client's convenience - it avoids
another conversion before storing the list to disk (and it also saves
network bandwidth). In a large repo, I think this list will be quite
large. I realized that I didn't mention anything about this in the
commit message, so I have added an explanation.

I think this is sufficiently future-proof in that the format of this
hash matches the format of the hashes used in the objects in the
packfile. As for object size being limited to uint64, I think the
benefits of the space savings (in using binary) outweigh the small risk
that our files will get larger than that before we upgrade our protocol
:-P
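
(As a ballpark, assuming SHA-1 hashes: a million excluded blobs cost
8 + 1,000,000 * 28 bytes, about 28 MB, in this binary format, whereas
"<40-hex-hash> SP <size> LF" text lines would be roughly 50 bytes
each, about 50 MB.)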

>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>> +
> This is a bit hard to understand, at least to me.  Let me step-wise
> deconstruct what I think you are saying.
> 
>  - A blob can appear in multiple trees, and each tree has name
>    (i.e. pathname component) for the blob.
> 
>  - Sometimes we may know _all_ trees that contain a particular
>    blob and hence _all_ names these trees call this blob.  In such a
>    case, we can _prove_ that the blob never is used as ".git<something>"
>    in any tree.
> 
>  - We exclude a blob that is larger than the specified limit, but
>    only when we can prove the above.  If it is possible that the
>    blob might appear as ".git<something>" in some tree, the blob is
>    not omitted no matter its size.
> 
> And this is a very conservative way to avoid omitting something that
> could be one of those control files like ".gitignore".
> 
> Am I following your thought correctly?

We only need to prove this for *to-be-packed* trees (the trees that
pack-objects is about to pack in the packfile(s) it is printing),
but otherwise, you are correct. I was trying to be thorough in saying
that "if we don't have bitmaps, we know if a blob's name starts with
'.git', so we can do filtering by size, but if we have bitmaps, we don't
have any relevant name information" - any rewording suggestions are
appreciated.

> Can this multiplication overflow (hint: st_mult)?

Thanks - used st_mult.

> This sorting is a part of external contract (i.e. "output in hash
> order" is documented), but is it necessary?  Just wondering.

It is convenient for the client because the client can then store the
list directly and binary-search it when accessing it. (Added a note
about convenience to the commit message.)
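
For illustration, here is a minimal sketch of such a lookup (the
helper name lookup_size is made up; hashcmp/ntohll are the usual
helpers), assuming the client stored the trailer verbatim and "buf"
points just past the 8-byte count at "nr" records of 20-byte hash
followed by 8-byte size:

	static int lookup_size(const unsigned char *buf, uint64_t nr,
			       const unsigned char *sha1, uint64_t *size)
	{
		uint64_t lo = 0, hi = nr;

		/* records are sorted in hash (i.e. memcmp) order */
		while (lo < hi) {
			uint64_t mi = lo + (hi - lo) / 2;
			const unsigned char *rec = buf + mi * 28;
			int cmp = hashcmp(sha1, rec);

			if (!cmp) {
				uint64_t size_be;
				memcpy(&size_be, rec + 20, sizeof(size_be));
				*size = ntohll(size_be);
				return 0;
			}
			if (cmp < 0)
				hi = mi;
			else
				lo = mi + 1;
		}
		return -1;	/* not in the excluded list */
	}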

> You do want to exclude the trailing thing out of the checksum, but
> CSUM_CLOSE would close the fd and that is why you pass 0 here.  OK.
>
> Even though we are in "pack_to_stdout" codepath, I'd prefer to
> consistently use f->fd, not a hardcoded 1 here.  Of course,
> sha1close() would have freed 'f' at this point, so we need to grab
> the return value from the function to learn which fd is connected to
> our original "stdout" at this point.
> 
> Likewise for write_oversized_info(), which probably should just take
> "int fd" as parameter.

Done for write_oversized_info(), although now that we are moving the
invocation of write_oversized_info() out of write_pack_file(), this
might not be so important. (For the rest, I have reverted the code
because we are now printing the excluded hashes before the packfile
instead of after.)

> I do not necessarily think it is a bad idea to show this as trailing
> data after the pack, but when we come to this function, do we have
> enough information to make the "oversize" information appear before
> the pack data if we wanted to?  I have a suspicion that it is easier
> to handle at the receiving end, after the capability exchange
> decides to use this "omit oversized ones" extension, to receive this
> before the packdata begins.  If it is at the end, the receiver needs
> to spawn either unpack-objects or index-objects to cause it write to
> the disk, but when it does so, it has to have a FD open to read from
> their standard output so that the receiver can grab the "remainder"
> that these programs did not process.  "Write to the end of the pack
> stream data to these programs, and process the remainder ourselves"
> is much harder to arrange, I'd suspect.

Yes, we can write at the beginning - done.

> > +/*
> > + * Returns 1 and records this blob in oversized_blobs if the given blob does
> > + * not meet any defined blob size limits.  Blobs that appear as a tree entry
> > + * whose basename begins with ".git" are never considered oversized.
> > + *
> > + * If the tree entry corresponding to the given blob is unknown, pass NULL as
> > + * name. In this case, this function will always return 0 to avoid false
> > + * positives.
> > + */
> > +static int oversized(const unsigned char *sha1, const char *name) {  
>
> Not object_id?

This would require a "(const struct object_id *) sha1" cast in its
caller that potentially violates alignment constraints, as far as I
know, so I used "const unsigned char *" for now. If the object_id
conversion doesn't include this file in the near future, I'll try to
send out a patch for this file.

> When we have two paths to a blob, and the blob is first fed to this
> function with one name that does not begin with ".git", we would
> register it to the hashmap.  And then the other name that begins
> with ".git" is later discovered to point at the same blob, what
> happens?  Would we need to unregister it from the hashmap elsewhere
> in the code?
[snip]
> Ah, is this where the "unregistering" happens?

Yes.

> >  static int add_object_entry(const unsigned char *sha1, enum object_type type,
> > -			    const char *name, int exclude)
> > +			    const char *name, int preferred_base)  
> 
> This belongs to the previous step that renamed "exclude", no?

Yes - done.

Jonathan Tan (2):
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-max-bytes

 Documentation/git-pack-objects.txt |  19 ++++-
 builtin/pack-objects.c             | 159 ++++++++++++++++++++++++++++++-------
 t/t5300-pack-object.sh             |  71 +++++++++++++++++
 3 files changed, 220 insertions(+), 29 deletions(-)

-- 
2.13.0.506.g27d5fe0cd-goog



* [WIP v2 1/2] pack-objects: rename want_.* to ignore_.*
  2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
                   ` (2 preceding siblings ...)
  2017-06-02 19:38 ` [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes Jonathan Tan
@ 2017-06-02 19:38 ` Jonathan Tan
  2017-06-02 19:38 ` [WIP v2 2/2] pack-objects: support --blob-max-bytes Jonathan Tan
  4 siblings, 0 replies; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02 19:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

Currently, in pack-objects, add_object_entry() distinguishes between 2
types of non-preferred-base objects:

 (1) objects that should not be in "to_pack" because an option like
     --local or --honor-pack-keep is set
 (2) objects that should be in "to_pack"

A subsequent commit will teach pack-objects to exclude certain objects
(specifically, blobs that are larger than a user-given threshold and are
not potentially a Git special file [1]), but unlike in (1) above, these
exclusions need to be reported. So now we have 3 types of
non-preferred-base objects:

 (a) objects that should not be in "to_pack" and should not be reported
     because an option like --local or --honor-pack-keep is set
 (b) objects that should not be in "to_pack" and should be reported
     because they are blobs that are oversized and non-Git-special
 (c) objects that should be in "to_pack"

add_object_entry() will be taught to distinguish between these 3 types
by the following:

 - renaming want_found_object() and want_object_in_pack() to ignore_.*
   to make it clear that they only check for (a) (this commit)
 - adding a new function to check for (b) and using it within
   add_object_entry() (a subsequent commit)

The alternative would be to retain the names want_found_object() and
want_object_in_pack() and have them handle both the cases (a) and (b),
but that would result in more complicated code.

We also take the opportunity to use the terminology "preferred_base"
instead of "excluded" in these methods. It is true that preferred bases
are not included in the final packfile generation, but at this point in
the code, there is no exclusion taking place - on the contrary, if
something is "excluded", it is in fact guaranteed to be in to_pack.

[1] For the purposes of pack-objects, a blob is a Git special file if it
    appears in a to-be-packed tree with a filename beginning with
    ".git".

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 56 +++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80439047a..730fa3d85 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -946,12 +946,12 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int ignore_found_object(int preferred_base, struct packed_git *p)
 {
-	if (exclude)
-		return 1;
-	if (incremental)
+	if (preferred_base)
 		return 0;
+	if (incremental)
+		return 1;
 
 	/*
 	 * When asked to do --local (do not include an object that appears in a
@@ -969,19 +969,19 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 */
 	if (!ignore_packed_keep &&
 	    (!local || !have_non_local_packs))
-		return 1;
+		return 0;
 
 	if (local && !p->pack_local)
-		return 0;
+		return 1;
 	if (ignore_packed_keep && p->pack_local && p->pack_keep)
-		return 0;
+		return 1;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
 /*
- * Check whether we want the object in the pack (e.g., we do not want
+ * Check whether we should ignore this object (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
  * If the caller already knows an existing pack it wants to take the object
@@ -989,16 +989,16 @@ static int want_found_object(int exclude, struct packed_git *p)
  * function finds if there is any pack that has the object and returns the pack
  * and its offset in these variables.
  */
-static int want_object_in_pack(const unsigned char *sha1,
-			       int exclude,
-			       struct packed_git **found_pack,
-			       off_t *found_offset)
+static int ignore_object(const unsigned char *sha1,
+			 int preferred_base,
+			 struct packed_git **found_pack,
+			 off_t *found_offset)
 {
 	struct mru_entry *entry;
-	int want;
+	int ignore;
 
-	if (!exclude && local && has_loose_object_nonlocal(sha1))
-		return 0;
+	if (!preferred_base && local && has_loose_object_nonlocal(sha1))
+		return 1;
 
 	/*
 	 * If we already know the pack object lives in, start checks from that
@@ -1006,9 +1006,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
-		if (want != -1)
-			return want;
+		ignore = ignore_found_object(preferred_base, *found_pack);
+		if (ignore != -1)
+			return ignore;
 	}
 
 	for (entry = packed_git_mru->head; entry; entry = entry->next) {
@@ -1027,15 +1027,15 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
+			ignore = ignore_found_object(preferred_base, p);
+			if (!preferred_base && ignore > 0)
 				mru_mark(packed_git_mru, entry);
-			if (want != -1)
-				return want;
+			if (ignore != -1)
+				return ignore;
 		}
 	}
 
-	return 1;
+	return 0;
 }
 
 static void create_object_entry(const unsigned char *sha1,
@@ -1070,16 +1070,16 @@ static const char no_closure_warning[] = N_(
 );
 
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
-			    const char *name, int exclude)
+			    const char *name, int preferred_base)
 {
 	struct packed_git *found_pack = NULL;
 	off_t found_offset = 0;
 	uint32_t index_pos;
 
-	if (have_duplicate_entry(sha1, exclude, &index_pos))
+	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, exclude, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset)) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -1089,7 +1089,7 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 	}
 
 	create_object_entry(sha1, type, pack_name_hash(name),
-			    exclude, name && no_try_delta(name),
+			    preferred_base, name && no_try_delta(name),
 			    index_pos, found_pack, found_offset);
 
 	display_progress(progress_state, nr_result);
@@ -1106,7 +1106,7 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+	if (ignore_object(sha1, 0, &pack, &offset))
 		return 0;
 
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
-- 
2.13.0.506.g27d5fe0cd-goog



* [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-06-02 19:38 ` [WIP v2 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
@ 2017-06-02 19:38 ` Jonathan Tan
  2017-06-02 22:26   ` Jeff King
  4 siblings, 1 reply; 17+ messages in thread
From: Jonathan Tan @ 2017-06-02 19:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster

As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-max-bytes, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
preventing such oversized blobs from ever going into the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.)

In addition, any such exclusions are recorded and sent before the
packfile. These are recorded in binary format and in hash order, in a
format convenient for a client to use if stored directly to disk.

The other method in which objects are added to "to_pack",
add_object_entry_from_bitmap(), has not been modified - thus packing in
the presence of bitmaps still packs all blobs regardless of size. See
the documentation update in this commit for the rationale.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-pack-objects.txt |  19 ++++++-
 builtin/pack-objects.c             | 105 ++++++++++++++++++++++++++++++++++++-
 t/t5300-pack-object.sh             |  71 +++++++++++++++++++++++++
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 8973510a4..23ff8b5be 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--blob-max-bytes=<n>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
 
 
@@ -231,6 +232,22 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	With this option, parents that are hidden by grafts are packed
 	nevertheless.
 
+--blob-max-bytes=<n>::
+	This option can only be used with --stdout. If specified, a blob
+	larger than this will not be packed unless a to-be-packed tree
+	has that blob with a filename beginning with ".git".  The size
+	can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print the
+count of excluded blobs (8 bytes). Subsequently, for each excluded blob
+in hash order, pack-objects will print its hash (20 bytes) and its size
+(8 bytes). All numbers are in network byte order.
++
+If pack-objects cannot efficiently determine, for an arbitrary blob,
+that no to-be-packed tree has that blob with a filename beginning with
+".git" (for example, if bitmaps are used to calculate the objects to be
+packed), it will pack all blobs regardless of size.
+
 SEE ALSO
 --------
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 730fa3d85..46578c05b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_max_bytes = -1;
+
 /*
  * stats
  */
@@ -1069,17 +1071,73 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+struct oversized_blob {
+	struct hashmap_entry entry;
+	struct object_id oid;
+	unsigned long size;
+};
+struct hashmap oversized_blobs;
+
+static int oversized_blob_cmp_fn(const void *b1, const void *b2,
+				 const void *keydata)
+{
+	const struct object_id *oid2 = keydata ? keydata :
+		&((const struct oversized_blob *) b2)->oid;
+	return oidcmp(&((const struct oversized_blob *) b1)->oid, oid2);
+}
+
+/*
+ * Returns 1 and records this blob in oversized_blobs if the given blob does
+ * not meet any defined blob size limits.  Blobs that appear as a tree entry
+ * whose basename begins with ".git" are never considered oversized.
+ *
+ * If the tree entry corresponding to the given blob is unknown, pass NULL as
+ * name. In this case, this function will always return 0 to avoid false
+ * positives.
+ */
+static int oversized(const unsigned char *sha1, const char *name) {
+	const char *last_slash, *basename;
+	unsigned long size;
+	unsigned int hash;
+
+	if (blob_max_bytes < 0 || !name)
+		return 0;
+
+	last_slash = strrchr(name, '/');
+	basename = last_slash ? last_slash + 1 : name;
+	if (starts_with(basename, ".git"))
+		return 0;
+
+	sha1_object_info(sha1, &size);
+	if (size <= blob_max_bytes)
+		return 0;
+
+	if (!oversized_blobs.tablesize)
+		hashmap_init(&oversized_blobs, oversized_blob_cmp_fn, 0);
+	hash = sha1hash(sha1);
+	if (!hashmap_get_from_hash(&oversized_blobs, hash, sha1)) {
+		struct oversized_blob *ob = xmalloc(sizeof(*ob));
+		hashmap_entry_init(ob, hash);
+		hashcpy(ob->oid.hash, sha1);
+		ob->size = size;
+		hashmap_add(&oversized_blobs, ob);
+	}
+	return 1;
+}
+
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int preferred_base)
 {
 	struct packed_git *found_pack = NULL;
 	off_t found_offset = 0;
 	uint32_t index_pos;
+	struct hashmap_entry entry;
 
 	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
 		return 0;
 
-	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset) ||
+	    (!preferred_base && type == OBJ_BLOB && oversized(sha1, name))) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -1088,6 +1146,15 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 		return 0;
 	}
 
+	/*
+	 * Ensure that we do not report this blob as oversized if we are going
+	 * to use it, be it in the returned pack or as a preferred base
+	 */
+	if (oversized_blobs.tablesize) {
+		hashmap_entry_init(&entry, sha1hash(sha1));
+		free(hashmap_remove(&oversized_blobs, &entry, sha1));
+	}
+
 	create_object_entry(sha1, type, pack_name_hash(name),
 			    preferred_base, name && no_try_delta(name),
 			    index_pos, found_pack, found_offset);
@@ -2864,6 +2931,35 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 	return 0;
 }
 
+static int oversized_blob_cmp(const void *b1, const void *b2)
+{
+	return oidcmp(&(*(const struct oversized_blob **) b1)->oid,
+		      &(*(const struct oversized_blob **) b2)->oid);
+}
+
+static void write_oversized_info(int fd)
+{
+	struct oversized_blob **obs = xmalloc(st_mult(oversized_blobs.size,
+						      sizeof(*obs)));
+	struct hashmap_iter iter;
+	struct oversized_blob *ob;
+	int i = 0;
+	uint64_t size_network;
+
+	for (ob = hashmap_iter_first(&oversized_blobs, &iter);
+	     ob;
+	     ob = hashmap_iter_next(&iter))
+		obs[i++] = ob;
+	QSORT(obs, oversized_blobs.size, oversized_blob_cmp);
+	size_network = htonll(oversized_blobs.size);
+	write_in_full(fd, &size_network, sizeof(size_network));
+	for (i = 0; i < oversized_blobs.size; i++) {
+		write_in_full(fd, &obs[i]->oid, sizeof(obs[i]->oid));
+		size_network = htonll(obs[i]->size);
+		write_in_full(fd, &size_network, sizeof(size_network));
+	}
+}
+
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
@@ -2947,6 +3043,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+		OPT_MAGNITUDE(0, "blob-max-bytes", &blob_max_bytes,
+			      N_("exclude and report blobs larger than this")),
 		OPT_END(),
 	};
 
@@ -3023,6 +3121,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
+	if (blob_max_bytes >= 0 && !pack_to_stdout)
+		die("if --blob-max-bytes is specified, --stdout must also be specified");
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
@@ -3089,6 +3190,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		return 0;
 	if (nr_result)
 		prepare_pack(window, depth);
+	if (blob_max_bytes >= 0)
+		write_oversized_info(1);
 	write_pack_file();
 	if (progress)
 		fprintf(stderr, "Total %"PRIu32" (delta %"PRIu32"),"
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 43a672c34..987327784 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -421,6 +421,77 @@ test_expect_success 'index-pack <pack> works in non-repo' '
 	test_path_is_file foo.idx
 '
 
+unpack() {
+	perl -e '$/ = undef; $input = <>; print unpack("H2" x length($input), $input)'
+}
+
+lcut() {
+	perl -e '$/ = undef; $_ = <>; s/^.{'$1'}//s; print $_'
+}
+
+test_expect_success '--blob-max-bytes works with multiple excluded' '
+	rm -rf server &&
+	git init server &&
+	printf a > server/a &&
+	printf b > server/b &&
+	printf c-very-long-file > server/c &&
+	printf d-very-long-file > server/d &&
+	git -C server add a b c d &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-max-bytes=10 <objects >out &&
+	unpack <out >actual &&
+
+	# Check that the hashes of the excluded blobs appear, sorted, at the beginning
+	if [ $(git hash-object server/c) \< $(git hash-object server/d) ]
+	then
+		printf "^0000000000000002%s0000000000000010%s0000000000000010" \
+			$(git hash-object server/c) \
+			$(git hash-object server/d) >expect
+	else
+		printf "^0000000000000002%s0000000000000010%s0000000000000010" \
+			$(git hash-object server/d) \
+			$(git hash-object server/c) >expect
+	fi &&
+	grep $(cat expect) actual &&
+
+	# Ensure that only the small blobs are in the packfile
+	lcut 64 <out >my.pack &&
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist &&
+	grep $(git hash-object server/b) objectlist &&
+	! grep $(git hash-object server/c) objectlist &&
+	! grep $(git hash-object server/d) objectlist
+'
+
+test_expect_success '--blob-max-bytes never excludes special files' '
+	rm -rf server &&
+	git init server &&
+	printf a-very-long-file > server/a &&
+	printf a-very-long-file > server/.git-a &&
+	printf b-very-long-file > server/b &&
+	git -C server add a .git-a b &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-max-bytes=10 <objects >out &&
+	unpack <out >actual &&
+
+	# Check that the hash of the excluded blob appears at the beginning
+	printf "^0000000000000001%s0000000000000010" \
+		$(git hash-object server/b) >expect &&
+	grep $(cat expect) actual &&
+
+	# Ensure that the .git-a blob is in the packfile, despite also
+	# appearing as a non-.git file
+	lcut 36 <out >my.pack &&
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist
+'
+
 #
 # WARNING!
 #
-- 
2.13.0.506.g27d5fe0cd-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes
  2017-06-02 19:38 ` [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes Jonathan Tan
@ 2017-06-02 22:16   ` Jeff King
  2017-06-05 17:35     ` Jonathan Tan
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2017-06-02 22:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Fri, Jun 02, 2017 at 12:38:43PM -0700, Jonathan Tan wrote:

> > Do we need to future-proof the output format so that we can later
> > use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> > is hexadecimal text, and it may not be so bad to make this also
> > text, e.g. "<hash> SP <length> LF".  That way, you do not have to
> > worry about byte order, hash size, or length limited to uint64.
> 
> The reason for using binary is for the convenience of the client to
> avoid another conversion before storing it to disk (and also network
> savings). In a large repo, I think this list will be quite large. I
> realized that I didn't mention anything about this in the commit
> message, so I have added an explanation.
> 
> I think this is sufficiently future-proof in that the format of this
> hash matches the format of the hashes used in the objects in the
> packfile. As for object size being limited to uint64, I think the
> benefits of the space savings (in using binary) outweigh the small risk
> that our files will get larger than that before we upgrade our protocol
> :-P

The rest of the pack code uses a varint encoding which is generally
much smaller than a uint64 for most files, but can handle arbitrary
sizes.
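
For reference, the encoding looks roughly like this (paraphrased from
varint.c; a sketch, not a verbatim copy):

	int encode_varint(uintmax_t value, unsigned char *buf)
	{
		unsigned char varint[16];
		unsigned pos = sizeof(varint) - 1;

		/* low 7 bits last; higher 7-bit groups carry a continuation bit */
		varint[pos] = value & 127;
		while (value >>= 7)
			varint[--pos] = 128 | (--value & 127);
		memcpy(buf, varint + pos, sizeof(varint) - pos);
		return sizeof(varint) - pos;
	}

so a one-megabyte size costs 3 bytes on the wire instead of 8.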

The one thing it loses is that you wouldn't have a fixed-size record, so
if you were planning to dump this directly to disk and binary-search it,
that won't work. OTOH, you could make pseudo-pack-entries and just
index them along with the rest of the objects in the pack .idx.

The one subtle thing there is that the pseudo-entry would have to say
"this is my sha1". And then we'd end up repeating that sha1 in the .idx
file. So it's optimal on the network but wastes 20 bytes on disk (unless
index-pack throws away the in-pack sha1s as it indexes, which is
certainly an option).

> > Can this multiplication overflow (hint: st_mult)?
> 
> Thanks - used st_mult.

Actually, I think this is a good candidate for ALLOC_ARRAY().

> > This sorting is a part of external contract (i.e. "output in hash
> > order" is documented), but is it necessary?  Just wondering.
> 
> It is convenient for the client because the client can then store it
> directly and binary search when accessing it. (Added a note about
> convenience to the commit message.)

In general the Git client doesn't trust the pack data coming from a
remote, and you can't corrupt a client by sending it bogus data. Either
the client computes it from scratch (e.g., the sha1s of each object) or
the client will reject nonsense (missing bases, refs pointing to objects
that aren't sent, etc).

I know this feature implies a certain amount of trust (after all, the
server could claim that it omitted any sha1 it likes), but we should
probably still be as strict as possible that what the other side is
sending makes sense. In this case, we should probably hashcmp() each
entry with the last and make sure they're strictly increasing (no
out-of-order and no duplicates).
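
Something like this (untested sketch; "entries" and "nr" are stand-ins
for however the client ends up holding the received list):

	for (i = 1; i < nr; i++)
		if (hashcmp(entries[i-1].oid.hash, entries[i].oid.hash) >= 0)
			die("blob list not in strictly increasing hash order");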

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02 19:38 ` [WIP v2 2/2] pack-objects: support --blob-max-bytes Jonathan Tan
@ 2017-06-02 22:26   ` Jeff King
  2017-06-02 23:25     ` Jonathan Nieder
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Jeff King @ 2017-06-02 22:26 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

> +--blob-max-bytes=<n>::
> +	This option can only be used with --stdout. If specified, a blob
> +	larger than this will not be packed unless a to-be-packed tree
> +	has that blob with a filename beginning with ".git".  The size
> +	can be suffixed with "k", "m", or "g", and may be "0".
> ++
> +If specified, after printing the packfile, pack-objects will print the
> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
> +in hash order, pack-objects will print its hash (20 bytes) and its size
> +(8 bytes). All numbers are in network byte order.
> ++
> +If pack-objects cannot efficiently determine, for an arbitrary blob,
> +that no to-be-packed tree has that blob with a filename beginning with
> +".git" (for example, if bitmaps are used to calculate the objects to be
> +packed), it will pack all blobs regardless of size.

Hmm. So this feature can't be used with bitmaps at all?  That seems like
a big downside, as we still have to walk the whole graph to come up with
the list of blobs (even if we're sending them). That's 30-40 seconds of
CPU per clone on torvalds/linux. I imagine it's much worse on
repositories big enough to need a full GVFS-style "don't send any blobs"
approach.

We have a name-hash cache extension in the bitmap file, but it doesn't
carry enough information to deduce the .git-ness of a file. I don't
think it would be too hard to add a "flags" extension, and give a single
bit to "this is a .git file".

I do also wonder if the two features would need to be separated for a
GVFS-style solution. If you're not just avoiding large blobs but trying
to get a narrow clone, you don't want the .git files from the
uninteresting parts of the tree. You want to get no blobs at all, and
then fault them in as they become relevant due to user action.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02 22:26   ` Jeff King
@ 2017-06-02 23:25     ` Jonathan Nieder
  2017-06-07  9:44       ` Jeff King
  2017-06-03 23:48     ` Junio C Hamano
  2017-06-15 20:28     ` Jeff Hostetler
  2 siblings, 1 reply; 17+ messages in thread
From: Jonathan Nieder @ 2017-06-02 23:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git, gitster

Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:

>> +--blob-max-bytes=<n>::
>> +	This option can only be used with --stdout. If specified, a blob
>> +	larger than this will not be packed unless a to-be-packed tree
>> +	has that blob with a filename beginning with ".git".  The size
>> +	can be suffixed with "k", "m", or "g", and may be "0".
>> ++
>> +If specified, after printing the packfile, pack-objects will print the
>> +count of excluded blobs (8 bytes). Subsequently, for each excluded blob
>> +in hash order, pack-objects will print its hash (20 bytes) and its size
>> +(8 bytes). All numbers are in network byte order.
>> ++
>> +If pack-objects cannot efficiently determine, for an arbitrary blob,
>> +that no to-be-packed tree has that blob with a filename beginning with
>> +".git" (for example, if bitmaps are used to calculate the objects to be
>> +packed), it will pack all blobs regardless of size.
>
> Hmm. So this feature can't be used with bitmaps at all?  That seems like
> a big downside, as we still have to walk the whole graph to come up with
> the list of blobs (even if we're sending them). That's 30-40 seconds of
> CPU per clone on torvalds/linux. I imagine it's much worse on
> repositories big enough to need a full GVFS-style "don't send any blobs"
> approach.
>
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".

A nicer approach IMHO is to include an extra bitmap, like the existing
object-type bitmaps (see the dump_bitmap calls in
bitmap_writer_finish).  This would represent the set of all
.git* blobs in the pack.
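
In bitmap_writer_finish() that could look something like this (sketch
only; the git_files bitmap and its option bit are names I'm making up):

	dump_bitmap(f, writer.commits);
	dump_bitmap(f, writer.trees);
	dump_bitmap(f, writer.blobs);
	dump_bitmap(f, writer.tags);
	if (options & BITMAP_OPT_GIT_FILES)	/* hypothetical */
		dump_bitmap(f, writer.git_files);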

[...]
>                      If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree.

This is something I've wondered about, too.  Part of the story is that
we haven't started omitting trees, so there are already O(number of
trees) objects being sent, and some additional small blobs for .git*
specials don't make it much worse.  Sending all .git* blobs keeps
things simple since the server doesn't have to infer which .git* blobs
are relevant to this client.

Longer term, we will likely want to allow clients to request omission
of some trees, too.  Omitting the corresponding .git* files becomes
more straightforward at that point.

Does that make sense?

Jonathan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02 22:26   ` Jeff King
  2017-06-02 23:25     ` Jonathan Nieder
@ 2017-06-03 23:48     ` Junio C Hamano
  2017-06-15 20:28     ` Jeff Hostetler
  2 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2017-06-03 23:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.

Thanks; I didn't think of this one while reviewing the overall design,
but I agree that this is something important to think about.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes
  2017-06-02 22:16   ` Jeff King
@ 2017-06-05 17:35     ` Jonathan Tan
  2017-06-07  9:46       ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Jonathan Tan @ 2017-06-05 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster

Thanks for your comments.

On Fri, 2 Jun 2017 18:16:45 -0400
Jeff King <peff@peff.net> wrote:

> On Fri, Jun 02, 2017 at 12:38:43PM -0700, Jonathan Tan wrote:
> 
> > > Do we need to future-proof the output format so that we can later
> > > use 32-byte hash?  The input to pack-objects (i.e. rev-list --objects)
> > > is hexadecimal text, and it may not be so bad to make this also
> > > text, e.g. "<hash> SP <length> LF".  That way, you do not have to
> > > worry about byte order, hash size, or length limited to uint64.
> > 
> > The reason for using binary is for the convenience of the client to
> > avoid another conversion before storing it to disk (and also network
> > savings). In a large repo, I think this list will be quite large. I
> > realized that I didn't mention anything about this in the commit
> > message, so I have added an explanation.
> > 
> > I think this is sufficiently future-proof in that the format of this
> > hash matches the format of the hashes used in the objects in the
> > packfile. As for object size being limited to uint64, I think the
> > benefits of the space savings (in using binary) outweigh the small risk
> > that our files will get larger than that before we upgrade our protocol
> > :-P
> 
> The rest of the pack code uses a varint encoding which is generally
> much smaller than a uint64 for most files, but can handle arbitrary
> sizes.
> 
> The one thing it loses is that you wouldn't have a fixed-size record, so
> if you were planning to dump this directly to disk and binary-search it,
> that won't work. OTOH, you could make pseudo-pack-entries and just
> index them along with the rest of the objects in the pack .idx.
> 
> The one subtle thing there is that the pseudo-entry would have to say
> "this is my sha1". And then we'd end up repeating that sha1 in the .idx
> file. So it's optimal on the network but wastes 20 bytes on disk (unless
> index-pack throws away the in-pack sha1s as it indexes, which is
> certainly an option).

If we end up going with the varint approach (which seems reasonable),
maybe the client could just expand the varints into uint64s so that it
has a binary-searchable file. I think it's better to keep this list
separate from the pack .idx file (there has been some discussion on this
- [1] and its replies).

[1] https://public-inbox.org/git/777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com/
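
The expansion could be a simple copy loop, e.g. (sketch; "p" and "end"
walk the received list, "fd" is the output file, and decode_varint()
is the reader in varint.c):

	while (p < end) {
		uint64_t size;

		write_in_full(fd, p, 20);	/* the 20-byte hash, as-is */
		p += 20;
		size = htonll(decode_varint(&p));
		write_in_full(fd, &size, sizeof(size));	/* fixed 8 bytes */
	}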

> > > Can this multiplication overflow (hint: st_mult)?
> > 
> > Thanks - used st_mult.
> 
> Actually, I think this is a good candidate for ALLOC_ARRAY().

Thanks - I've made this change in my local version.

> > > This sorting is a part of external contract (i.e. "output in hash
> > > order" is documented), but is it necessary?  Just wondering.
> > 
> > It is convenient for the client because the client can then store it
> > directly and binary search when accessing it. (Added a note about
> > convenience to the commit message.)
> 
> In general the Git client doesn't trust the pack data coming from a
> remote, and you can't corrupt a client by sending it bogus data. Either
> the client computes it from scratch (e.g., the sha1s of each object) or
> the client will reject nonsense (missing bases, refs pointing to objects
> that aren't sent, etc).
> 
> I know this feature implies a certain amount of trust (after all, the
> server could claim that it omitted any sha1 it likes), but we should
> probably still be as strict as possible that what the other side is
> sending makes sense. In this case, we should probably hashcmp() each
> entry with the last and make sure they're strictly increasing (no
> out-of-order and no duplicates).

Good point.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02 23:25     ` Jonathan Nieder
@ 2017-06-07  9:44       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2017-06-07  9:44 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git, gitster

On Fri, Jun 02, 2017 at 04:25:08PM -0700, Jonathan Nieder wrote:

> > We have a name-hash cache extension in the bitmap file, but it doesn't
> > carry enough information to deduce the .git-ness of a file. I don't
> > think it would be too hard to add a "flags" extension, and give a single
> > bit to "this is a .git file".
> 
> A nicer approach IMHO is to include an extra bitmap, like the existing
> object-type bitmaps (see the dump_bitmap calls in
> bitmap_writer_finish).  This would represent the set of all
> .git* blobs in the pack.

Yeah, it could be stored as a bitmap, which would be slightly smaller
(since it would be mostly 0's). I think either way it would need an
extension flag in the header to signal its presence.
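
For concreteness, that could be one more bit next to the existing ones
in pack-bitmap.h (the BITMAP_OPT_GIT_FILES name is made up):

	enum pack_bitmap_opts {
		BITMAP_OPT_FULL_DAG = 1,
		BITMAP_OPT_HASH_CACHE = 4,
		BITMAP_OPT_GIT_FILES = 8,	/* hypothetical */
	};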

Older versions of Git are OK with having flags they don't understand. I
know JGit used to complain about seeing a bitmap with unknown flags, but
I'm not sure if that is still the case.

> >                      If you're not just avoiding large blobs but trying
> > to get a narrow clone, you don't want the .git files from the
> > uninteresting parts of the tree.
> 
> This is something I've wondered about, too.  Part of the story is that
> we haven't started omitting trees, so there are already O(number of
> trees) objects being sent, and some additional small blobs for .git*
> specials don't make it much worse.  Sending all .git* blobs keeps
> things simple since the server doesn't have to infer which .git* blobs
> are relevant to this client.
> 
> Longer term, we will likely want to allow clients to request omission
> of some trees, too.  Omitting the corresponding .git* files becomes
> more straightforward at that point.
> 
> Does that make sense?

Yeah, I agree we'd want to avoid the trees, too, in that case.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes
  2017-06-05 17:35     ` Jonathan Tan
@ 2017-06-07  9:46       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2017-06-07  9:46 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Mon, Jun 05, 2017 at 10:35:23AM -0700, Jonathan Tan wrote:

> > The rest of the pack code uses a varint encoding which is generally
> > much smaller than a uint64 for most files, but can handle arbitrary
> > sizes.
> > 
> > The one thing it loses is that you wouldn't have a fixed-size record, so
> > if you were planning to dump this directly to disk and binary-search it,
> > that won't work. OTOH, you could make pseudo-pack-entries and just
> > index them along with the rest of the objects in the pack .idx.
> > 
> > The one subtle thing there is that the pseudo-entry would have to say
> > "this is my sha1". And then we'd end up repeating that sha1 in the .idx
> > file. So it's optimal on the network but wastes 20 bytes on disk (unless
> > index-pack throws away the in-pack sha1s as it indexes, which is
> > certainly an option).
> 
> If we end up going with the varint approach (which seems reasonable),
> maybe the client could just expand the varints into uint64s so that it
> has a binary-searchable file. I think it's better to keep this list
> separate from the pack .idx file (there has been some discussion on this
> - [1] and its replies).
> 
> [1] https://public-inbox.org/git/777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com/

OK. If we're keeping it separate anyway, then I agree that just
expanding the varints is a good solution. And we don't have to care too
much about the local storage, because it can be changed out later
without touching the on-the-wire protocol.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-02 22:26   ` Jeff King
  2017-06-02 23:25     ` Jonathan Nieder
  2017-06-03 23:48     ` Junio C Hamano
@ 2017-06-15 20:28     ` Jeff Hostetler
  2017-06-15 21:03       ` Jonathan Tan
  2 siblings, 1 reply; 17+ messages in thread
From: Jeff Hostetler @ 2017-06-15 20:28 UTC (permalink / raw)
  To: Jeff King, Jonathan Tan; +Cc: git, gitster



On 6/2/2017 6:26 PM, Jeff King wrote:
> On Fri, Jun 02, 2017 at 12:38:45PM -0700, Jonathan Tan wrote:
...
> We have a name-hash cache extension in the bitmap file, but it doesn't
> carry enough information to deduce the .git-ness of a file. I don't
> think it would be too hard to add a "flags" extension, and give a single
> bit to "this is a .git file".
> 
> I do also wonder if the two features would need to be separated for a
> GVFS-style solution. If you're not just avoiding large blobs but trying
> to get a narrow clone, you don't want the .git files from the
> uninteresting parts of the tree. You want to get no blobs at all, and
> then fault them in as they become relevant due to user action.
> 
> -Peff
> 

I agree with Peff here.  I've been working on my partial/narrow/sparse
clone/fetch ideas since my original RFC and have come to the conclusion
that the server can do the size limiting efficiently, but we should
leave the pathname filtering to the client.  That is, let the client
get the commits and trees and then locally apply pattern matching,
whether that be a sparse-checkout definition or simple ".git*"
matching and then make a later request for the "blobs of interest".

Whether we "fault-in" the missing blobs or have a "fetch-blobs"
command (like fetch-pack) to get them is open to debate.

There are concerns about the size of the requested blob-id list in a
fetch-blobs approach, but I think there are ways to say I need all
of the blobs referenced by the directory /foo in commit xxxx (and
optionally, that aren't present in directory /foo in commit yyyy
or in the range yyyy..xxxx).  (handwave)

Jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WIP v2 2/2] pack-objects: support --blob-max-bytes
  2017-06-15 20:28     ` Jeff Hostetler
@ 2017-06-15 21:03       ` Jonathan Tan
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Tan @ 2017-06-15 21:03 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Jeff King, git, gitster

On Thu, 15 Jun 2017 16:28:24 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> I agree with Peff here.  I've been working on my partial/narrow/sparse
> clone/fetch ideas since my original RFC and have come to the conclusion
> that the server can do the size limiting efficiently, but we should
> leave the pathname filtering to the client.  That is, let the client
> get the commits and trees and then locally apply pattern matching,
> whether that be a sparse-checkout definition or simple ".git*"
> matching and then make a later request for the "blobs of interest".

This means that the client would need to inflate all the trees it
received, but that might not be that bad.
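
That local matching could reuse the usual tree-walk API, something like
(sketch only; match_pattern() and record_wanted_blob() are stand-ins
for whatever the client actually does, and "buffer"/"size" hold an
inflated tree object):

	struct tree_desc desc;
	struct name_entry entry;

	init_tree_desc(&desc, buffer, size);
	while (tree_entry(&desc, &entry))
		if (!S_ISDIR(entry.mode) && match_pattern(entry.path))
			record_wanted_blob(entry.oid);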

> Whether we "fault-in" the missing blobs or have a "fetch-blobs"
> command (like fetch-pack) to get them is open to debate.
> 
> There are concerns about the size of the requested blob-id list in a
> fetch-blobs approach, but I think there are ways to say I need all
> of the blobs referenced by the directory /foo in commit xxxx (and
> optionally, that aren't present in directory /foo in commit yyyy
> or in the range yyyy..xxxx).  (handwave)

Unless the server no longer has the relevant commit (e.g. because a
branch was rewound), but even then, even if we only sent blob hashes,
the list would probably be much smaller than the downloaded pack anyway,
so things might still be OK.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2017-06-02  0:14 [WIP 0/2] Modifying pack-objects to support --blob-size-limit Jonathan Tan
2017-06-02  0:14 ` [WIP 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
2017-06-02  3:45   ` Junio C Hamano
2017-06-02  0:14 ` [WIP 2/2] pack-objects: support --blob-size-limit Jonathan Tan
2017-06-02  4:48   ` Junio C Hamano
2017-06-02 19:38 ` [WIP v2 0/2] Modifying pack-objects to support --blob-max-bytes Jonathan Tan
2017-06-02 22:16   ` Jeff King
2017-06-05 17:35     ` Jonathan Tan
2017-06-07  9:46       ` Jeff King
2017-06-02 19:38 ` [WIP v2 1/2] pack-objects: rename want_.* to ignore_.* Jonathan Tan
2017-06-02 19:38 ` [WIP v2 2/2] pack-objects: support --blob-max-bytes Jonathan Tan
2017-06-02 22:26   ` Jeff King
2017-06-02 23:25     ` Jonathan Nieder
2017-06-07  9:44       ` Jeff King
2017-06-03 23:48     ` Junio C Hamano
2017-06-15 20:28     ` Jeff Hostetler
2017-06-15 21:03       ` Jonathan Tan
