git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] WIP list-objects and pack-objects for partial clone
@ 2017-06-22 20:36 Jeff Hostetler
  2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-22 20:36 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jrnieder, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

This WIP is a follow up to earlier patches to teach pack-objects
to omit large blobs from packfiles.  This doesn't attempt to solve
the whole end-to-end problem of partial/sparse clone/fetch or that
of the client operating with missing blobs.  This WIP is for now
limited to building the packfile with omitted blobs and hopefully
can mesh nicely with Jonathan Tan's work in [3].

It supports filtering by size while always including blobs associated
with ".git*" paths, something we both proposed in [1] and [3].

The approach here differs from [1] and [3] in that it extends
traverse_commit_list() to allow custom blob filtering using a new
callback provided by pack-objects.  This should make it easier to
do other filters laters.  Part of this based upon Peff's suggestion
about rev-list in [2].  I have not updated the rev-list command,
but rather the routines in list-objects.c that it calls.  Jonathan's
ideas in [3] to build and send the omitted blobs list means that I
think we need pack-objects.c manage the filter-proc used here.

I considered, but omitted from this version, ideas to allow the
filter-proc to know of the process_tree() boundaries which might
let pack-objects filter by sub-tree (think sparse-checkout) as
suggested in [4] and various replies.

[1] https://public-inbox.org/git/1488994685-37403-3-git-send-email-jeffhost@microsoft.com/
[2] https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@sigill.intra.peff.net/
[3] https://public-inbox.org/git/cover.1496361873.git.jonathantanmy@google.com/
[4] https://public-inbox.org/git/20170602232508.GA21733@aiede.mtv.corp.google.com/


Jeff Hostetler (3):
  list-objects: add filter_blob to traverse_commit_list
  pack-objects: WIP add max-blob-size filtering
  pack-objects: add t5317 to test max-blob-size

 builtin/pack-objects.c                 | 76 +++++++++++++++++++++++++++++++++-
 list-objects.c                         | 39 +++++++++++++++--
 list-objects.h                         |  8 ++++
 t/t5317-pack-objects-blob-filtering.sh | 68 ++++++++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 5 deletions(-)
 create mode 100644 t/t5317-pack-objects-blob-filtering.sh

-- 
2.9.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-22 20:36 [PATCH 0/3] WIP list-objects and pack-objects for partial clone Jeff Hostetler
@ 2017-06-22 20:36 ` Jeff Hostetler
  2017-06-22 21:45   ` Jonathan Tan
  2017-06-28 16:23   ` Junio C Hamano
  2017-06-22 20:36 ` [PATCH 2/3] pack-objects: WIP add max-blob-size filtering Jeff Hostetler
  2017-06-22 20:36 ` [PATCH 3/3] pack-objects: add t5317 to test max-blob-size Jeff Hostetler
  2 siblings, 2 replies; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-22 20:36 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jrnieder, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

In preparation for partial/sparse clone/fetch where the
server is allowed to omit large/all blobs from the packfile,
teach traverse_commit_list() to take a blob filter-proc that
controls when blobs are shown and marked as SEEN.

Normally, traverse_commit_list() always marks visited blobs
as SEEN, so that the show_object() callback will never see
duplicates.  Since a single blob OID may be referenced by
multiple pathnames, we may not be able to decide if a blob
should be excluded until later pathnames have been traversed.
With the filter-proc, the automatic deduping is disabled.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 list-objects.c | 39 +++++++++++++++++++++++++++++++++++----
 list-objects.h |  8 ++++++++
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index f3ca6aa..c9ca81c 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -11,6 +11,7 @@
 static void process_blob(struct rev_info *revs,
 			 struct blob *blob,
 			 show_object_fn show,
+			 filter_blob_fn filter_blob,
 			 struct strbuf *path,
 			 const char *name,
 			 void *cb_data)
@@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
-	obj->flags |= SEEN;
 
 	pathlen = path->len;
 	strbuf_addstr(path, name);
-	show(obj, path->buf, cb_data);
+	if (!filter_blob) {
+		/*
+		 * Normal processing is to imediately dedup blobs
+		 * during commit traversal, regardless of how many
+		 * times it appears in a single or multiple commits,
+		 * so we always set SEEN.
+		 */
+		obj->flags |= SEEN;
+		show(obj, path->buf, cb_data);
+	} else {
+		/*
+		 * Use the filter-proc to decide whether to show
+		 * the blob.  We only set SEEN if requested.  For
+		 * example, this could be used to omit a specific
+		 * blob until it appears with a ".git*" entryname.
+		 */
+		if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
+			obj->flags |= SEEN;
+	}
 	strbuf_setlen(path, pathlen);
 }
 
@@ -67,6 +85,7 @@ static void process_gitlink(struct rev_info *revs,
 static void process_tree(struct rev_info *revs,
 			 struct tree *tree,
 			 show_object_fn show,
+			 filter_blob_fn filter_blob,
 			 struct strbuf *base,
 			 const char *name,
 			 void *cb_data)
@@ -111,7 +130,7 @@ static void process_tree(struct rev_info *revs,
 		if (S_ISDIR(entry.mode))
 			process_tree(revs,
 				     lookup_tree(entry.oid->hash),
-				     show, base, entry.path,
+				     show, filter_blob, base, entry.path,
 				     cb_data);
 		else if (S_ISGITLINK(entry.mode))
 			process_gitlink(revs, entry.oid->hash,
@@ -120,7 +139,7 @@ static void process_tree(struct rev_info *revs,
 		else
 			process_blob(revs,
 				     lookup_blob(entry.oid->hash),
-				     show, base, entry.path,
+				     show, filter_blob, base, entry.path,
 				     cb_data);
 	}
 	strbuf_setlen(base, baselen);
@@ -188,6 +207,16 @@ void traverse_commit_list(struct rev_info *revs,
 			  show_object_fn show_object,
 			  void *data)
 {
+	traverse_commit_list_filtered(revs, show_commit, show_object, NULL, data);
+}
+
+void traverse_commit_list_filtered(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	filter_blob_fn filter_blob,
+	void *data)
+{
 	int i;
 	struct commit *commit;
 	struct strbuf base;
@@ -218,11 +247,13 @@ void traverse_commit_list(struct rev_info *revs,
 			path = "";
 		if (obj->type == OBJ_TREE) {
 			process_tree(revs, (struct tree *)obj, show_object,
+				     filter_blob,
 				     &base, path, data);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
 			process_blob(revs, (struct blob *)obj, show_object,
+				     filter_blob,
 				     &base, path, data);
 			continue;
 		}
diff --git a/list-objects.h b/list-objects.h
index 0cebf85..7f823aa 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -3,7 +3,15 @@
 
 typedef void (*show_commit_fn)(struct commit *, void *);
 typedef void (*show_object_fn)(struct object *, const char *, void *);
+typedef int  (*filter_blob_fn)(struct object *, const char *, const char *, void *);
+
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
+void traverse_commit_list_filtered(
+    struct rev_info *revs,
+    show_commit_fn show_commit,
+    show_object_fn show_object,
+    filter_blob_fn filter_blob,
+    void *data);
 
 typedef void (*show_edge_fn)(struct commit *);
 void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] pack-objects: WIP add max-blob-size filtering
  2017-06-22 20:36 [PATCH 0/3] WIP list-objects and pack-objects for partial clone Jeff Hostetler
  2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
@ 2017-06-22 20:36 ` Jeff Hostetler
  2017-06-22 21:54   ` Jonathan Tan
  2017-06-22 20:36 ` [PATCH 3/3] pack-objects: add t5317 to test max-blob-size Jeff Hostetler
  2 siblings, 1 reply; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-22 20:36 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jrnieder, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach pack-objects command to accept --max-blob-size=<n> argument
and use a traverse_commit_list filter-proc to omit unwanted blobs
from the resulting packfile.

This filter-proc always includes special files matching ".git*"
(such as ".gitignore") and blobs smaller than <n>.  <n> is a
magnitude value and accepts [kmg] suffixes.  A value of zero
can be used to omit all blobs (except for special files).

There are 2 placeholder TODOs in this code to talk about building
an omitted-blob list for the client.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/pack-objects.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 50e01aa..cdcd4d6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static signed long max_blob_size = -1;
+
 /*
  * stats
  */
@@ -2519,6 +2521,7 @@ static void read_object_list_from_stdin(void)
 }
 
 #define OBJECT_ADDED (1u<<20)
+#define BLOB_OMITTED (1u<<21)
 
 static void show_commit(struct commit *commit, void *data)
 {
@@ -2536,6 +2539,70 @@ static void show_object(struct object *obj, const char *name, void *data)
 	obj->flags |= OBJECT_ADDED;
 }
 
+/*
+ * Filter blobs by pathname or size.
+ * Return 1 to mark the blob SEEN so that it will not be reported again.
+ * Return 0 to allow it to be presented again.
+ */
+static int filter_blob(
+	struct object *obj,
+	const char *pathname,
+	const char *entryname,
+	void *data)
+{
+	assert(obj->type == OBJ_BLOB);
+	assert((obj->flags & SEEN) == 0);
+	assert((obj->flags & OBJECT_ADDED) == 0);
+	assert(max_blob_size >= 0);
+
+	/*
+	 * Always include blobs for special files of the form ".git*".
+	 */
+	if ((strncmp(entryname, ".git", 4) == 0) && entryname[4]) {
+		if (obj->flags & BLOB_OMITTED) {
+			/*
+			 * TODO
+			 * TODO Remove this blob from the omitted blob list.
+			 * TODO
+			 */
+			obj->flags &= ~BLOB_OMITTED;
+		}
+		show_object(obj, pathname, data);
+		return 1;
+	}
+
+	/*
+	 * We already know the blob is too big because it was previously
+	 * omitted.  We still don't want it yet.  DO NOT mark it SEEN
+	 * in case it is associated with a ".git*" path in another tree
+	 * or commit.
+	 */
+	if (obj->flags & BLOB_OMITTED)
+		return 0;
+
+	/*
+	 * We only want blobs that are LESS THAN the maximum.
+	 * This allows zero to mean NO BLOBS.
+	 */
+	if (max_blob_size > 0) {
+		unsigned long s;
+		enum object_type t = sha1_object_info(obj->oid.hash, &s);
+		assert(t == OBJ_BLOB);
+		if (s < max_blob_size) {
+			show_object(obj, pathname, data);
+			return 1;
+		}
+	}
+
+	/*
+	 * TODO
+	 * TODO (Provisionally) add this blob to the omitted blob list.
+	 * TODO
+	 */
+	obj->flags |= BLOB_OMITTED;
+	return 0;
+}
+
 static void show_edge(struct commit *commit)
 {
 	add_preferred_base(commit->object.oid.hash);
@@ -2800,7 +2867,12 @@ static void get_object_list(int ac, const char **av)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	mark_edges_uninteresting(&revs, show_edge);
-	traverse_commit_list(&revs, show_commit, show_object, NULL);
+
+	if (max_blob_size == -1)
+		traverse_commit_list(&revs, show_commit, show_object, NULL);
+	else
+		traverse_commit_list_filtered(&revs, show_commit, show_object,
+			filter_blob, NULL);
 
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
@@ -2936,6 +3008,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+		OPT_MAGNITUDE(0, "max-blob-size", (unsigned long *)&max_blob_size,
+					  N_("omit large blobs from packfile")),
 		OPT_END(),
 	};
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] pack-objects: add t5317 to test max-blob-size
  2017-06-22 20:36 [PATCH 0/3] WIP list-objects and pack-objects for partial clone Jeff Hostetler
  2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
  2017-06-22 20:36 ` [PATCH 2/3] pack-objects: WIP add max-blob-size filtering Jeff Hostetler
@ 2017-06-22 20:36 ` Jeff Hostetler
  2 siblings, 0 replies; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-22 20:36 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jrnieder, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/t5317-pack-objects-blob-filtering.sh | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 t/t5317-pack-objects-blob-filtering.sh

diff --git a/t/t5317-pack-objects-blob-filtering.sh b/t/t5317-pack-objects-blob-filtering.sh
new file mode 100644
index 0000000..58124ab
--- /dev/null
+++ b/t/t5317-pack-objects-blob-filtering.sh
@@ -0,0 +1,68 @@
+#!/bin/sh
+
+test_description='pack-objects blob filtering'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	printf "%10s"   X >x10.txt   &&
+	printf "%100s"  X >x100.txt  &&
+	printf "%1000s" X >x1000.txt &&
+	git add *.txt &&
+	git commit -m txt
+'
+
+test_expect_success 'all blobs' '
+	test_when_finished "rm -f *.pack *.idx" &&
+	git pack-objects --revs --thin --stdout >z.pack <<-EOF &&
+	master
+
+	EOF
+	git index-pack z.pack &&
+	test 3 = $(git verify-pack -v z.pack | grep blob | wc -l)
+'
+
+test_expect_success 'no blobs (max equals 0)' '
+	test_when_finished "rm -f *.pack *.idx" &&
+	git pack-objects --revs --thin --stdout --max-blob-size=0 >z.pack <<-EOF &&
+	master
+
+	EOF
+	git index-pack z.pack &&
+	test 0 = $(git verify-pack -v z.pack | grep blob | wc -l)
+'
+
+test_expect_success 'small 20 blobs' '
+	test_when_finished "rm -f *.pack *.idx" &&
+	git pack-objects --revs --thin --stdout --max-blob-size=20 >z.pack <<-EOF &&
+	master
+
+	EOF
+	git index-pack z.pack &&
+	test 1 = $(git verify-pack -v z.pack | grep blob | wc -l)
+'
+
+test_expect_success 'small 200 blobs' '
+	test_when_finished "rm -f *.pack *.idx" &&
+	git pack-objects --revs --thin --stdout --max-blob-size=200 >z.pack <<-EOF &&
+	master
+
+	EOF
+	git index-pack z.pack &&
+	test 2 = $(git verify-pack -v z.pack | grep blob | wc -l)
+'
+
+test_expect_success 'special files always present' '
+	test_when_finished "rm -f *.pack *.idx" &&
+	cp x1000.txt .gitignore &&
+	git add .gitignore &&
+	git commit -m "add ignores" &&
+	git pack-objects --revs --stdout --max-blob-size=0 >z.pack <<-EOF &&
+	master
+
+	EOF
+	git index-pack z.pack &&
+	test 1 = $(git verify-pack -v z.pack | grep blob | wc -l)
+'
+
+test_done
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
@ 2017-06-22 21:45   ` Jonathan Tan
  2017-06-22 22:10     ` Jonathan Tan
  2017-06-28 16:23   ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Jonathan Tan @ 2017-06-22 21:45 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jrnieder, Jeff Hostetler

On Thu, 22 Jun 2017 20:36:13 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> In preparation for partial/sparse clone/fetch where the
> server is allowed to omit large/all blobs from the packfile,
> teach traverse_commit_list() to take a blob filter-proc that
> controls when blobs are shown and marked as SEEN.
> 
> Normally, traverse_commit_list() always marks visited blobs
> as SEEN, so that the show_object() callback will never see
> duplicates.  Since a single blob OID may be referenced by
> multiple pathnames, we may not be able to decide if a blob
> should be excluded until later pathnames have been traversed.
> With the filter-proc, the automatic deduping is disabled.

Comparing this approach (this patch and the next one) against mine [1],
I see that this has the advantage of (in pack-objects) avoiding the
invocation of add_preferred_base_object() on blobs that are filtered
out, avoiding polluting the "to_pack" data structure with information
that we do not need.

But it does add an extra place where blobs are filtered out (besides
add_object_entry()), which makes it harder for the reader to keep track
of what's going on. I took a brief look to see if filtering could be
eliminated from add_object_entry(), but that function is called from
many places, so I couldn't tell.

Anyway, I think both approaches will work (this patch's and mine [1]).

[1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/

[snip]

>  static void process_blob(struct rev_info *revs,
>  			 struct blob *blob,
>  			 show_object_fn show,
> +			 filter_blob_fn filter_blob,
>  			 struct strbuf *path,
>  			 const char *name,
>  			 void *cb_data)
> @@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
>  		die("bad blob object");
>  	if (obj->flags & (UNINTERESTING | SEEN))
>  		return;
> -	obj->flags |= SEEN;
>  
>  	pathlen = path->len;
>  	strbuf_addstr(path, name);
> -	show(obj, path->buf, cb_data);
> +	if (!filter_blob) {
> +		/*
> +		 * Normal processing is to imediately dedup blobs
> +		 * during commit traversal, regardless of how many
> +		 * times it appears in a single or multiple commits,
> +		 * so we always set SEEN.
> +		 */
> +		obj->flags |= SEEN;
> +		show(obj, path->buf, cb_data);
> +	} else {
> +		/*
> +		 * Use the filter-proc to decide whether to show
> +		 * the blob.  We only set SEEN if requested.  For
> +		 * example, this could be used to omit a specific
> +		 * blob until it appears with a ".git*" entryname.
> +		 */
> +		if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
> +			obj->flags |= SEEN;
> +	}
>  	strbuf_setlen(path, pathlen);
>  }

After looking at this code again, I wonder if we should explicitly
document that SEEN will be set on the object before show(), and that
show() is allowed to unset SEEN if it wants traversal to process that
object again. That seems to accomplish what we want to accomplish
without this expansion of the API.

(This does mean that the next patch will need to call strrchr() itself,
since show() does not provide the basename of the file name.)

Having said that, if we keep with the current approach, I think there
should be a show() call after the invocation to filter_blob(). That is
much less surprising to me, and also allows us to remove some
show_object() invocations in the next patch.

> +void traverse_commit_list_filtered(
> +	struct rev_info *revs,
> +	show_commit_fn show_commit,
> +	show_object_fn show_object,
> +	filter_blob_fn filter_blob,
> +	void *data)
> +{
>  	int i;
>  	struct commit *commit;
>  	struct strbuf base;

Git style is to indent to the location after the first "(", I believe.
Likewise in the header file below.

>  typedef void (*show_commit_fn)(struct commit *, void *);
>  typedef void (*show_object_fn)(struct object *, const char *, void *);
> +typedef int  (*filter_blob_fn)(struct object *, const char *, const char *, void *);

Can this have parameter names added (especially both the const char *
ones, otherwise indistinguishable) and, if necessary, documented?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] pack-objects: WIP add max-blob-size filtering
  2017-06-22 20:36 ` [PATCH 2/3] pack-objects: WIP add max-blob-size filtering Jeff Hostetler
@ 2017-06-22 21:54   ` Jonathan Tan
  2017-06-22 22:14     ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Tan @ 2017-06-22 21:54 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jrnieder, Jeff Hostetler

On Thu, 22 Jun 2017 20:36:14 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> +static signed long max_blob_size = -1;

FYI Junio suggested "blob-max-bytes" when he looked at my patch [1].

[1] https://public-inbox.org/git/xmqqmv9ryoym.fsf@gitster.mtv.corp.google.com/

[snip]

> +/*
> + * Filter blobs by pathname or size.
> + * Return 1 to mark the blob SEEN so that it will not be reported again.
> + * Return 0 to allow it to be presented again.
> + */
> +static int filter_blob(
> +	struct object *obj,
> +	const char *pathname,
> +	const char *entryname,
> +	void *data)
> +{
> +	assert(obj->type == OBJ_BLOB);
> +	assert((obj->flags & SEEN) == 0);
> +	assert((obj->flags & OBJECT_ADDED) == 0);
> +	assert(max_blob_size >= 0);
> +
> +	/*
> +	 * Always include blobs for special files of the form ".git*".
> +	 */
> +	if ((strncmp(entryname, ".git", 4) == 0) && entryname[4]) {
> +		if (obj->flags & BLOB_OMITTED) {
> +			/*
> +			 * TODO
> +			 * TODO Remove this blob from the omitted blob list.
> +			 * TODO
> +			 */
> +			obj->flags &= ~BLOB_OMITTED;
> +		}
> +		show_object(obj, pathname, data);
> +		return 1;
> +	}
> +
> +	/*
> +	 * We already know the blob is too big because it was previously
> +	 * omitted.  We still don't want it yet.  DO NOT mark it SEEN
> +	 * in case it is associated with a ".git*" path in another tree
> +	 * or commit.
> +	 */
> +	if (obj->flags & BLOB_OMITTED)
> +		return 0;
> +
> +	/*
> +	 * We only want blobs that are LESS THAN the maximum.
> +	 * This allows zero to mean NO BLOBS.

That is not what maximum means...

This is the reason why I originally called it "limit", but after some
reflection, I decided that it is no big deal to always send zero-sized
blobs. The client must be able to handle the presence of blobs anyway
(because it will receive the ".git" ones), and excluding all blobs
regardless of size does not remove the necessity of obtaining their
sizes, since we need those sizes to put in the omitted blob list.

> +	 */
> +	if (max_blob_size > 0) {
> +		unsigned long s;
> +		enum object_type t = sha1_object_info(obj->oid.hash, &s);
> +		assert(t == OBJ_BLOB);
> +		if (s < max_blob_size) {
> +			show_object(obj, pathname, data);
> +			return 1;
> +		}
> +	}
> +
> +	/*
> +	 * TODO
> +	 * TODO (Provisionally) add this blob to the omitted blob list.
> +	 * TODO

As for the omitted blob list itself, you can see from my patch [2] that I
used a hashmap, but the decorate.h functionality might work too (I
haven't looked into that in detail though).

[2] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-22 21:45   ` Jonathan Tan
@ 2017-06-22 22:10     ` Jonathan Tan
  2017-06-23 17:16       ` Jeff Hostetler
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Tan @ 2017-06-22 22:10 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jeff Hostetler, git, gitster, peff, jrnieder, Jeff Hostetler

On Thu, 22 Jun 2017 14:45:26 -0700
Jonathan Tan <jonathantanmy@google.com> wrote:

> On Thu, 22 Jun 2017 20:36:13 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> > 
> > In preparation for partial/sparse clone/fetch where the
> > server is allowed to omit large/all blobs from the packfile,
> > teach traverse_commit_list() to take a blob filter-proc that
> > controls when blobs are shown and marked as SEEN.
> > 
> > Normally, traverse_commit_list() always marks visited blobs
> > as SEEN, so that the show_object() callback will never see
> > duplicates.  Since a single blob OID may be referenced by
> > multiple pathnames, we may not be able to decide if a blob
> > should be excluded until later pathnames have been traversed.
> > With the filter-proc, the automatic deduping is disabled.
> 
> Comparing this approach (this patch and the next one) against mine [1],
> I see that this has the advantage of (in pack-objects) avoiding the
> invocation of add_preferred_base_object() on blobs that are filtered
> out, avoiding polluting the "to_pack" data structure with information
> that we do not need.
> 
> But it does add an extra place where blobs are filtered out (besides
> add_object_entry()), which makes it harder for the reader to keep track
> of what's going on. I took a brief look to see if filtering could be
> eliminated from add_object_entry(), but that function is called from
> many places, so I couldn't tell.
> 
> Anyway, I think both approaches will work (this patch's and mine [1]).
> 
> [1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/

Also it should be mentioned somewhere that this does not cover the
bitmap case - a similar discussion should be included in one of the
patches, like I did in [1].

And looking back at my original cover letter [2], I wrote:

> This is similar to [1] except that this
[...]
> is slightly more comprehensive in
> that the read_object_list_from_stdin() codepath is also covered in
> addition to the get_object_list() codepath. (Although, to be clear,
> upload-pack always passes "--revs" and thus only needs the
> get_object_list() codepath).

(The [1] in the quote above refers to one of Jeff Hostetler's patches,
[QUOTE 1].)

The same issue applies to this patch (since
read_object_list_from_stdin() invokes add_object_entry() directly
without going through the traversal mechanism) and probably warrants at
least some description in one of the commit messages.

[1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/
[2] https://public-inbox.org/git/cover.1496361873.git.jonathantanmy@google.com/

[QUOTE 1] https://public-inbox.org/git/1488994685-37403-3-git-send-email-jeffhost@microsoft.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] pack-objects: WIP add max-blob-size filtering
  2017-06-22 21:54   ` Jonathan Tan
@ 2017-06-22 22:14     ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2017-06-22 22:14 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jeff Hostetler, git, peff, jrnieder, Jeff Hostetler

Jonathan Tan <jonathantanmy@google.com> writes:

> On Thu, 22 Jun 2017 20:36:14 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
>
>> +static signed long max_blob_size = -1;
>
> FYI Junio suggested "blob-max-bytes" when he looked at my patch [1].
>
> [1] https://public-inbox.org/git/xmqqmv9ryoym.fsf@gitster.mtv.corp.google.com/

To give credit to where credit is due, I learned this from akpm.

When you are tempted to say "size", "length", "weight", etc., if you
think of a way to phrase it with a more concrete unit, you'd end up
with a much less ambiguous name.  People can imagine max_blob_size
is counted in kilobytes, or megabytes, or whatever.  There is no
room in max_blob_bytes to be interpreted in multiple ways.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-22 22:10     ` Jonathan Tan
@ 2017-06-23 17:16       ` Jeff Hostetler
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-23 17:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jrnieder, Jeff Hostetler



On 6/22/2017 6:10 PM, Jonathan Tan wrote:
> On Thu, 22 Jun 2017 14:45:26 -0700
> Jonathan Tan <jonathantanmy@google.com> wrote:
> 
>> On Thu, 22 Jun 2017 20:36:13 +0000
>> Jeff Hostetler <git@jeffhostetler.com> wrote:
>>
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> In preparation for partial/sparse clone/fetch where the
>>> server is allowed to omit large/all blobs from the packfile,
>>> teach traverse_commit_list() to take a blob filter-proc that
>>> controls when blobs are shown and marked as SEEN.
>>>
>>> Normally, traverse_commit_list() always marks visited blobs
>>> as SEEN, so that the show_object() callback will never see
>>> duplicates.  Since a single blob OID may be referenced by
>>> multiple pathnames, we may not be able to decide if a blob
>>> should be excluded until later pathnames have been traversed.
>>> With the filter-proc, the automatic deduping is disabled.
>>
>> Comparing this approach (this patch and the next one) against mine [1],
>> I see that this has the advantage of (in pack-objects) avoiding the
>> invocation of add_preferred_base_object() on blobs that are filtered
>> out, avoiding polluting the "to_pack" data structure with information
>> that we do not need.
>>
>> But it does add an extra place where blobs are filtered out (besides
>> add_object_entry()), which makes it harder for the reader to keep track
>> of what's going on. I took a brief look to see if filtering could be
>> eliminated from add_object_entry(), but that function is called from
>> many places, so I couldn't tell.
>>
>> Anyway, I think both approaches will work (this patch's and mine [1]).
>>
>> [1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/
> 
> Also it should be mentioned somewhere that this does not cover the
> bitmap case - a similar discussion should be included in one of the
> patches, like I did in [1].
> 
> And looking back at my original cover letter [2], I wrote:
> 
>> This is similar to [1] except that this
> [...]
>> is slightly more comprehensive in
>> that the read_object_list_from_stdin() codepath is also covered in
>> addition to the get_object_list() codepath. (Although, to be clear,
>> upload-pack always passes "--revs" and thus only needs the
>> get_object_list() codepath).
> 
> (The [1] in the quote above refers to one of Jeff Hostetler's patches,
> [QUOTE 1].)
> 
> The same issue applies to this patch (since
> read_object_list_from_stdin() invokes add_object_entry() directly
> without going through the traversal mechanism) and probably warrants at
> least some description in one of the commit messages.
> 
> [1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@google.com/
> [2] https://public-inbox.org/git/cover.1496361873.git.jonathantanmy@google.com/
> 
> [QUOTE 1] https://public-inbox.org/git/1488994685-37403-3-git-send-email-jeffhost@microsoft.com/
> 


Thanks for the quick feedback.  I'll dig into each of these comments
as I work on my next draft.

Jeff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
  2017-06-22 21:45   ` Jonathan Tan
@ 2017-06-28 16:23   ` Junio C Hamano
  2017-06-28 17:13     ` Jeff Hostetler
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2017-06-28 16:23 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, jrnieder, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> diff --git a/list-objects.c b/list-objects.c
> index f3ca6aa..c9ca81c 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
>  		die("bad blob object");
>  	if (obj->flags & (UNINTERESTING | SEEN))
>  		return;
> -	obj->flags |= SEEN;
>  
>  	pathlen = path->len;
>  	strbuf_addstr(path, name);
> -	show(obj, path->buf, cb_data);
> +	if (!filter_blob) {
> +		/*
> +		 * Normal processing is to imediately dedup blobs
> +		 * during commit traversal, regardless of how many
> +		 * times it appears in a single or multiple commits,
> +		 * so we always set SEEN.
> +		 */
> +		obj->flags |= SEEN;
> +		show(obj, path->buf, cb_data);
> +	} else {
> +		/*
> +		 * Use the filter-proc to decide whether to show
> +		 * the blob.  We only set SEEN if requested.  For
> +		 * example, this could be used to omit a specific
> +		 * blob until it appears with a ".git*" entryname.
> +		 */
> +		if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
> +			obj->flags |= SEEN;
> +	}

This somehow looks a bit surprising organization and division of
responsibility.  I would have expected

	if (!filter_blob || 
	    filter_blob(obj, path->buf, &path->buf[pathlen], cb_data) {
		obj->flags |= SEEN;
		show(obj, path->buf, cb_data);
	}

i.e. making the filter function responsible for only making a
decision to include or exclude, not giving it a chance to decide to
"show" anything different.

> @@ -67,6 +85,7 @@ static void process_gitlink(struct rev_info *revs,
>  static void process_tree(struct rev_info *revs,
>  			 struct tree *tree,
>  			 show_object_fn show,
> +			 filter_blob_fn filter_blob,
>  			 struct strbuf *base,
>  			 const char *name,
>  			 void *cb_data)
> @@ -111,7 +130,7 @@ static void process_tree(struct rev_info *revs,
>  		if (S_ISDIR(entry.mode))
>  			process_tree(revs,
>  				     lookup_tree(entry.oid->hash),
> -				     show, base, entry.path,
> +				     show, filter_blob, base, entry.path,
>  				     cb_data);
>  		else if (S_ISGITLINK(entry.mode))
>  			process_gitlink(revs, entry.oid->hash,

I wonder if we'll need filter_tree_fn in the future in this
codepath.  When somebody wants to do a "narrow fetch/clone", would
the approach taken by this series, i.e. decide not to show certain
objects during the "rev-list --objects" traversal, a good precedent
to follow?  Would this approach be a good foundation to build on
such a future?

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-28 16:23   ` Junio C Hamano
@ 2017-06-28 17:13     ` Jeff Hostetler
  2017-06-28 17:54       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Hostetler @ 2017-06-28 17:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jonathantanmy, jrnieder, Jeff Hostetler



On 6/28/2017 12:23 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> diff --git a/list-objects.c b/list-objects.c
>> index f3ca6aa..c9ca81c 100644
>> --- a/list-objects.c
>> +++ b/list-objects.c
>> @@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
>>   		die("bad blob object");
>>   	if (obj->flags & (UNINTERESTING | SEEN))
>>   		return;
>> -	obj->flags |= SEEN;
>>   
>>   	pathlen = path->len;
>>   	strbuf_addstr(path, name);
>> -	show(obj, path->buf, cb_data);
>> +	if (!filter_blob) {
>> +		/*
>> +		 * Normal processing is to imediately dedup blobs
>> +		 * during commit traversal, regardless of how many
>> +		 * times it appears in a single or multiple commits,
>> +		 * so we always set SEEN.
>> +		 */
>> +		obj->flags |= SEEN;
>> +		show(obj, path->buf, cb_data);
>> +	} else {
>> +		/*
>> +		 * Use the filter-proc to decide whether to show
>> +		 * the blob.  We only set SEEN if requested.  For
>> +		 * example, this could be used to omit a specific
>> +		 * blob until it appears with a ".git*" entryname.
>> +		 */
>> +		if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
>> +			obj->flags |= SEEN;
>> +	}
> 
> This somehow looks a bit surprising organization and division of
> responsibility.  I would have expected
> 
> 	if (!filter_blob ||
> 	    filter_blob(obj, path->buf, &path->buf[pathlen], cb_data) {
> 		obj->flags |= SEEN;
> 		show(obj, path->buf, cb_data);
> 	}
> 
> i.e. making the filter function responsible for only making a
> decision to include or exclude, not giving it a chance to decide to
> "show" anything different.

Yes, my logic was a little confusing there.  Jonathan Tan said
something similar the other day.  I have a new version that I'm
working on now that looks like this:

	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
	...
	if (filter)
		r = filter(obj, path->buf, ...
	if (r & LOFR_MARK_SEEN)
		obj->flags |= SEEN;
	if (r & LOFR_SHOW)
		show(obj, path->buf, cb_data);

I'm generalizing it a little to let the filter return 2 flags:
() SEEN to indicate that the filter doesn't want to see it again
() SHOW to include the object in the result.
These let filters do "hard" and "provisional" omits.  (This will
make more sense later when I get my patch cleaned up.)


>> @@ -67,6 +85,7 @@ static void process_gitlink(struct rev_info *revs,
>>   static void process_tree(struct rev_info *revs,
>>   			 struct tree *tree,
>>   			 show_object_fn show,
>> +			 filter_blob_fn filter_blob,
>>   			 struct strbuf *base,
>>   			 const char *name,
>>   			 void *cb_data)
>> @@ -111,7 +130,7 @@ static void process_tree(struct rev_info *revs,
>>   		if (S_ISDIR(entry.mode))
>>   			process_tree(revs,
>>   				     lookup_tree(entry.oid->hash),
>> -				     show, base, entry.path,
>> +				     show, filter_blob, base, entry.path,
>>   				     cb_data);
>>   		else if (S_ISGITLINK(entry.mode))
>>   			process_gitlink(revs, entry.oid->hash,
> 
> I wonder if we'll need filter_tree_fn in the future in this
> codepath.  When somebody wants to do a "narrow fetch/clone", would
> the approach taken by this series, i.e. decide not to show certain
> objects during the "rev-list --objects" traversal, a good precedent
> to follow?  Would this approach be a good foundation to build on
> such a future?

Yes, I'm including similar logic inside process_tree() to allow that
and let the filter know about entering and leaving each tree.  So we
only need one filter-proc to handle a particular strategy and it will
handle both tree and blob objects.

I want to be able to use this mechanism to do narrow clone/fetch
using such a filter-proc and a sparse-checkout-like spec.

Thanks,
Jeff



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list
  2017-06-28 17:13     ` Jeff Hostetler
@ 2017-06-28 17:54       ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2017-06-28 17:54 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, jrnieder, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> Yes, my logic was a little confusing there.  Jonathan Tan said
> something similar the other day.  I have a new version that I'm
> working on now that looks like this:
>
> 	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
> 	...
> 	if (filter)
> 		r = filter(obj, path->buf, ...
> 	if (r & LOFR_MARK_SEEN)
> 		obj->flags |= SEEN;
> 	if (r & LOFR_SHOW)
> 		show(obj, path->buf, cb_data);
>
> I'm generalizing it a little to let the filter return 2 flags:
> () SEEN to indicate that the filter doesn't want to see it again
> () SHOW to include the object in the result.
> These let filters do "hard" and "provisional" omits.  (This will
> make more sense later when I get my patch cleaned up.)

It is not immediately obvious to me, especially without seeing the
actual patch, why MARK_SEEN is needed.  Especially given that I
think a call to show() must set obj->flags |= SEEN anyway to avoid
duplicate output, with or without the objects-filter mechanism.

But that question can and should wait.

> Yes, I'm including similar logic inside process_tree() to allow that
> and let the filter know about entering and leaving each tree.  So we
> only need one filter-proc to handle a particular strategy and it will
> handle both tree and blob objects.
>
> I want to be able to use this mechanism to do narrow clone/fetch
> using such a filter-proc and a sparse-checkout-like spec.

Good to know ;-).

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-06-28 17:55 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-22 20:36 [PATCH 0/3] WIP list-objects and pack-objects for partial clone Jeff Hostetler
2017-06-22 20:36 ` [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list Jeff Hostetler
2017-06-22 21:45   ` Jonathan Tan
2017-06-22 22:10     ` Jonathan Tan
2017-06-23 17:16       ` Jeff Hostetler
2017-06-28 16:23   ` Junio C Hamano
2017-06-28 17:13     ` Jeff Hostetler
2017-06-28 17:54       ` Junio C Hamano
2017-06-22 20:36 ` [PATCH 2/3] pack-objects: WIP add max-blob-size filtering Jeff Hostetler
2017-06-22 21:54   ` Jonathan Tan
2017-06-22 22:14     ` Junio C Hamano
2017-06-22 20:36 ` [PATCH 3/3] pack-objects: add t5317 to test max-blob-size Jeff Hostetler

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).