[PATCH 0/7] cat-file --batch-check performance improvements

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/7] cat-file --batch-check performance improvements
@ 2013-07-12  6:15 Jeff King
  2013-07-12  6:20 ` [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode Jeff King
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:15 UTC (permalink / raw)
  To: git

In my earlier series introducing "git cat-file --batch-check=<format>",
found here:

  http://thread.gmane.org/gmane.comp.version-control.git/229761/focus=230041

I spent a little time optimizing revindex generation, and measured by
requesting information on a single object from a large repository. This
series takes the next logical step: requesting a large number of objects
from a large repository.

There are two major optimizations here:

  1. Avoiding extra ref lookups due to the warning in 798c35f (get_sha1:
     warn about full or short object names that look like refs,
     2013-05-29).

  2. Avoiding extra work for delta type resolution when the user has not
     asked for %(objecttype).

I prepared the series on top of jk/in-pack-size-measurement, and
certainly optimization 2 is pointless without it (before that topic,
--batch-check always printed the type).

However, the first optimization affects regular --batch-check, and
represents a much more serious performance regression. It looks like
798c35f is in master, but hasn't been released yet, so assuming these
topics graduate before the next release, it should be OK. But if not, we
should consider pulling the first patch out and applying it (or
something like it) separately.

The results for running (in linux.git):

  $ git rev-list --objects --all >objects
  $ git cat-file --batch-check='%(objectsize:disk)' <objects >/dev/null

are:

        before     after
  real  1m17.143s  0m7.205s
  user  0m27.684s  0m6.580s
  sys   0m49.320s  0m0.608s

Now, _most_ of that speedup is coming from the first patch, and it's
quite trivial. The rest of the patches involve a lot of refactoring, and
only manage to eke out one more second of performance, so it may not be
worth it (though I think the result actually cleans up the
sha1_object_info_extended interface a bit, and is worth it). Individual
timings are in the commit messages.

The patches are:

  [1/7]: cat-file: disable object/refname ambiguity check for batch mode

Optimization 1.

  [2/7]: sha1_object_info_extended: rename "status" to "type"
  [3/7]: sha1_loose_object_info: make type lookup optional
  [4/7]: packed_object_info: hoist delta type resolution to helper
  [5/7]: packed_object_info: make type lookup optional
  [6/7]: sha1_object_info_extended: make type calculation optional

Optimization 2.

  [7/7]: sha1_object_info_extended: pass object_info to helpers

Optional cleanup.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
@ 2013-07-12  6:20 ` Jeff King
  2013-07-12  8:47   ` Michael Haggerty
  2013-07-12  6:21 ` [PATCH 2/7] sha1_object_info_extended: rename "status" to "type" Jeff King
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:20 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy, Michael Haggerty

A common use of "cat-file --batch-check" is to feed a list
of objects from "rev-list --objects" or a similar command.
In this instance, all of our input objects are 40-byte sha1
ids. However, cat-file has always allowed arbitrary revision
specifiers, and feeds the result to get_sha1().

Fortunately, get_sha1() recognizes a 40-byte sha1 before
doing any hard work trying to look up refs, meaning this
scenario should end up spending very little time converting
the input into an object sha1. However, since 798c35f
(get_sha1: warn about full or short object names that look
like refs, 2013-05-29), when we encounter this case, we
spend the extra effort to do a refname lookup anyway, just
to print a warning. This is further exacerbated by ca91993
(get_packed_ref_cache: reload packed-refs file when it
changes, 2013-06-20), which makes individual ref lookup more
expensive by requiring a stat() of the packed-refs file for
each missing ref.

With no patches, this is the time it takes to run:

  $ git rev-list --objects --all >objects
  $ time git cat-file --batch-check='%(objectname)' <objects

on the linux.git repository:

  real    1m13.494s
  user    0m25.924s
  sys     0m47.532s

If we revert ca91993, the packed-refs up-to-date check, it
gets a little better:

  real    0m54.697s
  user    0m21.692s
  sys     0m32.916s

but we are still spending quite a bit of time on ref lookup
(and we would not want to revert that patch, anyway, which
has correctness issues).  If we revert 798c35f, disabling
the warning entirely, we get a much more reasonable time:

  real    0m7.452s
  user    0m6.836s
  sys     0m0.608s

This patch does the moral equivalent of this final case (and
gets similar speedups). We introduce a global flag that
callers of get_sha1() can use to avoid paying the price for
the warning.

Signed-off-by: Jeff King <peff@peff.net>
---
The solution feels a little hacky, but I'm not sure there is a better
one short of reverting the warning entirely.

We could also tie it to warn_ambiguous_refs (or add another config
variable), but I don't think that makes sense. It is not about "do I
care about ambiguities in this repository", but rather "I am going to
do a really large number of sha1 resolutions, and I do not want to pay
the price in this particular code path for the warning").

There may be other sites that resolve a large number of refs and run
into this, but I couldn't think of any. Doing for_each_ref would not
have the same problem, as we already know they are refs there.

 builtin/cat-file.c |  9 +++++++++
 cache.h            |  1 +
 environment.c      |  1 +
 sha1_name.c        | 14 ++++++++------
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 0e64b41..fe5c77f 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -266,6 +266,15 @@ static int batch_objects(struct batch_options *opt)
 	strbuf_expand(&buf, opt->format, expand_format, &data);
 	data.mark_query = 0;

+	/*
+	 * We are going to call get_sha1 on a potentially very large number of
+	 * objects. In most large cases, these will be actual object sha1s. The
+	 * cost to double-check that each one is not also a ref (just so we can
+	 * warn) ends up dwarfing the actual cost of the object lookups
+	 * themselves. We can work around it by just turning off the warning.
+	 */
+	warn_on_object_refname_ambiguity = 0;
+
 	while (strbuf_getline(&buf, stdin, '\n') != EOF) {
 		char *p;
 		int error;
diff --git a/cache.h b/cache.h
index 2d06169..c1fd82c 100644
--- a/cache.h
+++ b/cache.h
@@ -575,6 +575,7 @@ extern int warn_ambiguous_refs;
 extern int prefer_symlink_refs;
 extern int log_all_ref_updates;
 extern int warn_ambiguous_refs;
+extern int warn_on_object_refname_ambiguity;
 extern int shared_repository;
 extern const char *apply_default_whitespace;
 extern const char *apply_default_ignorewhitespace;
diff --git a/environment.c b/environment.c
index 0cb67b2..5398c36 100644
--- a/environment.c
+++ b/environment.c
@@ -22,6 +22,7 @@ int warn_ambiguous_refs = 1;
 int is_bare_repository_cfg = -1; /* unspecified */
 int log_all_ref_updates = -1; /* unspecified */
 int warn_ambiguous_refs = 1;
+int warn_on_object_refname_ambiguity = 1;
 int repository_format_version;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
diff --git a/sha1_name.c b/sha1_name.c
index 90419ef..6f8812a 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -452,13 +452,15 @@ static int get_sha1_basic(const char *str, int len, unsigned char *sha1)
 	int at, reflog_len, nth_prior = 0;

 	if (len == 40 && !get_sha1_hex(str, sha1)) {
-		refs_found = dwim_ref(str, len, tmp_sha1, &real_ref);
-		if (refs_found > 0 && warn_ambiguous_refs) {
-			warning(warn_msg, len, str);
-			if (advice_object_name_warning)
-				fprintf(stderr, "%s\n", _(object_name_msg));
+		if (warn_on_object_refname_ambiguity) {
+			refs_found = dwim_ref(str, len, tmp_sha1, &real_ref);
+			if (refs_found > 0 && warn_ambiguous_refs) {
+				warning(warn_msg, len, str);
+				if (advice_object_name_warning)
+					fprintf(stderr, "%s\n", _(object_name_msg));
+			}
+			free(real_ref);
 		}
-		free(real_ref);
 		return 0;
 	}

-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/7] sha1_object_info_extended: rename "status" to "type"
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
  2013-07-12  6:20 ` [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode Jeff King
@ 2013-07-12  6:21 ` Jeff King
  2013-07-12  6:30 ` [PATCH 3/7] sha1_loose_object_info: make type lookup optional Jeff King
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:21 UTC (permalink / raw)
  To: git

The value we get from each low-level object_info function
(e.g., loose, packed) is actually the object type (or -1 for
error). Let's explicitly call it "type", which will make
further refactorings easier to read.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1_file.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 4c2365f..e826066 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2394,7 +2394,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 {
 	struct cached_object *co;
 	struct pack_entry e;
-	int status, rtype;
+	int type, rtype;
 
 	co = find_cached_object(sha1);
 	if (co) {
@@ -2408,23 +2408,23 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 
 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		status = sha1_loose_object_info(sha1, oi->sizep, oi->disk_sizep);
-		if (status >= 0) {
+		type = sha1_loose_object_info(sha1, oi->sizep, oi->disk_sizep);
+		if (type >= 0) {
 			oi->whence = OI_LOOSE;
-			return status;
+			return type;
 		}
 
 		/* Not a loose object; someone else may have just packed it. */
 		reprepare_packed_git();
 		if (!find_pack_entry(sha1, &e))
-			return status;
+			return type;
 	}
 
-	status = packed_object_info(e.p, e.offset, oi->sizep, &rtype,
-				    oi->disk_sizep);
-	if (status < 0) {
+	type = packed_object_info(e.p, e.offset, oi->sizep, &rtype,
+				  oi->disk_sizep);
+	if (type < 0) {
 		mark_bad_packed_object(e.p, sha1);
-		status = sha1_object_info_extended(sha1, oi);
+		type = sha1_object_info_extended(sha1, oi);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
 	} else {
@@ -2435,7 +2435,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 					 rtype == OBJ_OFS_DELTA);
 	}
 
-	return status;
+	return type;
 }
 
 int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/7] sha1_loose_object_info: make type lookup optional
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
  2013-07-12  6:20 ` [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode Jeff King
  2013-07-12  6:21 ` [PATCH 2/7] sha1_object_info_extended: rename "status" to "type" Jeff King
@ 2013-07-12  6:30 ` Jeff King
  2013-07-12  6:31 ` [PATCH 4/7] packed_object_info: hoist delta type resolution to helper Jeff King
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:30 UTC (permalink / raw)
  To: git

Until recently, the only items to request from
sha1_object_info_extended were type and size. This meant
that we always had to open a loose object file to determine
one or the other.  But with the addition of the disk_size
query, it's possible that we can fulfill the query without
even opening the object file at all. However, since the
function interface always returns the type, we have no way
of knowing whether the caller cares about it or not.

This patch only modified sha1_loose_object_info to make type
lookup optional using an out-parameter, similar to the way
the size is handled (and the return value is "0" or "-1" for
success or error, respectively).

There should be no functional change yet, though, as
sha1_object_info_extended, the only caller, will always ask
for a type.

Signed-off-by: Jeff King <peff@peff.net>
---
Obviously the end goal is to have sha1_object_info_extended do this
optionally, too (which happens in patch 6).

I'm not too happy about the stat_sha1_file function, which is almost
identical to open_sha1_file (and which in turn is almost the same thing
as has_loose_object). They all do:

  try operation X on sha1_file_name(sha1)
  prepare_alt_odb();
  foreach alt_odb
    try operation X on alt_odb/sha1_file_name(sha1)

Unfortunately it's hard to do this kind of factoring out in C, because
the argument and return types for operation X are different in these
cases; you are stuck with providing callback function that takes a void
pointer to some operation-specific data. The boilerplate ends up worse
than the repeated code.

Another solution would be to have a "find the file for loose object Y"
function, and then just do operation X on that. But since X is a
filesystem operation in each case, you do not want to lose the atomicity
of performing the operation directly (not to mention incurring the cost
of an extra stat() on each file).

So I am open to clever refactoring suggestions.

 sha1_file.c | 48 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index e826066..39e7313 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1306,6 +1306,26 @@ static int git_open_noatime(const char *name)
 	}
 }

+static int stat_sha1_file(const unsigned char *sha1, struct stat *st)
+{
+	char *name = sha1_file_name(sha1);
+	struct alternate_object_database *alt;
+
+	if (!lstat(name, st))
+		return 0;
+
+	prepare_alt_odb();
+	errno = ENOENT;
+	for (alt = alt_odb_list; alt; alt = alt->next) {
+		name = alt->name;
+		fill_sha1_path(name, sha1);
+		if (!lstat(alt->base, st))
+			return 0;
+	}
+
+	return -1;
+}
+
 static int open_sha1_file(const unsigned char *sha1)
 {
 	int fd;
@@ -2363,7 +2383,9 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,

 }

-static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *sizep,
+static int sha1_loose_object_info(const unsigned char *sha1,
+				  enum object_type *typep,
+				  unsigned long *sizep,
 				  unsigned long *disk_sizep)
 {
 	int status;
@@ -2372,6 +2394,20 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	git_zstream stream;
 	char hdr[32];

+	/*
+	 * If we don't care about type or size, then we don't
+	 * need to look inside the object at all.
+	 */
+	if (!typep && !sizep) {
+		if (disk_sizep) {
+			struct stat st;
+			if (stat_sha1_file(sha1, &st) < 0)
+				return -1;
+			*disk_sizep = st.st_size;
+		}
+		return 0;
+	}
+
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
@@ -2386,7 +2422,9 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 		*sizep = size;
 	git_inflate_end(&stream);
 	munmap(map, mapsize);
-	return status;
+	if (typep)
+		*typep = status;
+	return 0;
 }

 /* returns enum object_type or negative */
@@ -2408,8 +2446,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)

 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		type = sha1_loose_object_info(sha1, oi->sizep, oi->disk_sizep);
-		if (type >= 0) {
+		if (!sha1_loose_object_info(sha1, &type,
+					    oi->sizep, oi->disk_sizep)) {
 			oi->whence = OI_LOOSE;
 			return type;
 		}
@@ -2417,7 +2455,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 		/* Not a loose object; someone else may have just packed it. */
 		reprepare_packed_git();
 		if (!find_pack_entry(sha1, &e))
-			return type;
+			return -1;
 	}

 	type = packed_object_info(e.p, e.offset, oi->sizep, &rtype,
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/7] packed_object_info: hoist delta type resolution to helper
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
                   ` (2 preceding siblings ...)
  2013-07-12  6:30 ` [PATCH 3/7] sha1_loose_object_info: make type lookup optional Jeff King
@ 2013-07-12  6:31 ` Jeff King
  2013-07-12  6:32 ` [PATCH 5/7] packed_object_info: make type lookup optional Jeff King
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:31 UTC (permalink / raw)
  To: git

To calculate the type of a packed object, we must walk down
its delta chain until we hit a true base object with a real
type. Most of the code in packed_object_info is for handling
this case.

Let's hoist it out into a separate helper function, which
will make it easier to make the type-lookup optional in the
future (and keep our indentation level sane).

Signed-off-by: Jeff King <peff@peff.net>
---
Annoyingly, since the part we are moving comprises the bulk
of the function, the diff tends to show the opposite of what
actually happened: it looks like we renamed the function to its helper
name, and moved the other parts to a new function, which happens to have
the same name as the old one. So read the diff with care. :)

 sha1_file.c | 93 +++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 53 insertions(+), 40 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 39e7313..6f4aff3 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1713,52 +1713,21 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	return type;
 }
 
-
 #define POI_STACK_PREALLOC 64
 
-static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      unsigned long *sizep, int *rtype,
-			      unsigned long *disk_sizep)
+static enum object_type packed_to_object_type(struct packed_git *p,
+					      off_t obj_offset,
+					      enum object_type type,
+					      struct pack_window **w_curs,
+					      off_t curpos)
 {
-	struct pack_window *w_curs = NULL;
-	unsigned long size;
-	off_t curpos = obj_offset;
-	enum object_type type;
 	off_t small_poi_stack[POI_STACK_PREALLOC];
 	off_t *poi_stack = small_poi_stack;
 	int poi_stack_nr = 0, poi_stack_alloc = POI_STACK_PREALLOC;
 
-	type = unpack_object_header(p, &w_curs, &curpos, &size);
-
-	if (rtype)
-		*rtype = type; /* representation type */
-
-	if (sizep) {
-		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
-			off_t tmp_pos = curpos;
-			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
-							   type, obj_offset);
-			if (!base_offset) {
-				type = OBJ_BAD;
-				goto out;
-			}
-			*sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*sizep == 0) {
-				type = OBJ_BAD;
-				goto out;
-			}
-		} else {
-			*sizep = size;
-		}
-	}
-
-	if (disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*disk_sizep = revidx[1].offset - obj_offset;
-	}
-
 	while (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 		off_t base_offset;
+		unsigned long size;
 		/* Push the object we're going to leave behind */
 		if (poi_stack_nr >= poi_stack_alloc && poi_stack == small_poi_stack) {
 			poi_stack_alloc = alloc_nr(poi_stack_nr);
@@ -1769,11 +1738,11 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 		}
 		poi_stack[poi_stack_nr++] = obj_offset;
 		/* If parsing the base offset fails, just unwind */
-		base_offset = get_delta_base(p, &w_curs, &curpos, type, obj_offset);
+		base_offset = get_delta_base(p, w_curs, &curpos, type, obj_offset);
 		if (!base_offset)
 			goto unwind;
 		curpos = obj_offset = base_offset;
-		type = unpack_object_header(p, &w_curs, &curpos, &size);
+		type = unpack_object_header(p, w_curs, &curpos, &size);
 		if (type <= OBJ_NONE) {
 			/* If getting the base itself fails, we first
 			 * retry the base, otherwise unwind */
@@ -1800,7 +1769,6 @@ out:
 out:
 	if (poi_stack != small_poi_stack)
 		free(poi_stack);
-	unuse_pack(&w_curs);
 	return type;
 
 unwind:
@@ -1814,6 +1782,51 @@ unwind:
 	goto out;
 }
 
+static int packed_object_info(struct packed_git *p, off_t obj_offset,
+			      unsigned long *sizep, int *rtype,
+			      unsigned long *disk_sizep)
+{
+	struct pack_window *w_curs = NULL;
+	unsigned long size;
+	off_t curpos = obj_offset;
+	enum object_type type;
+
+	type = unpack_object_header(p, &w_curs, &curpos, &size);
+
+	if (rtype)
+		*rtype = type; /* representation type */
+
+	if (sizep) {
+		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
+			off_t tmp_pos = curpos;
+			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
+							   type, obj_offset);
+			if (!base_offset) {
+				type = OBJ_BAD;
+				goto out;
+			}
+			*sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*sizep == 0) {
+				type = OBJ_BAD;
+				goto out;
+			}
+		} else {
+			*sizep = size;
+		}
+	}
+
+	if (disk_sizep) {
+		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
+		*disk_sizep = revidx[1].offset - obj_offset;
+	}
+
+	type = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
+
+out:
+	unuse_pack(&w_curs);
+	return type;
+}
+
 static void *unpack_compressed_entry(struct packed_git *p,
 				    struct pack_window **w_curs,
 				    off_t curpos,
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/7] packed_object_info: make type lookup optional
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
                   ` (3 preceding siblings ...)
  2013-07-12  6:31 ` [PATCH 4/7] packed_object_info: hoist delta type resolution to helper Jeff King
@ 2013-07-12  6:32 ` Jeff King
  2013-07-12  6:34 ` [PATCH 6/7] sha1_object_info_extended: make type calculation optional Jeff King
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:32 UTC (permalink / raw)
  To: git

Currently, packed_object_info can save some work by not
calculating the size or disk_size of the object if the
caller is not interested. However, it always calculates the
true object type, whether the caller cares or not, and only
optionally returns the easy-to-get "representation type".

Let's swap these types. The function will now return the
representation type (or OBJ_BAD on failure), and will only
optionally fill in the true type.

There should be no behavior change yet, as the only caller,
sha1_object_info_extended, will always feed it a type
pointer.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1_file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 6f4aff3..2a1e230 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1783,7 +1783,7 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 }
 
 static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      unsigned long *sizep, int *rtype,
+			      enum object_type *typep, unsigned long *sizep,
 			      unsigned long *disk_sizep)
 {
 	struct pack_window *w_curs = NULL;
@@ -1791,11 +1791,12 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	off_t curpos = obj_offset;
 	enum object_type type;
 
+	/*
+	 * We always get the representation type, but only convert it to
+	 * a "real" type later if the caller is interested.
+	 */
 	type = unpack_object_header(p, &w_curs, &curpos, &size);
 
-	if (rtype)
-		*rtype = type; /* representation type */
-
 	if (sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
@@ -1820,7 +1821,13 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 		*disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	type = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
+	if (typep) {
+		*typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
+		if (*typep < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+	}
 
 out:
 	unuse_pack(&w_curs);
@@ -2471,11 +2478,11 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 			return -1;
 	}
 
-	type = packed_object_info(e.p, e.offset, oi->sizep, &rtype,
-				  oi->disk_sizep);
-	if (type < 0) {
+	rtype = packed_object_info(e.p, e.offset, &type, oi->sizep,
+				   oi->disk_sizep);
+	if (rtype < 0) {
 		mark_bad_packed_object(e.p, sha1);
-		type = sha1_object_info_extended(sha1, oi);
+		return sha1_object_info_extended(sha1, oi);
 	} else if (in_delta_base_cache(e.p, e.offset)) {
 		oi->whence = OI_DBCACHED;
 	} else {
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/7] sha1_object_info_extended: make type calculation optional
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
                   ` (4 preceding siblings ...)
  2013-07-12  6:32 ` [PATCH 5/7] packed_object_info: make type lookup optional Jeff King
@ 2013-07-12  6:34 ` Jeff King
  2013-07-12  6:37 ` [PATCH 7/7] sha1_object_info_extended: pass object_info to helpers Jeff King
  2013-07-12 17:23 ` [PATCH 0/7] cat-file --batch-check performance improvements Junio C Hamano
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:34 UTC (permalink / raw)
  To: git

Each caller of sha1_object_info_extended sets up an
object_info struct to tell the function which elements of
the object it wants to get. Until now, getting the type of
the object has always been required (and it is returned via
the return type rather than a pointer in object_info).

This can involve actually opening a loose object file to
determine its type, or following delta chains to determine a
packed file's base type. These effects produce a measurable
slow-down when doing a "cat-file --batch-check" that does
not include %(objecttype).

This patch adds a "typep" query to struct object_info, so
that it can be optionally queried just like size and
disk_size. As a result, the return type of the function is
no longer the object type, but rather 0/-1 for success/error.

As there are only three callers total, we just fix up each
caller rather than keep a compatibility wrapper:

  1. The simpler sha1_object_info wrapper continues to
     always ask for and return the type field.

  2. The istream_source function wants to know the type, and
     so always asks for it.

  3. The cat-file batch code asks for the type only when
     %(objecttype) is part of the format string.

On linux.git, the best-of-five for running:

  $ git rev-list --objects --all >objects
  $ time git cat-file --batch-check='%(objectsize:disk)'

on a fully packed repository goes from:

  real    0m8.680s
  user    0m8.160s
  sys     0m0.512s

to:

  real    0m7.205s
  user    0m6.580s
  sys     0m0.608s

Signed-off-by: Jeff King <peff@peff.net>
---
This ends up changing the behavior of sha1_object_info_extended without
changing its function signature. Given that it is a fairly inactive area
of the code and that there are no topics in flight, I think this is OK.
But an alternative could be to add (yet another) wrapper to leave the
first two call-sites untouched.

 builtin/cat-file.c |  7 ++++---
 cache.h            |  1 +
 sha1_file.c        | 20 +++++++++++++-------
 streaming.c        |  2 +-
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index fe5c77f..163ce6c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -150,7 +150,9 @@ static void expand_atom(struct strbuf *sb, const char *atom, int len,
 		if (!data->mark_query)
 			strbuf_addstr(sb, sha1_to_hex(data->sha1));
 	} else if (is_atom("objecttype", atom, len)) {
-		if (!data->mark_query)
+		if (data->mark_query)
+			data->info.typep = &data->type;
+		else
 			strbuf_addstr(sb, typename(data->type));
 	} else if (is_atom("objectsize", atom, len)) {
 		if (data->mark_query)
@@ -229,8 +231,7 @@ static int batch_one_object(const char *obj_name, struct batch_options *opt,
 		return 0;
 	}
 
-	data->type = sha1_object_info_extended(data->sha1, &data->info);
-	if (data->type <= 0) {
+	if (sha1_object_info_extended(data->sha1, &data->info) < 0) {
 		printf("%s missing\n", obj_name);
 		fflush(stdout);
 		return 0;
diff --git a/cache.h b/cache.h
index c1fd82c..d3b770c 100644
--- a/cache.h
+++ b/cache.h
@@ -1130,6 +1130,7 @@ struct object_info {
 
 struct object_info {
 	/* Request */
+	enum object_type *typep;
 	unsigned long *sizep;
 	unsigned long *disk_sizep;
 
diff --git a/sha1_file.c b/sha1_file.c
index 2a1e230..52f7a1e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2452,24 +2452,26 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 {
 	struct cached_object *co;
 	struct pack_entry e;
-	int type, rtype;
+	int rtype;
 
 	co = find_cached_object(sha1);
 	if (co) {
+		if (oi->typep)
+			*(oi->typep) = co->type;
 		if (oi->sizep)
 			*(oi->sizep) = co->size;
 		if (oi->disk_sizep)
 			*(oi->disk_sizep) = 0;
 		oi->whence = OI_CACHED;
-		return co->type;
+		return 0;
 	}
 
 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(sha1, &type,
+		if (!sha1_loose_object_info(sha1, oi->typep,
 					    oi->sizep, oi->disk_sizep)) {
 			oi->whence = OI_LOOSE;
-			return type;
+			return 0;
 		}
 
 		/* Not a loose object; someone else may have just packed it. */
@@ -2478,7 +2480,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 			return -1;
 	}
 
-	rtype = packed_object_info(e.p, e.offset, &type, oi->sizep,
+	rtype = packed_object_info(e.p, e.offset, oi->typep, oi->sizep,
 				   oi->disk_sizep);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, sha1);
@@ -2493,15 +2495,19 @@ int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 					 rtype == OBJ_OFS_DELTA);
 	}
 
-	return type;
+	return 0;
 }
 
 int sha1_object_info(const unsigned char *sha1, unsigned long *sizep)
 {
+	enum object_type type;
 	struct object_info oi = {0};
 
+	oi.typep = &type;
 	oi.sizep = sizep;
-	return sha1_object_info_extended(sha1, &oi);
+	if (sha1_object_info_extended(sha1, &oi) < 0)
+		return -1;
+	return type;
 }
 
 static void *read_packed_sha1(const unsigned char *sha1,
diff --git a/streaming.c b/streaming.c
index cac282f..870657a 100644
--- a/streaming.c
+++ b/streaming.c
@@ -111,11 +111,11 @@ static enum input_source istream_source(const unsigned char *sha1,
 	unsigned long size;
 	int status;
 
+	oi->typep = type;
 	oi->sizep = &size;
 	status = sha1_object_info_extended(sha1, oi);
 	if (status < 0)
 		return stream_error;
-	*type = status;
 
 	switch (oi->whence) {
 	case OI_LOOSE:
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 7/7] sha1_object_info_extended: pass object_info to helpers
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
                   ` (5 preceding siblings ...)
  2013-07-12  6:34 ` [PATCH 6/7] sha1_object_info_extended: make type calculation optional Jeff King
@ 2013-07-12  6:37 ` Jeff King
  2013-07-12 17:23 ` [PATCH 0/7] cat-file --batch-check performance improvements Junio C Hamano
  7 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  6:37 UTC (permalink / raw)
  To: git

We take in a "struct object_info" which contains pointers to
storage for items the caller cares about. But then rather
than pass the whole object to the low-level loose/packed
helper functions, we pass the individual pointers.

Let's pass the whole struct instead, which will make adding
more items later easier.

Signed-off-by: Jeff King <peff@peff.net>
---
This one is an optional cleanup. The diff is quite noisy due to all of
the s/foo/oi->foo/, so it is arguable whether the result is nicer or
not. It would make later additions to object_info nicer, but I do not
plan to add any more.

It _would_ have been a nice cleanup to do at the beginning of the series
(and further diffs would not have to add extra parameters to the
function calls), but that would make the incremental "learn to do type
optionally" patches quite awkward.

So I am on the fence over this one, and do not mind too much if it gets
dropped.

 sha1_file.c | 49 ++++++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 52f7a1e..563f521 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1783,8 +1783,7 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 }
 
 static int packed_object_info(struct packed_git *p, off_t obj_offset,
-			      enum object_type *typep, unsigned long *sizep,
-			      unsigned long *disk_sizep)
+			      struct object_info *oi)
 {
 	struct pack_window *w_curs = NULL;
 	unsigned long size;
@@ -1797,7 +1796,7 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 	 */
 	type = unpack_object_header(p, &w_curs, &curpos, &size);
 
-	if (sizep) {
+	if (oi->sizep) {
 		if (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 			off_t tmp_pos = curpos;
 			off_t base_offset = get_delta_base(p, &w_curs, &tmp_pos,
@@ -1806,24 +1805,24 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 				type = OBJ_BAD;
 				goto out;
 			}
-			*sizep = get_size_from_delta(p, &w_curs, tmp_pos);
-			if (*sizep == 0) {
+			*oi->sizep = get_size_from_delta(p, &w_curs, tmp_pos);
+			if (*oi->sizep == 0) {
 				type = OBJ_BAD;
 				goto out;
 			}
 		} else {
-			*sizep = size;
+			*oi->sizep = size;
 		}
 	}
 
-	if (disk_sizep) {
+	if (oi->disk_sizep) {
 		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*disk_sizep = revidx[1].offset - obj_offset;
+		*oi->disk_sizep = revidx[1].offset - obj_offset;
 	}
 
-	if (typep) {
-		*typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
-		if (*typep < 0) {
+	if (oi->typep) {
+		*oi->typep = packed_to_object_type(p, obj_offset, type, &w_curs, curpos);
+		if (*oi->typep < 0) {
 			type = OBJ_BAD;
 			goto out;
 		}
@@ -2404,9 +2403,7 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 }
 
 static int sha1_loose_object_info(const unsigned char *sha1,
-				  enum object_type *typep,
-				  unsigned long *sizep,
-				  unsigned long *disk_sizep)
+				  struct object_info *oi)
 {
 	int status;
 	unsigned long mapsize, size;
@@ -2418,12 +2415,12 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	 * If we don't care about type or size, then we don't
 	 * need to look inside the object at all.
 	 */
-	if (!typep && !sizep) {
-		if (disk_sizep) {
+	if (!oi->typep && !oi->sizep) {
+		if (oi->disk_sizep) {
 			struct stat st;
 			if (stat_sha1_file(sha1, &st) < 0)
 				return -1;
-			*disk_sizep = st.st_size;
+			*oi->disk_sizep = st.st_size;
 		}
 		return 0;
 	}
@@ -2431,19 +2428,19 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
 		return -1;
-	if (disk_sizep)
-		*disk_sizep = mapsize;
+	if (oi->disk_sizep)
+		*oi->disk_sizep = mapsize;
 	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
 		status = error("unable to unpack %s header",
 			       sha1_to_hex(sha1));
 	else if ((status = parse_sha1_header(hdr, &size)) < 0)
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
-	else if (sizep)
-		*sizep = size;
+	else if (oi->sizep)
+		*oi->sizep = size;
 	git_inflate_end(&stream);
 	munmap(map, mapsize);
-	if (typep)
-		*typep = status;
+	if (oi->typep)
+		*oi->typep = status;
 	return 0;
 }
 
@@ -2468,8 +2465,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 
 	if (!find_pack_entry(sha1, &e)) {
 		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(sha1, oi->typep,
-					    oi->sizep, oi->disk_sizep)) {
+		if (!sha1_loose_object_info(sha1, oi)) {
 			oi->whence = OI_LOOSE;
 			return 0;
 		}
@@ -2480,8 +2476,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 			return -1;
 	}
 
-	rtype = packed_object_info(e.p, e.offset, oi->typep, oi->sizep,
-				   oi->disk_sizep);
+	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, sha1);
 		return sha1_object_info_extended(sha1, oi);
-- 
1.8.3.rc3.24.gec82cb9

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12  6:20 ` [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode Jeff King
@ 2013-07-12  8:47   ` Michael Haggerty
  2013-07-12  9:22     ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Haggerty @ 2013-07-12  8:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyễn Thái Ngọc Duy

On 07/12/2013 08:20 AM, Jeff King wrote:
> A common use of "cat-file --batch-check" is to feed a list
> of objects from "rev-list --objects" or a similar command.
> In this instance, all of our input objects are 40-byte sha1
> ids. However, cat-file has always allowed arbitrary revision
> specifiers, and feeds the result to get_sha1().
> 
> Fortunately, get_sha1() recognizes a 40-byte sha1 before
> doing any hard work trying to look up refs, meaning this
> scenario should end up spending very little time converting
> the input into an object sha1. However, since 798c35f
> (get_sha1: warn about full or short object names that look
> like refs, 2013-05-29), when we encounter this case, we
> spend the extra effort to do a refname lookup anyway, just
> to print a warning. This is further exacerbated by ca91993
> (get_packed_ref_cache: reload packed-refs file when it
> changes, 2013-06-20), which makes individual ref lookup more
> expensive by requiring a stat() of the packed-refs file for
> each missing ref.
> 
> With no patches, this is the time it takes to run:
> 
>   $ git rev-list --objects --all >objects
>   $ time git cat-file --batch-check='%(objectname)' <objects
> 
> on the linux.git repository:
> 
>   real    1m13.494s
>   user    0m25.924s
>   sys     0m47.532s
> 
> If we revert ca91993, the packed-refs up-to-date check, it
> gets a little better:
> 
>   real    0m54.697s
>   user    0m21.692s
>   sys     0m32.916s
> 
> but we are still spending quite a bit of time on ref lookup
> (and we would not want to revert that patch, anyway, which
> has correctness issues).  If we revert 798c35f, disabling
> the warning entirely, we get a much more reasonable time:
> 
>   real    0m7.452s
>   user    0m6.836s
>   sys     0m0.608s
> 
> This patch does the moral equivalent of this final case (and
> gets similar speedups). We introduce a global flag that
> callers of get_sha1() can use to avoid paying the price for
> the warning.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> The solution feels a little hacky, but I'm not sure there is a better
> one short of reverting the warning entirely.
> 
> We could also tie it to warn_ambiguous_refs (or add another config
> variable), but I don't think that makes sense. It is not about "do I
> care about ambiguities in this repository", but rather "I am going to
> do a really large number of sha1 resolutions, and I do not want to pay
> the price in this particular code path for the warning").
> 
> There may be other sites that resolve a large number of refs and run
> into this, but I couldn't think of any. Doing for_each_ref would not
> have the same problem, as we already know they are refs there.

ISTM that callers of --batch-check would usually know whether they are
feeding it SHA-1s or arbitrary object specifiers.  (And in most cases
when there are a large number of them, they are probably all SHA-1s).

So instead of framing this change as "skip object refname ambiguity
check" (i.e., sacrifice some correctness for performance), perhaps it
would be less hacky to offer the following new feature:

To cat-file we could add an option like "--sha1-only" or "--literal" or
"--no-dwim" (... better names are failing me) which would skip *all*
dwimming of 40-character strings.  It would also assume that any shorter
strings are abbreviated SHA-1s and fail if they are not.  This would be
a nice feature by itself ("these are object names, dammit, and don't try
to tell me differently!") and would have the additional small advantage
of speeding up lookups of abbreviated SHA-1s, which (regardless of your
patch) otherwise go through the whole DWIM process.

I understand that such an option alone would leave some old scripts
suffering the performance regression caused by the check, but in most
cases it would be easy to fix them.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12  8:47   ` Michael Haggerty
@ 2013-07-12  9:22     ` Jeff King
  2013-07-12 10:30       ` Michael Haggerty
  2013-07-15  3:45       ` Junio C Hamano
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12  9:22 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Nguyễn Thái Ngọc Duy

On Fri, Jul 12, 2013 at 10:47:46AM +0200, Michael Haggerty wrote:

> > The solution feels a little hacky, but I'm not sure there is a better
> > one short of reverting the warning entirely.
> > 
> > We could also tie it to warn_ambiguous_refs (or add another config
> > variable), but I don't think that makes sense. It is not about "do I
> > care about ambiguities in this repository", but rather "I am going to
> > do a really large number of sha1 resolutions, and I do not want to pay
> > the price in this particular code path for the warning").
> > 
> > There may be other sites that resolve a large number of refs and run
> > into this, but I couldn't think of any. Doing for_each_ref would not
> > have the same problem, as we already know they are refs there.
> 
> ISTM that callers of --batch-check would usually know whether they are
> feeding it SHA-1s or arbitrary object specifiers.  (And in most cases
> when there are a large number of them, they are probably all SHA-1s).

Thanks for bringing this up; I had meant to outline this option in the
cover letter, but forgot when posting.

If were designing cat-file from scratch, I'd consider having --batch
take just 40-hex sha1s in the first place. Since we can't do that now
for backwards compatibility, having an option is the best we can do
there.

But having to use such an option to avoid a 10x slowdown just strikes me
as ugly for two reasons:

  1. It's part of the user-visible interface, and it seems like an extra
     "wtf?" moment when somebody wonders why their script is painfully
     slow, and why they need a magic option to fix it. We're cluttering
     the interface forever.

  2. It's a sign that the refname check was put in for a specific
     performance profile (resolving one or a few refs), but didn't
     consider resolving a large number. I'm wondering what other
     performance regressions it's possible to trigger.

> So instead of framing this change as "skip object refname ambiguity
> check" (i.e., sacrifice some correctness for performance), perhaps it
> would be less hacky to offer the following new feature:

I wouldn't frame it as "correctness for performance" at all. The check
does not impact behavior at all, and is purely informational (to catch
quite a rare situation, too). I'd argue that the check simply isn't that
interesting in this case in the first place.

> To cat-file we could add an option like "--sha1-only" or "--literal" or
> "--no-dwim" (... better names are failing me) which would skip *all*
> dwimming of 40-character strings.  It would also assume that any shorter
> strings are abbreviated SHA-1s and fail if they are not.  This would be
> a nice feature by itself ("these are object names, dammit, and don't try
> to tell me differently!") and would have the additional small advantage
> of speeding up lookups of abbreviated SHA-1s, which (regardless of your
> patch) otherwise go through the whole DWIM process.

I can see in theory that somebody might want that, but I am having a
hard time thinking of a practical use.

> I understand that such an option alone would leave some old scripts
> suffering the performance regression caused by the check, but in most
> cases it would be easy to fix them.

I'm less worried about old scripts, and more worried about carrying
around a confusing option forever, especially when I expect most
invocations to use it (perhaps my experience is biased, but I use
cat-file --batch quite a lot in scripting, and it has almost always been
on the output of rev-list).

IOW, it seems like a poor default, and we are choosing it only because
of backwards compatibility. I guess another option is to switch the
default with the usual deprecation dance.

Yet another option is to consider what the check is doing, and
accomplish the same thing in a different way. The real pain is that we
are individually trying to resolve each object by hitting the filesystem
(and doing lots of silly checks on the refname format, when we know it
must be valid).

We don't actually care in this case if the ref list is up to date (we
are not trying to update or read a ref, but only know if it exists, and
raciness is OK). IOW, could we replace the dwim_ref call for the warning
with something that directly queries the ref cache?

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12  9:22     ` Jeff King
@ 2013-07-12 10:30       ` Michael Haggerty
  2013-07-15  4:23         ` Jeff King
  2013-07-15  3:45       ` Junio C Hamano
  1 sibling, 1 reply; 16+ messages in thread
From: Michael Haggerty @ 2013-07-12 10:30 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyễn Thái Ngọc Duy

On 07/12/2013 11:22 AM, Jeff King wrote:
> Yet another option is to consider what the check is doing, and
> accomplish the same thing in a different way. The real pain is that we
> are individually trying to resolve each object by hitting the filesystem
> (and doing lots of silly checks on the refname format, when we know it
> must be valid).
> 
> We don't actually care in this case if the ref list is up to date (we
> are not trying to update or read a ref, but only know if it exists, and
> raciness is OK). IOW, could we replace the dwim_ref call for the warning
> with something that directly queries the ref cache?

I think it would be quite practical to add an API something like

    struct ref_snapshot *get_ref_snapshot(const char *prefix)
    void release_ref_snapshot(struct ref_snapshot *)
    int lookup_ref(struct ref_snapshot *, const char *refname,
                   unsigned char *sha1, int *flags)

where prefix is the part of the refs tree that you want included in the
snapshot (e.g., "refs/heads") and ref_snapshot is probably opaque
outside of the refs module.

Symbolic refs, which are currently not stored in the ref_cache, would
have to be added because otherwise we would have to do all of the
lookups anyway.

I think this would be a good step to take for many reasons, including
because it would be another useful step in the direction of ref
transactions.

But with particular respect to "git cat-file", I see problems:

1. get_ref_snapshot() would have to read all loose and packed refs
within the specified subtree, because loose refs have to be read before
packed refs.  So the call would be expensive if there are a lot of loose
refs.  And DWIM wouldn't know in advance where the references might be,
so it would have to set prefix="".  If many refs are looked up, then it
would presumably be worth it.  But if only a couple of lookups are done
and there are a lot of loose refs, then using a cache would probably
slow things down.

The slowdown could be ameliorated by adding some more intelligence, for
example only populating the loose refs cache after a certain number of
lookups have already been done.

2. A "git cat-file --batch" process can be long-lived.  What guarantees
would users expect regarding its lookup results?  Currently, its ref
lookups reflect the state of the repo at the moment the commit
identifier is written into the pipe.  Using a cache like this would mean
that ref lookups would always reflect the snapshot taken at the start of
the "git cat-file" run, regardless of whether the script using it might
have added or modified some references since then.  I think this would
have to be considered a regression.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/7] cat-file --batch-check performance improvements
  2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
                   ` (6 preceding siblings ...)
  2013-07-12  6:37 ` [PATCH 7/7] sha1_object_info_extended: pass object_info to helpers Jeff King
@ 2013-07-12 17:23 ` Junio C Hamano
  2013-07-12 20:12   ` Jeff King
  7 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2013-07-12 17:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The results for running (in linux.git):
>
>   $ git rev-list --objects --all >objects
>   $ git cat-file --batch-check='%(objectsize:disk)' <objects >/dev/null

I can see how these patches are very logical avenue to grab only
on-disk footprint for large number of objects, but among the type,
payload size and on-disk footprint, I find it highly narrow niche
that a real user or script is interested _only_ in on-disk footprint
without even worrying about the type of object.

> ... (though I think the result actually cleans up the
> sha1_object_info_extended interface a bit, and is worth it).

I tend to agree, especially eyeballing the result of 7/7.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/7] cat-file --batch-check performance improvements
  2013-07-12 17:23 ` [PATCH 0/7] cat-file --batch-check performance improvements Junio C Hamano
@ 2013-07-12 20:12   ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-12 20:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Jul 12, 2013 at 10:23:34AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The results for running (in linux.git):
> >
> >   $ git rev-list --objects --all >objects
> >   $ git cat-file --batch-check='%(objectsize:disk)' <objects >/dev/null
> 
> I can see how these patches are very logical avenue to grab only
> on-disk footprint for large number of objects, but among the type,
> payload size and on-disk footprint, I find it highly narrow niche
> that a real user or script is interested _only_ in on-disk footprint
> without even worrying about the type of object.

Yeah, I agree it is a bit of a niche. However, there are other code
paths that might want only the size and not the type (e.g., we already
know the object is a blob, but want to know size before deciding how to
handle diff). But in general, I doubt the performance impact is a big
deal there. It's only measurable when you're doing millions of objects.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12  9:22     ` Jeff King
  2013-07-12 10:30       ` Michael Haggerty
@ 2013-07-15  3:45       ` Junio C Hamano
  2013-07-15  4:17         ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2013-07-15  3:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, git, Nguyễn Thái Ngọc Duy

Jeff King <peff@peff.net> writes:

>> To cat-file we could add an option like "--sha1-only" or "--literal" or
>> "--no-dwim" (... better names are failing me) which would skip *all*
>> dwimming of 40-character strings.  It would also assume that any shorter
>> strings are abbreviated SHA-1s and fail if they are not.  This would be
>> a nice feature by itself ("these are object names, dammit, and don't try
>> to tell me differently!") and would have the additional small advantage
>> of speeding up lookups of abbreviated SHA-1s, which (regardless of your
>> patch) otherwise go through the whole DWIM process.
>
> I can see in theory that somebody might want that, but I am having a
> hard time thinking of a practical use.

Would it be a good alternative to call get_sha1_hex() to catch the
most common case (reading from rev-list output, for example) and
then let the more general get_sha1() to let extended SHA-1 to be
handled?

> IOW, it seems like a poor default, and we are choosing it only because
> of backwards compatibility. I guess another option is to switch the
> default with the usual deprecation dance.

I agree that "did you mean the unreadable refname or 40-hex object?"
turned on everywhere get_sha1() is called is a very poor default.  I
wonder if we can limit it only to the end-user input somehow at the
API level.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-15  3:45       ` Junio C Hamano
@ 2013-07-15  4:17         ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-15  4:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Michael Haggerty, git, Nguyễn Thái Ngọc Duy

On Sun, Jul 14, 2013 at 08:45:37PM -0700, Junio C Hamano wrote:

> >> To cat-file we could add an option like "--sha1-only" or "--literal" or
> >> "--no-dwim" (... better names are failing me) which would skip *all*
> >> dwimming of 40-character strings.  It would also assume that any shorter
> >> strings are abbreviated SHA-1s and fail if they are not.  This would be
> >> a nice feature by itself ("these are object names, dammit, and don't try
> >> to tell me differently!") and would have the additional small advantage
> >> of speeding up lookups of abbreviated SHA-1s, which (regardless of your
> >> patch) otherwise go through the whole DWIM process.
> >
> > I can see in theory that somebody might want that, but I am having a
> > hard time thinking of a practical use.
> 
> Would it be a good alternative to call get_sha1_hex() to catch the
> most common case (reading from rev-list output, for example) and
> then let the more general get_sha1() to let extended SHA-1 to be
> handled?

For a 40-byte sha1, that _should_ be what get_sha1 does (i.e., go more
or less directly to the 40-hex code path, and return). And that's
basically what happens now, except that after we do so, we now have the
extra "oh, is it also a refname?" check.

For a shortened sha1, I don't think it would have the same behavior.
Right now, I believe the order is to treat a short sha1 as a possible
refname, and only if that fails consider it as a short sha1.

> > IOW, it seems like a poor default, and we are choosing it only because
> > of backwards compatibility. I guess another option is to switch the
> > default with the usual deprecation dance.
> 
> I agree that "did you mean the unreadable refname or 40-hex object?"
> turned on everywhere get_sha1() is called is a very poor default.  I
> wonder if we can limit it only to the end-user input somehow at the
> API level.

It is easy to do on top of my patch (just flip the default on the switch
I introduced, and turn it back on in whichever code paths are
appropriate).  But the question is: what is end-user input? Do you mean
command-line arguments to "rev-list" and friends?

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode
  2013-07-12 10:30       ` Michael Haggerty
@ 2013-07-15  4:23         ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2013-07-15  4:23 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Nguyễn Thái Ngọc Duy

On Fri, Jul 12, 2013 at 12:30:07PM +0200, Michael Haggerty wrote:

> But with particular respect to "git cat-file", I see problems:
> 
> 1. get_ref_snapshot() would have to read all loose and packed refs
> within the specified subtree, because loose refs have to be read before
> packed refs.  So the call would be expensive if there are a lot of loose
> refs.  And DWIM wouldn't know in advance where the references might be,
> so it would have to set prefix="".  If many refs are looked up, then it
> would presumably be worth it.  But if only a couple of lookups are done
> and there are a lot of loose refs, then using a cache would probably
> slow things down.
> 
> The slowdown could be ameliorated by adding some more intelligence, for
> example only populating the loose refs cache after a certain number of
> lookups have already been done.

I had assumed we would load the loose ref cache progressively by
directory. Of course that only helps avoiding "refs/foo/*" when you are
interested in "refs/heads/foo". If your "refs/heads/*" is very large,
you have to load all of it, and at some point it is cheaper to do a few
stat() calls than to actually readdir() the whole directory. On the
other hand, that is basically how packed-refs work now (if we hit any
ref, we have to load the whole file), and repos with many refs would
typically pack them all anyway.

> 2. A "git cat-file --batch" process can be long-lived.  What guarantees
> would users expect regarding its lookup results?

I hadn't really intended this for general lookups, but just to speed up
the refname warning, where being out-of-date is more acceptable (since
the warning is a purely informational hint).

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2013-07-15  4:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-12  6:15 [PATCH 0/7] cat-file --batch-check performance improvements Jeff King
2013-07-12  6:20 ` [PATCH 1/7] cat-file: disable object/refname ambiguity check for batch mode Jeff King
2013-07-12  8:47   ` Michael Haggerty
2013-07-12  9:22     ` Jeff King
2013-07-12 10:30       ` Michael Haggerty
2013-07-15  4:23         ` Jeff King
2013-07-15  3:45       ` Junio C Hamano
2013-07-15  4:17         ` Jeff King
2013-07-12  6:21 ` [PATCH 2/7] sha1_object_info_extended: rename "status" to "type" Jeff King
2013-07-12  6:30 ` [PATCH 3/7] sha1_loose_object_info: make type lookup optional Jeff King
2013-07-12  6:31 ` [PATCH 4/7] packed_object_info: hoist delta type resolution to helper Jeff King
2013-07-12  6:32 ` [PATCH 5/7] packed_object_info: make type lookup optional Jeff King
2013-07-12  6:34 ` [PATCH 6/7] sha1_object_info_extended: make type calculation optional Jeff King
2013-07-12  6:37 ` [PATCH 7/7] sha1_object_info_extended: pass object_info to helpers Jeff King
2013-07-12 17:23 ` [PATCH 0/7] cat-file --batch-check performance improvements Junio C Hamano
2013-07-12 20:12   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).